Log File Sync wait event due to CPU starvation based on Linux RT processes sharing the same CPU

Test system

  •    8 CPUs OEL 6.4 Oracle 11.2.0.4 RAC

Compare the 2 sessions above running on the same hardware and same CPU load

Session 1: 
Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                            Tota    Wait   % DB
Event                                 Waits Time Avg(ms)   time Wait Class
------------------------------ ------------ ---- ------- ------ ----------
log file sync                         5,007 127.      23   93.9 Commit
DB CPU                                       6.6            5.3
gc current block busy                    34    1      29     .8 Cluster
db file sequential read                  27   .2       9     .2 User I/O
gc current block 2-way                  166   .2       1     .1 Cluster
db file scattered read                   19   .1       6     .1 User I/O
gc buffer busy release                   15   .1       6     .1 Cluster
gc current block 3-way                   69   .1       1     .0 Cluster
SQL*Net message to client            10,011    0       0     .0 Network
control file sequential read            112    0       0     .0 System I/O

Session2: 
Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                            Tota    Wait   % DB
Event                                 Waits Time Avg(ms)   time Wait Class
------------------------------ ------------ ---- ------- ------ ----------
log file sync                         5,008 41.1       8   81.2 Commit
DB CPU                                       6.2           12.2
reliable message                          1    1    1031    2.0 Other
gc current block busy                    52   .9      17    1.8 Cluster
gc buffer busy release                   41   .4      10     .8 Cluster
control file sequential read            112   .1       1     .3 System I/O
db file sequential read                  15   .1       8     .2 User I/O
db file scattered read                   11   .1       7     .1 User I/O
gc cr block busy                          4   .1      14     .1 Cluster
SQL*Net message to client            10,011    0       0     .1 Network

 

  • AWR reports tells us we are running the same number of COMMITs ( 5000 )  but session2 is about 3x faster than session1.
  • Note also the overall CPU load is the same for both tests – in Session1 we only run a CPU intensive process on the same CPU as our lgwr process whereas in Session2 the CPU intensive process is running on a different CPU.

Running both sq load process and lgwr on CPU 5

Compile and run sq.c 
#include <math.h>
 double sqrt();
 main ()
 {
 int i; double z;
 printf("PID: %d\n", getpid());
 for (;;)
 {
 for (i = 0; i < 1000000; i++)
 {
 z = sqrt (i);
 }
 poll (0, 0, 1);
 }
 }
$ cc   sq.c   -o sq -lm

Attach sq process to CPU 5 and increase priority

$  taskset -c 5 sq
 PID: 12657
# renice -15 12657

Now reschedule the LGWR to CPU 5 and monitor both process with top
 $ ps -elf | egrep 'NI|ora_lgwr' |  grep -v grep
 F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
 0 S oracle    6181     1  0  80   0 - 394456 semtim 18:31 ?       00:00:01 ora_lgwr_grac41

$ taskset -pc 5 6181
 pid 6181's current affinity list: 0-7
 pid 6181's new affinity list: 5

Run the JDBC test case
 $ java UCPDemo_si 1 5000 -noseq
 Started at:  Fri Nov 15 19:12:43 CET 2013 - Number of Threads: 1 - tx_cnt:5000 - mode: 2 Debug mode: false
 -> SQL: insert into rac_perftest values (?, ?, ?, sysdate)
 Connected at: 19:12:45 to instance: grac41 at host: grac41.example.com
 Released  connection to pool - Instance: grac41 - Records inserted: 5000
 Runtime : 130 [s]


Monitor both processes using top
# top -H -p 12657,6181 ( press f and j to add CPU affinity flag P )
 Cpu(s): 15.6%us,  6.3%sy,  0.0%ni, 77.2%id,  0.4%wa,  0.0%hi,  0.5%si,  0.0%st
 PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
 12657 oracle     5 -15  6544  308  244 R 80.0  0.0   8:47.77 5 sq
 6181  oracle    20   0 1540m  32m  28m D  7.6  0.8   0:10.13 5 oracle
 --> P = 5 means both processes are running on CPU 5
     lgwr only gets around 7 % of CPU
     There is  a lot of CPU available ( 77% ) but lgwr still runs into CPU starvation
     sq process takes about 80% CPUtime from CPU 5 and runs with a higher priority as lgwr process

Running both sq load process and lgwr on different CPUs

Reschedule lgwr process from CPU 5 to CPU 7 
$ taskset -pc 7 6181 
 pid 6181's current affinity list: 5 
 pid 6181's new affinity list: 7 

Run test case:
$ java UCPDemo_si 1 5000 -noseq
 Started at:  Fri Nov 15 19:23:17 CET 2013 - Number of Threads: 1 - tx_cnt:5000 - mode: 2 Debug mode: false
 -> SQL: insert into rac_perftest values (?, ?, ?, sysdate)
 Connected at: 19:23:18 to instance: grac41 at host: grac41.example.com
 Released  connection to pool - Instance: grac41 - Records inserted: 5000
 Runtime : 45 [s]

Releated Top Output
 Cpu(s): 19.0%us,  9.0%sy,  0.0%ni, 70.6%id,  0.0%wa,  0.0%hi,  1.5%si,  0.0%st
 PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
 12657 oracle     5 -15  6544  308  244 S 81.5  0.0  17:50.60 5 sq
 6181  oracle    20   0 1540m  32m  28m D 12.0  0.8   0:26.33 7 oracle
 --> sq process and lgwr runs on different CPUs - lgwr process now takes 12% CPU
     sq process still takes about 81 % of CPU 7 but lgwr process now running on CPU 5 get enough CPU time 
     Application runtime drops from 130 s to 45 s without changing anything

 

Summary:

  • Always check that lgwr process gets enough CPU time –  total CPU shown in top output is not always the limiting factor
  • RT processes or lot of high priority process together with low number of CPUs can worthen the problem
  • First decrease priority for NON-Oracle RT processes ( resetting the priority of our sq load process fixes this problem  :  $ renice 0 12657 )
  • If this doesn’t help  try to renice LGWR oracle shadow processes ( increase priority )

 

Reference

http://kevinclosson.wordpress.com/2007/07/21/manly-men-only-use-solid-state-disk-for-redo-logging-lgwr-io-is-simple-but-not-lgwr-processing/

One thought on “Log File Sync wait event due to CPU starvation based on Linux RT processes sharing the same CPU”

  1. Attractive section of content. I just stumbled upon your blog
    and in accession capital to assert that I acquire in fact enjoyed account your blog posts.
    Any way I’ll be subscribing to your feeds and even I achievement you
    access consistently fast.

Leave a Reply

Your email address will not be published. Required fields are marked *