Display CHM data with oclumon

Used Software

  • GRID: 11.2.0.3.4
  • OEL 6.3
  • VirtualBox 4.2.14

Using oclumon to detect potential root causes for node evictions (CPU starvation)

Using oclumon to monitor a CPU-intensive application
Monitor command: $ oclumon dumpnodeview -n grac2 -last "00:15:00"
----------------------------------------
Node: grac2 Clock: '08-19-13 20.22.26' SerialNo:356
----------------------------------------
SYSTEM:
#cpus: 1 cpu: 100.0;';3:Time=08-19-13 20.22.26, CPU usage on node grac2 (100.0%) is Very High (> 90%). 
131 processes are waiting for only 1 CPUs.' cpuq: 131 physmemfree: 915244 physmemtotal: 4055440 mcache: 1858816 
swapfree: 6373372 swaptotal: 6373372 ior: 44 iow: 116 ios: 26 swpin: 0 swpout: 0 pgin: 44 pgout: 68 
netr: 16.456 netw: 22.932 procs: 273 rtprocs: 11 #fds: 17184 #sysfdlimit: 6815744 #disks: 11 #nics: 3  
nicErrors: 0
TOP CONSUMERS:
topcpu: 'mp_cpu(6822) 69.95' topprivmem: 'ologgerd(5945) 86196' topshm: 'oracle(5201) 105600' 
topfd: 'ohasd.bin(2852) 720' topthread: 'mp_cpu(6822) 129'

Summary from the above oclumon report:

  • The system runs with a single CPU at 100% CPU load
  • The program mp_cpu consumes about 70% of the CPU and runs 129 threads
  • Even after running mp_cpu for multiple hours, the RAC system does not crash with a node eviction
  • If the sample program mp_cpu is written as a realtime program ( sched_setscheduler(0, SCHED_FIFO, .. ) ), oclumon may stop working with the following errors:  Waiting upto 300 secs for backend…  CRS-9103-No data available
  • If an RT process consumes all of the CPU, oclumon may skip most of the records ( use top to monitor your system in that case )
  • For CPU problems oclumon reports: CPU usage on node grac2 (100.0%) is Very High (> 90%)
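The cpu and cpuq fields can be pulled out of a dumpnodeview SYSTEM line with a small awk filter - a sketch that assumes the field layout shown above (on a live system, pipe the oclumon output into it):

```shell
# Extract CPU load and run-queue length from an oclumon dumpnodeview SYSTEM line.
# The sample line mimics the layout shown above; on a live system pipe in:
#   oclumon dumpnodeview -n grac2 -last "00:15:00"
echo "#cpus: 1 cpu: 18.27 cpuq: 4 physmemfree: 2981396 physmemtotal: 4055440" |
awk '/#cpus:/ {
  for (i = 1; i <= NF; i++) {
    if ($i == "cpu:")  cpu  = $(i + 1)
    if ($i == "cpuq:") cpuq = $(i + 1)
  }
  printf "cpu=%s cpuq=%s\n", cpu, cpuq
}'
```

A cpuq far above #cpus ( 131 waiters on 1 CPU in the dump above ) is the signature of CPU starvation.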

Using oclumon to detect potential root causes for node evictions (low swap space)

Using oclumon to monitor a memory-leaking application
Stage 1: Normally running system - stable swapfree value of about 4.6 GByte
----------------------------------------
Node: grac1 Clock: '08-20-13 09.27.33' SerialNo:743
----------------------------------------
SYSTEM:
#cpus: 1 cpu: 18.27 cpuq: 4 physmemfree: 2981396 physmemtotal: 4055440 mcache: 341364 swapfree: 4641208 
swaptotal: 6373372 ior: 8862 iow: 567 ios: 733 swpin: 1092 swpout: 0 pgin: 4447 pgout: 293 
netr: 46.093 netw: 64.767 procs: 285 rtprocs: 11 #fds: 17184 #sysfdlimit: 6815744 #disks: 12 #nics: 3  
nicErrors: 0
TOP CONSUMERS:
topcpu: 'oracle(4622) 3.59' topprivmem: 'ologgerd(3415) 88000' topshm: 'ologgerd(3415) 59392' 
topfd: 'ohasd.bin(2901) 713' topthread: 'console-kit-dae(2639) 64'
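The swapfree trend across samples can be followed with a similar awk filter - again a sketch against the SYSTEM line format above (the two sample lines reproduce the swapfree values of the stages shown here):

```shell
# Print swapfree (KB) for each dumpnodeview SYSTEM sample.
# The here-document mirrors the SYSTEM lines shown in this section.
awk '{
  for (i = 1; i <= NF; i++)
    if ($i == "swapfree:") print "swapfree_kb=" $(i + 1)
}' <<'EOF'
#cpus: 1 cpu: 18.27 cpuq: 4 physmemfree: 2981396 swapfree: 4641208
#cpus: 1 cpu: 19.7 cpuq: 4 physmemfree: 91252 swapfree: 4103960
EOF
```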

Stage 2: A process ( mp_mem ) starts to eat up our memory - the swapfree value is decreasing
( current value 4.1 GByte )
----------------------------------------
Node: grac1 Clock: '08-20-13 09.29.38' SerialNo:768
----------------------------------------
SYSTEM:
#cpus: 1 cpu: 19.7 cpuq: 4 physmemfree: 91252 physmemtotal: 4055440 mcache: 157784 swapfree: 4103960 
swaptotal: 6373372 ior: 17994 iow: 38462 ios: 590 swpin: 486 swpout: 19465 pgin: 9112 pgout: 19619 
netr: 26.962 netw: 13.023 procs: 285 rtprocs: 11 #fds: 16928 #sysfdlimit: 6815744 #disks: 12 #nics: 3  
nicErrors: 0
TOP CONSUMERS:
topcpu: 'oracle(4622) 3.99' topprivmem: 'mp_mem(8530) 3084812' topshm: 'ologgerd(3415) 57168' 
topfd: 'ohasd.bin(2901) 714' topthread: 'mp_mem(8530) 85'

Stage 3: The OS is running out of swap space - the kernel may select a process and kill that application
Monitor command: $ oclumon dumpnodeview -n grac1 -last "00:15:00"
----------------------------------------
Node: grac1 Clock: '08-20-13 09.37.28' SerialNo:862
----------------------------------------
SYSTEM:
#cpus: 1 cpu: 18.67 cpuq: 2 physmemfree: 125904;';3:Time=08-20-13 09.37.28, 
Available memory (physmemfree 125904 KB + swapfree 34652 KB) on node grac1 is Too Low 
(< 10% of total-mem + total-swap)' physmemtotal: 4055440 mcache: 171264 swapfree: 34652 
swaptotal: 6373372 ior: 17330 iow: 3462 ios: 602 swpin: 704 swpout: 1880 
pgin: 8662 pgout: 1957 netr: 30.729 netw: 26.343 procs: 286 rtprocs: 11 #fds: 16640 
#sysfdlimit: 6815744 #disks: 12 #nics: 3  nicErrors: 0
TOP CONSUMERS:
topcpu: 'oracle(3879) 3.59' topprivmem: 'mp_mem(8530) 2966308' topshm: 'ologgerd(3415) 59212' 
topfd: 'ohasd.bin(2901) 714' topthread: 'mp_mem(8530) 129'

In the above case the program mp_mem was killed by Linux: when memory and swap are exhausted, the Linux OOM killer selects a process for termination based on its badness score.
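Whether the OOM killer was responsible can be confirmed from the kernel log. The filter below runs against a hypothetical sample log line (the message format is the kernel's "Out of memory: Kill process" entry); on a live OEL system grep /var/log/messages or dmesg instead:

```shell
# Check the kernel log for OOM kills. The sample line is a typical
# (hypothetical) syslog entry; on a live system use:
#   grep -i 'out of memory' /var/log/messages     # or: dmesg | grep -i oom
sample='Aug 20 09:37:30 grac1 kernel: Out of memory: Kill process 8530 (mp_mem) score 812 or sacrifice child'
echo "$sample" | grep -oE 'Kill process [0-9]+ \([^)]+\)'
```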


Using oclumon to detect potential root causes for node evictions (network problem)

netrr:        Average network receive rate within the current sample interval (KB per second)
netwr:        Average network write rate within the current sample interval (KB per second)
neteff:       Average effective bandwidth within the current sample interval (KB per second)
nicerrors:    Average error rate within the current sample interval (errors per second)

eth2 netrr: 21.005  netwr: 17.449  neteff: 38.454  nicerrors: 0 pktsin: 40  pktsout: 37  errsin: 0  
  errsout: 0  indiscarded: 0  outdiscarded: 0  inunicast: 39  innonunicast: 1  type: PRIVATE latency: <1
Node: grac42 Clock: '03-05-14 16.01.04' SerialNo:30728 
eth2 netrr: 14.823  netwr: 16.298  neteff: 31.121  nicerrors: 0 pktsin: 32  pktsout: 34  errsin: 0  
   errsout: 0  indiscarded: 0  outdiscarded: 0  inunicast: 30  innonunicast: 2  type: PRIVATE latency: <1

Node: grac42 Clock: '03-05-14 16.01.14' SerialNo:30730 
eth2 netrr: 0.000  netwr: 0.000  neteff: 0.000  nicerrors: 0 pktsin: 0  pktsout: 0  errsin: 0  
errsout: 0  indiscarded: 0  outdiscarded: 0  inunicast: 0  innonunicast: 0  type: PRIVATE latency: <1

Node: grac42 Clock: '03-05-14 16.01.19' SerialNo:30731 
eth2 netrr: 0.000  netwr: 0.000  neteff: 0.000  nicerrors: 0 pktsin: 0  pktsout: 0  errsin: 0  
errsout: 0  indiscarded: 0  outdiscarded: 0  inunicast: 0  innonunicast: 0  type: PRIVATE latency: <1
Node: grac42 Clock: '03-05-14 16.01.24' SerialNo:30732
  • At 03-05-14 16.01.14 the network activity on eth2, our cluster interconnect, drops to zero
  • As eth2 is our cluster interconnect, we saw an instance eviction later on
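Such a drop can be flagged automatically by filtering the eth2 samples for a zero neteff value - a sketch that assumes the nodeview format shown above:

```shell
# Flag interconnect samples where the effective bandwidth drops to zero.
# The here-document reproduces two eth2 samples in the format shown above.
awk '/eth2/ {
  for (i = 1; i <= NF; i++) if ($i == "neteff:") eff = $(i + 1)
  if (eff + 0 == 0) print "WARNING: eth2 neteff is 0 - interconnect may be down"
}' <<'EOF'
eth2 netrr: 14.823  netwr: 16.298  neteff: 31.121  nicerrors: 0  type: PRIVATE
eth2 netrr: 0.000  netwr: 0.000  neteff: 0.000  nicerrors: 0  type: PRIVATE
EOF
```

On a live system, feed it the output of oclumon dumpnodeview for the node in question and alert on the WARNING lines.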

 

Summary: Node Eviction

  • 11.2.0.3 seems to be quite stable under CPU and memory starvation ( reduced number of node evictions )
  • Out-of-memory scenarios may be handled by the Linux kernel by killing certain processes
  • Both CPU and memory starvation should be resolved asap, as cluster performance may drop dramatically
  • For memory problems oclumon reports: Available memory (physmemfree 125904 KB + swapfree 34652 KB) on node grac1 is Too Low (< 10% of total-mem + total-swap)
