Debugging a RAC Node Eviction by reading LMBH trace file

Table of Contents

Overview

LMBH process reports any heartbeat problems of LMON, LMD, and LMSn processes
Seems LMHB tracks whether LMON, LMD, and LMSn processes has reported any wait states
If not LMHB assumes there is a problem and reports this in lmbh trace file
LMBH trace reports low memory and swap space problem
LMBH reports system average load data

Typical LMHB trace file leading to a Note Eviction due to Memory/CPU starvation

*** 2014-03-22 10:03:47.024
==============================
LMD0 (ospid: 27060) has not moved for 333 sec (1395479026.1395478693)
kjfmGCR_HBCheckAll: LMD0 (ospid: 27060) has status 2
  : Not in wait; last wait ended 325 secs ago.
  : last wait_id 811409 at 'ges remote message'.
  ==============================
  Dumping PROCESS LMD0 (ospid: 27060) States
  ==============================
  ===[ Callstack ]===

*** 2014-03-22 10:03:47.025
Process diagnostic dump for oracle@grac41.example.com (LMD0), OS id=27060,
pid: 12, proc_ser: 1, sid: 6, sess_ser: 1
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
  0.000000 secs at [ 10:03:46 ]
    NOTE: scheduling delay has not been sampled for 0.294992 secs  0.000000 secs from [ 10:03:42 - 10:03:47 ], 5 sec avg
  0.054111 secs from [ 10:03:13 - 10:03:47 ], 1 min avg
  0.314816 secs from [ 09:58:47 - 10:03:47 ], 5 min avg

*** 2014-03-22 10:04:33.237
loadavg : 96.87 81.60 42.80

*** 2014-03-22 10:04:36.619
Memory (Avail / Total) = 85.77M / 4252.24M
Swap (Avail / Total) = 105.59M /  5088.00M
skgpgcmdout: read() for cmd /bin/ps -elf | /bin/egrep 'PID | 27060' | /bin/grep -v grep timed out after 2.500 seconds
Skipping stack dump because max dump time exceeded.

-------------------------------------------------------------------------------

*** 2014-03-22 10:04:38.783
==============================
LMS0 (ospid: 27131) has not moved for 420 sec (1395479077.1395478657)
kjfmGCR_HBCheckAll: LMS0 (ospid: 27131) has status 2
  : Not in wait; last wait ended 375 secs ago.
  : last wait_id 2601414 at 'gcs remote message'.
  ==============================
  Dumping PROCESS LMS0 (ospid: 27131) States
  ==============================
  ===[ Callstack ]===

*** 2014-03-22 10:04:39.220
Process diagnostic dump for oracle@grac41.example.com (LMS0), OS id=27131,

*** 2014-03-22 10:04:41.227
pid: 13, proc_ser: 1, sid: 131, sess_ser: 1
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)

*** 2014-03-22 10:04:43.375
  0.000000 secs at [ 10:04:41 ]
    NOTE: scheduling delay has not been sampled for 1.798665 secs  0.000000 secs from [ 10:04:38 - 10:04:43 ], 5 sec avg
  0.002454 secs from [ 10:03:43 - 10:04:43 ], 1 min avg
  0.166610 secs from [ 10:00:17 - 10:04:43 ], 5 min avg
loadavg : 100.09 85.92 46.55
Memory (Avail / Total) = 3161.91M / 4252.24M
Swap (Avail / Total) = 2518.66M /  5088.00M
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 D oracle   27131     1  0  58   - - 392355 sleep_ Mar21 ?       00:07:41 ora_lms0_grac41

*** 2014-03-22 10:04:52.964
Short stack dump: ORA-32516: cannot wait for process 'Unix process pid: 27131, image: oracle@grac41.example.com (LMS0)' 
 to finish executing ORADEBUG command 'SHORT_STACK'; wait time exceeds 2820 ms

-------------------------------------------------------------------------------
Process diagnostic dump actual duration=11.990000 sec
  (max dump time=5.000000 sec)

--> LMDO ( 333 seconds )  and LMS0 ( 375 seconds ) process are not progressing for a very long time        
    LMD0 and LMS0 are in disk wait status ( verly likely due to paging/swapping activities )
    LMHB reports low memory and low swap space  around 2014-03-22 10:04:36
        *** 2014-03-22 10:04:36.619
        Memory (Avail / Total) = 85.77M / 4252.24M
        Swap (Avail / Total) = 105.59M /  5088.00M
    Interpreting Load stats     
        loadavg : 100.09 85.92 46.55
         during the last 60 s  about 100.09 threads are concurrently running ( Run queue or waiting on Disk I/0)
         during the last  5 min about 85.92 threads are concurrently running ( Run queue or waiting on Disk I/0)
         during the last 15 min about 46.55 threads are concurrently running ( Run queue or waiting on Disk I/0)
      --> Load increases 2x in the last 15 minutes !

Overview

Typical LMHB trace file leading to a Note Eviction due to Memory/CPU starvation

Leave a Reply Cancel reply