Debugging a RAC Node Eviction by Reading the LMHB Trace File

Overview

  • The LMHB process reports heartbeat problems of the LMON, LMD, and LMSn processes
  • LMHB appears to track whether the LMON, LMD, and LMSn processes have reported any wait states
  • If not, LMHB assumes there is a problem and reports it in the LMHB trace file
  • The LMHB trace reports low-memory and low-swap-space conditions
  • LMHB also reports system load average data
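To follow along on your own system, you first need the LMHB trace files. A minimal sketch for locating them, assuming the usual ADR directory layout (`diag/rdbms/<db>/<instance>/trace/<instance>_lmhb_<pid>.trc`) — adjust `diag_base` for your installation:

```python
import glob
import os

def lmhb_trace_files(diag_base="/u01/app/oracle/diag/rdbms"):
    """Return LMHB trace files beneath an assumed ADR diag base.

    The directory layout (diag/rdbms/<db>/<instance>/trace/) and the
    file name pattern <instance>_lmhb_<pid>.trc follow the common ADR
    convention; both are assumptions to adapt to your environment.
    """
    pattern = os.path.join(diag_base, "*", "*", "trace", "*_lmhb_*.trc")
    return sorted(glob.glob(pattern))
```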

Typical LMHB trace file leading to a node eviction due to memory/CPU starvation

*** 2014-03-22 10:03:47.024
==============================
LMD0 (ospid: 27060) has not moved for 333 sec (1395479026.1395478693)
kjfmGCR_HBCheckAll: LMD0 (ospid: 27060) has status 2
  : Not in wait; last wait ended 325 secs ago.
  : last wait_id 811409 at 'ges remote message'.
  ==============================
  Dumping PROCESS LMD0 (ospid: 27060) States
  ==============================
  ===[ Callstack ]===

*** 2014-03-22 10:03:47.025
Process diagnostic dump for oracle@grac41.example.com (LMD0), OS id=27060,
pid: 12, proc_ser: 1, sid: 6, sess_ser: 1
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
  0.000000 secs at [ 10:03:46 ]
    NOTE: scheduling delay has not been sampled for 0.294992 secs  0.000000 secs from [ 10:03:42 - 10:03:47 ], 5 sec avg
  0.054111 secs from [ 10:03:13 - 10:03:47 ], 1 min avg
  0.314816 secs from [ 09:58:47 - 10:03:47 ], 5 min avg

*** 2014-03-22 10:04:33.237
loadavg : 96.87 81.60 42.80

*** 2014-03-22 10:04:36.619
Memory (Avail / Total) = 85.77M / 4252.24M
Swap (Avail / Total) = 105.59M /  5088.00M
skgpgcmdout: read() for cmd /bin/ps -elf | /bin/egrep 'PID | 27060' | /bin/grep -v grep timed out after 2.500 seconds
Skipping stack dump because max dump time exceeded.

-------------------------------------------------------------------------------

*** 2014-03-22 10:04:38.783
==============================
LMS0 (ospid: 27131) has not moved for 420 sec (1395479077.1395478657)
kjfmGCR_HBCheckAll: LMS0 (ospid: 27131) has status 2
  : Not in wait; last wait ended 375 secs ago.
  : last wait_id 2601414 at 'gcs remote message'.
  ==============================
  Dumping PROCESS LMS0 (ospid: 27131) States
  ==============================
  ===[ Callstack ]===

*** 2014-03-22 10:04:39.220
Process diagnostic dump for oracle@grac41.example.com (LMS0), OS id=27131,

*** 2014-03-22 10:04:41.227
pid: 13, proc_ser: 1, sid: 131, sess_ser: 1
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)

*** 2014-03-22 10:04:43.375
  0.000000 secs at [ 10:04:41 ]
    NOTE: scheduling delay has not been sampled for 1.798665 secs  0.000000 secs from [ 10:04:38 - 10:04:43 ], 5 sec avg
  0.002454 secs from [ 10:03:43 - 10:04:43 ], 1 min avg
  0.166610 secs from [ 10:00:17 - 10:04:43 ], 5 min avg
loadavg : 100.09 85.92 46.55
Memory (Avail / Total) = 3161.91M / 4252.24M
Swap (Avail / Total) = 2518.66M /  5088.00M
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 D oracle   27131     1  0  58   - - 392355 sleep_ Mar21 ?       00:07:41 ora_lms0_grac41

*** 2014-03-22 10:04:52.964
Short stack dump: ORA-32516: cannot wait for process 'Unix process pid: 27131, image: oracle@grac41.example.com (LMS0)' 
 to finish executing ORADEBUG command 'SHORT_STACK'; wait time exceeds 2820 ms

-------------------------------------------------------------------------------
Process diagnostic dump actual duration=11.990000 sec
  (max dump time=5.000000 sec)

--> LMD0 ( stalled 333 seconds ) and LMS0 ( stalled 420 seconds ) are not progressing for a very long time
    LMD0 and LMS0 are in disk wait status ( very likely due to paging/swapping activity )
    LMHB reports low memory and low swap space around 2014-03-22 10:04:36
        *** 2014-03-22 10:04:36.619
        Memory (Avail / Total) = 85.77M / 4252.24M
        Swap (Avail / Total) = 105.59M /  5088.00M
    Interpreting the load stats
        loadavg : 100.09 85.92 46.55
         during the last  1 min about 100.09 processes were running or runnable ( run queue or waiting on disk I/O )
         during the last  5 min about  85.92 processes were running or runnable ( run queue or waiting on disk I/O )
         during the last 15 min about  46.55 processes were running or runnable ( run queue or waiting on disk I/O )
      --> Load has roughly doubled over the last 15 minutes !
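The memory/swap and loadavg readings above can be checked mechanically across many trace files. A minimal sketch, where the line formats are taken from the sample trace above; the 5% free-memory threshold and the "1-minute average at least double the 15-minute average" rule are illustrative assumptions matching this case, not Oracle-documented limits:

```python
import re

def parse_mem_line(line):
    """Parse a 'Memory (Avail / Total) = 85.77M / 4252.24M' style line
    from the LMHB trace; returns (avail_mb, total_mb)."""
    m = re.search(r"=\s*([\d.]+)M\s*/\s*([\d.]+)M", line)
    return float(m.group(1)), float(m.group(2))

def parse_loadavg_line(line):
    """Parse 'loadavg : 100.09 85.92 46.55' into the 1-, 5- and
    15-minute load averages (same semantics as /proc/loadavg)."""
    return tuple(float(x) for x in line.split(":")[1].split())

def looks_starved(mem_line, load_line, min_free_pct=5.0):
    """Flag the pattern seen in this eviction: almost no free memory
    combined with a sharply rising load average."""
    avail, total = parse_mem_line(mem_line)
    one, five, fifteen = parse_loadavg_line(load_line)
    low_memory = 100.0 * avail / total < min_free_pct
    rising_load = one > 2 * fifteen  # 1-min avg at least double 15-min avg
    return low_memory and rising_load
```

Run against the values from this trace, `looks_starved` fires on the 10:04 sample (≈2% memory free, load 100.09 vs 46.55) and stays quiet once memory recovers.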
