Node Eviction due to a failed NHB ( Network Heartbeat )


Triggering an NHB Node Eviction on Node grac2

$ ifconfig eth1 down
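Once the test is finished, bring the interconnect back up and restart the clusterware stack on grac2 ( a minimal sketch, assuming eth1 is the private interconnect and root access; on 11.2.0.2+ the stack may also come back on its own via rebootless restart ):
# ifconfig eth1 up
# $GRID_HOME/bin/crsctl start crs
# $GRID_HOME/bin/crsctl check crs
--> crsctl check crs should report CRS, CSS and EVM online again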
You can increase the CSSD log level to 3 to trace each heartbeat message ( do this on all involved nodes ):
#  $GRID_HOME/bin/crsctl set log css CSSD:3
Set CSSD Module: CSSD  Log Level: 3
#   $GRID_HOME/bin/crsctl get log css CSSD
Get CSSD Module: CSSD  Log Level: 3
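Once tracing is no longer needed, reset the CSSD log level on all involved nodes ( level 2 is assumed to be the default here ):
#  $GRID_HOME/bin/crsctl set log css CSSD:2
#  $GRID_HOME/bin/crsctl get log css CSSD
--> should report Log Level: 2 again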

Check the current CSS timeout values
#  $GRID_HOME/bin/crsctl get css misscount
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.
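Besides misscount you may also want to check the related CSS disk heartbeat timeouts; these correspond to the LIOT and reboottime values printed by clssnmSetParamsFromConfig in the ocssd.log further down:
#  $GRID_HOME/bin/crsctl get css disktimeout
#  $GRID_HOME/bin/crsctl get css reboottime
--> default values are 200 seconds ( disktimeout ) and 3 seconds ( reboottime )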

Review CRS Logfiles ( a quick grep over these files is sketched after the list )

  •  $GRID_HOME/log/grac1/cssd/ocssd.log
  • $GRID_HOME/log/grac2/cssd/ocssd.log
  • $GRID_HOME/log/grac1/alertgrac1.log
  • $GRID_HOME/log/grac2/alertgrac2.log
  • alert_GRACE21.log
  • alert_GRACE22.log
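To locate the eviction messages quickly, the clusterware alert logs can be grepped for the CRS-16xx range shown later on ( the log locations assume the default 11.2 layout listed above; run on the node holding each log ):
$ egrep 'CRS-16(07|08|10|11|12)' $GRID_HOME/log/grac1/alertgrac1.log
$ egrep 'CRS-16(07|08|10|11|12)' $GRID_HOME/log/grac2/alertgrac2.log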

Check ocssd.log on both RAC Nodes
Node grac2

$ cat $GRID_HOME/log/grac2/cssd/ocssd.log | egrep -i 'Removal|evict|30000|network HB'
2013-08-01 15:15:10.500: [    CSSD][380536576]clssnmHBInfo: css timestmp 1375362910 500 slgtime 13000084 DTO 27750 (index=1) biggest misstime 110 NTO 28390
2013-08-02 09:52:43.287: [    CSSD][382113536]clssnmPollingThread: node grac1 (1) at 50% heartbeat fatal, removal in 14.960 seconds
2013-08-02 09:52:51.294: [    CSSD][382113536]clssnmPollingThread: node grac1 (1) at 75% heartbeat fatal, removal in 6.950 seconds
2013-08-02 09:52:55.296: [    CSSD][382113536]clssnmPollingThread: node grac1 (1) at 90% heartbeat fatal, removal in 2.950 seconds, seedhbimpd 1
2013-08-02 09:52:55.296: [    CSSD][382113536]clssnmPollingThread: node grac1 (1) at 90% heartbeat fatal, removal in 2.950 seconds, seedhbimpd 1
2013-08-02 09:52:55.297: [    CSSD][388421376]clssnmvDHBValidateNCopy: node 1, grac1, has a disk HB, but no network HB, DHB has rcfg 269544449, wrtcnt, 547793, LATS 22986504, lastSeqNo 547787, uniqueness 1375343446, timestamp 1375429975/22995684
2013-08-02 09:52:55.298: [    CSSD][389998336](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node grac1, number 1, sync 269544449, stamp 22996044
2013-08-02 09:52:55.299: [    CSSD][399468288](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node grac1, number 1, sync 269544449, stamp 22996044
2013-08-02 09:52:55.301: [    CSSD][394729216](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node grac1, number 1, sync 269544449, stamp 22996044
2013-08-02 09:52:56.297: [    CSSD][388421376]clssnmvDHBValidateNCopy: node 1, grac1, has a disk HB, but no network HB, DHB has rcfg 269544449, wrtcnt, 547799, LATS 22987504, lastSeqNo 547793, uniqueness 1375343446, timestamp 1375429976/22996684
2013-08-02 09:52:57.300: [    CSSD][388421376]clssnmvDHBValidateNCopy: node 1, grac1, has a disk HB, but no network HB, DHB has rcfg 269544449, wrtcnt, 547805, LATS 22988504, lastSeqNo 547799, uniqueness 1375343446, timestamp 1375429977/22997694
2013-08-02 09:53:23.461: [    CSSD][3078596352]clssnmSetParamsFromConfig: remote SIOT 27000, local SIOT 27000, LIOT 200000, misstime 30000, reboottime 3000, impending misstime 15000, voting file reopen delay 4000
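The clssnmPollingThread lines above show the eviction countdown at 50% / 75% / 90% of misscount. A small sketch to pull just the timestamp and the remaining time out of ocssd.log ( grac2 log location as used above ):
$ awk -F'clssnmPollingThread: ' '/heartbeat fatal/ { print substr($1,1,23), "->", $2 }' $GRID_HOME/log/grac2/cssd/ocssd.log
2013-08-02 09:52:43.287 -> node grac1 (1) at 50% heartbeat fatal, removal in 14.960 seconds
...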

$GRID_HOME/log/grac2/alertgrac2.log
[cssd(2869)]CRS-1611:Network communication with node grac1 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 6.950 seconds
2013-08-02 09:52:55.296
[cssd(2869)]CRS-1610:Network communication with node grac1 (1) missing for 90% of timeout interval.  Removal of this node from cluster in 2.950 seconds
2013-08-02 09:52:55.298
[cssd(2869)]CRS-1608:This node was evicted by node 1, grac1; details at (:CSSNM00005:) in /u01/app/11203/grid/log/grac2/cssd/ocssd.log.
2013-08-02 09:52:55.298
[cssd(2869)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11203/grid/log/grac2/cssd/ocssd.log
2013-08-02 09:52:55.298
[cssd(2869)]CRS-1652:Starting clean up of CRSD resources.
2013-08-02 09:52:55.298
[cssd(2869)]CRS-1608:This node was evicted by node 1, grac1; details at (:CSSNM00005:) in /u01/app/11203/grid/log/grac2/cssd/ocssd.log.
2013-08-02 09:52:55.301
[cssd(2869)]CRS-1608:This node was evicted by node 1, grac1; details at (:CSSNM00005:) in /u01/app/11203/grid/log/grac2/cssd/ocssd.log.
2013-08-02 09:52:56.573
[/u01/app/11203/grid/bin/oraagent.bin(3255)]CRS-5016:Process "/u01/app/11203/grid/opmn/bin/onsctli" spawned by agent "/u01/app/11203/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11203/grid/log/grac2/agent/crsd/oraagent_grid/oraagent_grid.log"

Node grac1

$ cat $GRID_HOME/log/grac2/cssd/ocssd.log | egrep 'Removal|evict|30000'
2013-08-01 15:15:10.500: [    CSSD][380536576]clssnmHBInfo: css timestmp 1375362910 500 slgtime 13000084 DTO 27750 (index=1) biggest misstime 110 NTO 28390
2013-08-02 09:52:55.298: [    CSSD][389998336](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node grac1, number 1, sync 269544449, stamp 22996044
2013-08-02 09:52:55.299: [    CSSD][399468288](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node grac1, number 1, sync 269544449, stamp 22996044
2013-08-02 09:52:55.301: [    CSSD][394729216](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node grac1, number 1, sync 269544449, stamp 22996044
2013-08-02 09:53:23.461: [    CSSD][3078596352]clssnmSetParamsFromConfig: remote SIOT 27000, local SIOT 27000, LIOT 200000, misstime 30000, reboottime 3000, impending misstime 15000, voting file reopen delay 4000

$ cat $GRID_HOME/log/grac1/alertgrac1.log
[cssd(2933)]CRS-1612:Network communication with node grac2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.630 seconds
2013-08-02 09:52:49.504
[cssd(2933)]CRS-1611:Network communication with node grac2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 6.630 seconds
2013-08-02 09:52:53.504
[cssd(2933)]CRS-1610:Network communication with node grac2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.630 seconds
2013-08-02 09:52:56.141
[cssd(2933)]CRS-1607:Node grac2 is being evicted in cluster incarnation 269544449; details at (:CSSNM00007:) in /u01/app/11203/grid/log/grac1/cssd/ocssd.log.
2013-08-02 09:52:59.148
[cssd(2933)]CRS-1625:Node grac2, number 2, was manually shut down
2013-08-02 09:52:59.160
[cssd(2933)]CRS-1601:CSSD Reconfiguration complete. Active nodes are grac1 .
2013-08-02 09:52:59.176
[crsd(3205)]CRS-5504:Node down event reported for node 'grac2'.
2013-08-02 09:53:08.632
[crsd(3205)]CRS-2773:Server 'grac2' has been removed from pool 'Generic'.
2013-08-02 09:53:08.632
[crsd(3205)]CRS-2773:Server 'grac2' has been removed from pool 'ora.GRACE2'.

--> Here we can see that grac2 was evicted by grac1. Verify this by reading the related database alert.log files
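Before digging into the database alert logs you can also ask the clusterware itself about the current membership ( run on the surviving node grac1 ):
$ $GRID_HOME/bin/olsnodes -s -t
$ $GRID_HOME/bin/crsctl check cluster -all
$ $GRID_HOME/bin/crsctl status serverpool
--> grac2 should show up as Inactive and be gone from ACTIVE_SERVERS until it rejoins the cluster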

Review alert.logs

alert_GRACE22.log  ( Instance termination due to ASM communication failure )
Node grac2: the database instance gets terminated by ASMB after losing contact with the ASM instance
SKGXP: ospid 3625: network interface with IP address 169.254.86.205 no longer running (check cable)
SKGXP: ospid 3625: network interface with IP address 169.254.86.205 is DOWN
Fri Aug 02 09:52:57 2013
NOTE: ASMB terminating
Errors in file /u01/app/oracle/diag/rdbms/grace2/GRACE22/trace/GRACE22_asmb_3649.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 31 Serial number: 7
Errors in file /u01/app/oracle/diag/rdbms/grace2/GRACE22/trace/GRACE22_asmb_3649.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 31 Serial number: 7
ASMB (ospid: 3649): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 3649
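Once grac2 has rejoined the cluster, verify that ASM and the database instance came back ( a sketch; the database name grace2 and instance name GRACE22 are taken from the trace file paths above ):
$ srvctl status asm -n grac2
$ srvctl status instance -d grace2 -i GRACE22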

alert_GRACE21.log  ( Instance recovery and resource reallocation )
Node grac1: global resources are recovered and instance recovery takes place:
Fri Aug 02 09:53:00 2013
Reconfiguration started (old inc 4, new inc 6)
List of instances:
 1 (myinst: 1) 
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE 
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Fri Aug 02 09:53:00 2013
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info 
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Reconfiguration complete
Fri Aug 02 09:53:00 2013
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
Started redo scan
Completed redo scan
 read 6 KB redo, 4 data blocks need recovery
Started redo application at
 Thread 2: logseq 5, block 26368
Recovery of Online Redo Log: Thread 2 Group 3 Seq 5 Reading mem 0
  Mem# 0: +DATA/grace2/onlinelog/group_3.266.822058427
Completed redo application of 0.00MB
Completed instance recovery at
 Thread 2: logseq 5, block 26380, scn 1541989
 4 data blocks read, 4 data blocks written, 6 redo k-bytes read
Thread 2 advanced to log sequence 6 (thread recovery)
Fri Aug 02 09:53:02 2013
minact-scn: Master considers inst:2 dead
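On grac1 you can confirm which instances remain open after the reconfiguration ( a minimal sketch; assumes the RDBMS environment is set and that GRACE21 is the local SID, as suggested by the alert log names ):
$ export ORACLE_SID=GRACE21
$ sqlplus / as sysdba
SQL> select inst_id, instance_name, status from gv$instance;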

Let's take a short look at the Cluster Health Monitor ( CHM ) data

[ CHM graph: ev1_grac2_eth1_p1 ]

Summary

  • Node grac2 stops reporting after 9:53 as we have stopped the cluster interconnect interface eth1 on grac2
  • Node grac1 shows that all network activities drop to 0 after 9:53 ( see netrr, netwr, neteff ); the raw CHM data behind these graphs can be dumped with oclumon, as sketched below
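A sketch for dumping the CHM node views around the eviction window ( the timestamps match this test; syntax as documented for 11.2, adjust as needed ):
$ $GRID_HOME/bin/oclumon dumpnodeview -n grac1 grac2 -s "2013-08-02 09:50:00" -e "2013-08-02 09:55:00"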

 
