Node Eviction due to failed NHB ( Network Heartbeat )

Table of Contents

Triggering a NHB Note Eviction on Node grac2

$ ifconfig eth1 down
You can increase CSSD log level to 3 to trace each each heartbeat message: ( do this on all involved notes )
# $GRID_HOME/bin/crsctl set log css CSSD:3
Set CSSD Module: CSSD Log Level: 3
# $GRID_HOME/bin/crsctl get log css CSSD
Get CSSD Module: CSSD Log Level: 3

Check the current CSS timeout values
# $GRID_HOME/bin/crsctl get css misscount
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.

Review CRS Logfile

$GRID_HOME/log/grac1/cssd/ocssd.log
$GRID_HOME/log/grac2/cssd/ocssd.log
$GRID_HOME/log/grac1/alertgrac1.log
$GRID_HOME/log/grac2/alertgrac2.log
alert_GRACE21.log
alert_GRACE22.log

Check ocssd.log on both RAC Nodes
Node grac2

$ cat $GRID_HOME/log/grac2/cssd/ocssd.log | egrep -i 'Removal|evict|30000|network HB'
2013-08-01 15:15:10.500: [    CSSD][380536576]clssnmHBInfo: css timestmp 1375362910 500 slgtime 13000084 DTO 27750 (index=1) biggest misstime 110 NTO 28390
2013-08-02 09:52:43.287: [    CSSD][382113536]clssnmPollingThread: node grac1 (1) at 50% heartbeat fatal, removal in 14.960 seconds
2013-08-02 09:52:51.294: [    CSSD][382113536]clssnmPollingThread: node grac1 (1) at 75% heartbeat fatal, removal in 6.950 seconds
2013-08-02 09:52:55.296: [    CSSD][382113536]clssnmPollingThread: node grac1 (1) at 90% heartbeat fatal, removal in 2.950 seconds, seedhbimpd 1
2013-08-02 09:52:55.296: [    CSSD][382113536]clssnmPollingThread: node grac1 (1) at 90% heartbeat fatal, removal in 2.950 seconds, seedhbimpd 1
2013-08-02 09:52:55.297: [    CSSD][388421376]clssnmvDHBValidateNCopy: node 1, grac1, has a disk HB, but no network HB, DHB has rcfg 269544449, wrtcnt, 547793, LATS 2298
6504, lastSeqNo 547787, uniqueness 1375343446, timestamp 1375429975/22995684
2013-08-02 09:52:55.298: [    CSSD][389998336](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node grac1, number 1, sync 269544449, stamp 22996044
2013-08-02 09:52:55.299: [    CSSD][399468288](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node grac1, number 1, sync 269544449, stamp 22996044
2013-08-02 09:52:55.301: [    CSSD][394729216](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node grac1, number 1, sync 269544449, stamp 22996044
2013-08-02 09:52:56.297: [    CSSD][388421376]clssnmvDHBValidateNCopy: node 1, grac1, has a disk HB, but no network HB, DHB has rcfg 269544449, wrtcnt, 547799, LATS 2298
7504, lastSeqNo 547793, uniqueness 1375343446, timestamp 1375429976/22996684
2013-08-02 09:52:57.300: [    CSSD][388421376]clssnmvDHBValidateNCopy: node 1, grac1, has a disk HB, but no network HB, DHB has rcfg 269544449, wrtcnt, 547805, LATS 2298
8504, lastSeqNo 547799, uniqueness 1375343446, timestamp 1375429977/22997694
2013-08-02 09:53:23.461: [    CSSD][3078596352]clssnmSetParamsFromConfig: remote SIOT 27000, local SIOT 27000, LIOT 200000, misstime 30000, reboottime 3000, impending mi
sstime 15000, voting file reopen delay 4000

GRID_HOME/log/grac2/alertgrac2.log
[cssd(2869)]CRS-1611:Network communication with node grac1 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 6.950 seconds
2013-08-02 09:52:55.296
[cssd(2869)]CRS-1610:Network communication with node grac1 (1) missing for 90% of timeout interval.  Removal of this node from cluster in 2.950 seconds
2013-08-02 09:52:55.298
[cssd(2869)]CRS-1608:This node was evicted by node 1, grac1; details at (:CSSNM00005:) in /u01/app/11203/grid/log/grac2/cssd/ocssd.log.
2013-08-02 09:52:55.298
[cssd(2869)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11203/grid/log/grac2/cssd/ocssd.log
2013-08-02 09:52:55.298
[cssd(2869)]CRS-1652:Starting clean up of CRSD resources.
2013-08-02 09:52:55.298
[cssd(2869)]CRS-1608:This node was evicted by node 1, grac1; details at (:CSSNM00005:) in /u01/app/11203/grid/log/grac2/cssd/ocssd.log.
2013-08-02 09:52:55.301
[cssd(2869)]CRS-1608:This node was evicted by node 1, grac1; details at (:CSSNM00005:) in /u01/app/11203/grid/log/grac2/cssd/ocssd.log.
2013-08-02 09:52:56.573
[/u01/app/11203/grid/bin/oraagent.bin(3255)]CRS-5016:Process "/u01/app/11203/grid/opmn/bin/onsctli" spawned by agent "/u01/app/11203/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11203/grid/log/grac2/agent/crsd/oraagent_grid/oraagent_grid.log"

Node grac1

$ cat $GRID_HOME/log/grac2/cssd/ocssd.log | egrep 'Removal|evict|30000'
2013-08-01 15:15:10.500: [    CSSD][380536576]clssnmHBInfo: css timestmp 1375362910 500 slgtime 13000084 DTO 27750 (index=1) biggest misstime 110 NTO 28390
2013-08-02 09:52:55.298: [    CSSD][389998336](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node grac1, number 1, sync 269544449, stamp 22996044
2013-08-02 09:52:55.299: [    CSSD][399468288](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node grac1, number 1, sync 269544449, stamp 22996044
2013-08-02 09:52:55.301: [    CSSD][394729216](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node grac1, number 1, sync 269544449, stamp 22996044
2013-08-02 09:53:23.461: [    CSSD][3078596352]clssnmSetParamsFromConfig: remote SIOT 27000, local SIOT 27000, LIOT 200000, misstime 30000, reboottime 3000, impending misstime 15000, voting file reopen delay 4000

cat $GRID_HOME/log/grac1/alertgrac1.log
cssd(2933)]CRS-1612:Network communication with node grac2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.630 seconds
2013-08-02 09:52:49.504
[cssd(2933)]CRS-1611:Network communication with node grac2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 6.630 seconds
2013-08-02 09:52:53.504
[cssd(2933)]CRS-1610:Network communication with node grac2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.630 seconds
2013-08-02 09:52:56.141
[cssd(2933)]CRS-1607:Node grac2 is being evicted in cluster incarnation 269544449; details at (:CSSNM00007:) in /u01/app/11203/grid/log/grac1/cssd/ocssd.log.
2013-08-02 09:52:59.148
[cssd(2933)]CRS-1625:Node grac2, number 2, was manually shut down
2013-08-02 09:52:59.160
[cssd(2933)]CRS-1601:CSSD Reconfiguration complete. Active nodes are grac1 .
2013-08-02 09:52:59.176
[crsd(3205)]CRS-5504:Node down event reported for node 'grac2'.
2013-08-02 09:53:08.632
[crsd(3205)]CRS-2773:Server 'grac2' has been removed from pool 'Generic'.
2013-08-02 09:53:08.632
[crsd(3205)]CRS-2773:Server 'grac2' has been removed from pool 'ora.GRACE2'.

--> Here we know that grac2 was evicted by grac1. Verify this by reading related alert.log files

Review alert.logs

alert_GRACE21.log  ( ASM termination )
Note grac2: ASM instance gets terminated
SKGXP: ospid 3625: network interface with IP address 169.254.86.205 no longer running (check cable)
SKGXP: ospid 3625: network interface with IP address 169.254.86.205 is DOWN
Fri Aug 02 09:52:57 2013
NOTE: ASMB terminating
Errors in file /u01/app/oracle/diag/rdbms/grace2/GRACE22/trace/GRACE22_asmb_3649.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 31 Serial number: 7
Errors in file /u01/app/oracle/diag/rdbms/grace2/GRACE22/trace/GRACE22_asmb_3649.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 31 Serial number: 7
ASMB (ospid: 3649): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 3649

alert_GRACE22.log ( Instance recovery and resource reallocation 
Node grac1 Global resources are recovered  and Instance Recovery takes place: 
ri Aug 02 09:53:00 2013
Reconfiguration started (old inc 4, new inc 6)
List of instances:
 1 (myinst: 1) 
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE 
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Fri Aug 02 09:53:00 2013
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info 
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Reconfiguration complete
Fri Aug 02 09:53:00 2013
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
Started redo scan
Completed redo scan
 read 6 KB redo, 4 data blocks need recovery
Started redo application at
 Thread 2: logseq 5, block 26368
Recovery of Online Redo Log: Thread 2 Group 3 Seq 5 Reading mem 0
  Mem# 0: +DATA/grace2/onlinelog/group_3.266.822058427
Completed redo application of 0.00MB
Completed instance recovery at
 Thread 2: logseq 5, block 26380, scn 1541989
 4 data blocks read, 4 data blocks written, 6 redo k-bytes read
Thread 2 advanced to log sequence 6 (thread recovery)
Fri Aug 02 09:53:02 2013
minact-scn: Master considers inst:2 dead

Lets have a short look on Cluster Health monitor files ( CHM)

Summary

Node grac2 stops reporting after 9:53 as we have stopped the Cluster interconnect interface eht1 on grac2
Note grac1 shows that all Network activies drops to 0 after 9:53 ( see netrr, netwr, neteff )

3 thoughts on “Node Eviction due to failed NHB ( Network Heartbeat )”

What’s up it’s me, I am also visiting this web site daily, this web page is genuinely
good and the people are genuinely sharing nice thoughts.

Hi peer,
Can we increase the time interval network heartbeat.

Regards

admin says:

November 16, 2019 at 1:48 pm

Here is a good descripion how to do it :
https://sort.veritas.com/public/documents/sfha/6.0.1/linux/productguides/html/sfrac_install/ch23s05.htm

But note this high value should be used only if you use a vendor clusterware.

Reply

Triggering a NHB Note Eviction on Node grac2

Review CRS Logfile

Review alert.logs

Lets have a short look on Cluster Health monitor files ( CHM)

3 thoughts on “Node Eviction due to failed NHB ( Network Heartbeat )”

Leave a Reply Cancel reply