Node Eviction – system reboot after CSSD process hangs

Overview Node Eviction

in 11gR2, we have many new background processes including two cssd related processes cssdmonitor and cssdagent. Cssdagent process is charge of respawn cssd.bin. Oracle has improved the node fencing in Oracle 11g Release 2 (11.2.0.2) by killing the processes on the failed node that are capable of performing IO and then stopping the Clusterware on the failed node rather than simply rebooting the failed node ( == Instance Eviction ). This works for scenarios like missing Disk and Network Heartbeats but if CSSD process hangs we got a system reboot ( == Node eviction ).

Setup Node Eviction

Find ocsd.bin process ID
#  ps -ef|grep cssd 
root      2849     1  0 13:54 ?        00:00:00 /u01/app/11203/grid/bin/cssdmonitor
root      2868     1  0 13:54 ?        00:00:00 /u01/app/11203/grid/bin/cssdagent
grid      2882     1  0 13:54 ?        00:00:05 /u01/app/11203/grid/bin/ocssd.bin 
root      4310  4291  0 14:06 pts/2    00:00:00 grep cssd

Send STOP signal to ocssd.bin process 
# kill -SIGSTOP 2882 
After a few seconds the system was rebooted.
Check lastgasp file after reboot 
# cd /etc/oracle/lastgasp
# more cssmonit_grac2.lgl
OLG01,0301,0000,GRACE2,2ae3c3415014ef2abf2ff662c5bf8512,grac2,cssmonit,L-2013-08
-03-13:53:56.464,Rebooting after limit 28200 exceeded; disk timeout 28200, netwo
rk timeout 27920, last heartbeat from CSSD at epoch seconds 1375530808.212, 2825
1 milliseconds ago based on invariant clock value of 22202483  

--> Reboot at 2013-08-03-13:53:56.464

Check cssdagent logfile: $GRID_HOME/log/grac2/agent/ohasd/oracssdagent_root
2013-08-03 13:53:53.608: [ USRTHRD][4160747264] (:CLSN00111:)clsnproc_needreboot: Impending reboot at 90% of limit 28200; disk timeout 28200, network timeout 27920, 
          last heartbeat from CSSD at epoch seconds 1375530808.212, 25392 milliseconds ago based on invariant clock 22202492; now polling at 100 ms
2013-08-03 13:53:53.608: [ USRTHRD][4160747264] clsnwork_queue: posting worker thread
2013-08-03 13:53:53.608: [ USRTHRD][4160747264] clsnwork_queue: posting worker thread
2013-08-03 13:53:53.608: [ USRTHRD][88942336] clsnwork_process_work: calling sync
2013-08-03 13:53:53.687: [ USRTHRD][88942336] clsnwork_process_work: sync completed
2013-08-03 13:53:53.687: [ USRTHRD][88942336] clsnSyncComplete: posting omon
2013-08-03 13:53:53.688: [ USRTHRD][88942336] clsnwork_process_work: Unknown work type

Check csssmonitor logfile: $GRID_HOME/log/grac2/agent/ohasd/oracssdmonitor_root
2013-08-03 13:53:53.655: [ USRTHRD][878024448] (:CLSN00111:)clsnproc_needreboot: Impending reboot at 90% of limit 28200; 
           disk timeout 28200, network timeout 27920,
          last heartbeat from CSSD at epoch seconds 1375530808.212, 25441 milliseconds ago based on invariant clock 22202483; now polling at 100 ms
2013-08-03 13:53:53.655: [ USRTHRD][878024448] clsnwork_queue: posting worker thread
2013-08-03 13:53:53.656: [ USRTHRD][878024448] clsnwork_queue: posting worker thread
2013-08-03 13:53:53.656: [ USRTHRD][874870528] clsnwork_process_work: calling sync
2013-08-03 13:54:31.472: [    AGFW][1279174432] Starting the agent: /u01/app/11203/grid/log/grac2/agent/ohasd/oracssdmonitor_root/

Check clusterware alertlog file: $GRID_HOME/log/grac2/alertgrac2.log
2013-08-03 10:55:10.426
[crsd(3080)]CRS-2765:Resource 'ora.grace2.db' has failed on server 'grac2'.
   <-- reboot occurs here - note there is  entry logged for the system reboot in clusterware alertlog
2013-08-03 13:54:30.831
[ohasd(2500)]CRS-2112:The OLR service started on node grac2.
2013-08-03 13:54:30.892
[ohasd(2500)]CRS-1301:Oracle High Availability Service started on node grac2.
2013-08-03 13:54:30.937
[ohasd(2500)]CRS-8011:reboot advisory message from host: grac2, component: cssmonit, with time stamp: L-2013-08-03-13:53:56.464
[ohasd(2500)]CRS-8013:reboot advisory message text: Rebooting after limit 28200 exceeded; disk timeout 28200, 
          network timeout 27920, last heartbeat from CSSD at  epoch seconds 1375530808.212, 28251 milliseconds ago based on 
          invariant clock value of 22202483
2013-08-03 13:54:30.937
[ohasd(2500)]CRS-8017:location: /etc/oracle/lastgasp has 4 reboot advisory log files, 1 were announced and 0 errors occurred
2013-08-03 13:54:32.379
[/u01/app/11203/grid/bin/oraagent.bin(2589)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in 
           "/u01/app/11203/grid/log/grac2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2013-08-03 13:54:46.400
[gpnpd(2810)]CRS-2328:GPNPD started on node grac2. 
2013-08-03 13:54:49.119
[cssd(2882)]CRS-1713:CSSD daemon is started in clustered mode

/var/log/messages
Aug  3 13:54:27 grac2 abrtd: Init complete, entering main loop
Aug  3 13:54:27 grac2 logger: Oracle HA daemon is enabled for autostart.
Aug  3 13:54:27 grac2 logger: exec /u01/app/11203/grid/perl/bin/perl -I/u01/app/11203/grid/perl/lib /u01/app/11203/grid/bin/crswrapexece.pl 
    /u01/app/11203/grid/crs/install/s_crsconfig_grac2_env.txt /u01/app/11203/grid/bin/ohasd.bin "reboot"

Leave a Reply

Your email address will not be published. Required fields are marked *