Debug Node Eviction

Overview
- The ocssd.bin process is multithreaded and monitors both the Disk Heartbeat ( DHB ) and the Network Heartbeat ( NHB )
- If there is no Network Heartbeat for more than 30 seconds, the clssnmvKillBlockThread thread kills the local CRS
- Best practice is to use the private interface for the Cluster Interconnect, but this can be overridden
- Other key processes
    - LMS, LMD - if they can't communicate with their counterparts they will initiate a reconfiguration
    - LMON - most important for Instance Eviction - implements IMR and drives reconfiguration
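
The heartbeat timeout described above can be pictured with a small sketch. This is only an illustration of the idea and not how ocssd.bin is implemented: the function name, the node names and the timestamps are all made up, and the 30 second threshold is simply the value quoted in the overview.

```python
from datetime import datetime, timedelta

# Threshold taken from the text above: 30 seconds without a network heartbeat.
NHB_TIMEOUT = timedelta(seconds=30)


def nodes_missing_heartbeat(last_seen, now, timeout=NHB_TIMEOUT):
    """Return (sorted) the nodes whose last heartbeat is older than the timeout.

    last_seen maps a node name to the time its last heartbeat arrived.
    Hypothetical helper for illustration only.
    """
    return sorted(node for node, ts in last_seen.items() if now - ts > timeout)


# Made-up sample data: node1 was seen 5 s ago, node2 35 s ago.
now = datetime(2024, 1, 1, 12, 0, 40)
last_seen = {
    "node1": datetime(2024, 1, 1, 12, 0, 35),
    "node2": datetime(2024, 1, 1, 12, 0, 5),
}

print(nodes_missing_heartbeat(last_seen, now))
```

In this toy model only node2 exceeds the threshold, so it is the one that would be a candidate for eviction.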

Top reasons for Node Evictions
  - Communication Errors ( Network related )
  - Memory starvation ( Paging Swapping )
  - CPU problems ( Scheduler problems / CPU load )
  - Other reasons
    - Node membership change due to a split-brain issue
    - Instance Eviction related Bugs : 
      Bug 16876500 - GI HAIP AGENT DROPS A ROUTE FREQUENTLY AND THAT LEADS TO THE INSTANCE EVICTION 
      Bug 14385860 - SOL.SPARC64 : CLSRSC-257: CLUSTER TIME SYNCHRONIZATION SERVICE START IN EXCLUSIV 

Cluster Reconfiguration - CGS Cluster Group Service
 - CGS ( Cluster Group Service ) tracks which instances are members of the cluster
 - CGS validates all members and updates the control file periodically
 - A failure leads to an Instance Membership Reconfiguration ( IMR )
 - CGS is responsible for GMS ( Group Membership Synchronization layer ) and IMR

Important Logs and Trace files
  - Alert logs from all instances ( Cluster alert.log, Rdbms alert.log, ASM alert.log )
  - ocssd.log files from all nodes ( note: older files are named ocssd.l01, .. )
  - LMON, LMSn, LMD0 traces from all instances  
  - Any other traces mentioned in any alert.log
  - lmhb traces ( LMHB monitors LMON, LMD, and LMSn processes to ensure they are running normally without blocking or spinning )  
  - CHM and OSWatcher logs from the eviction time
  - OS message logs from all nodes ( /var/log/messages for Linux )

Review traces in the following order
  - Cluster Alert logs from all instances
  - Database and ASM alert logs
  - LMON traces  from all instances
  - Any trace mentioned in alert.log ( especially LMS, LMD; often also LMHB, DIA* )
  - For Communication related Evictions
     - Review OSWatcher netstat and prvtnet output as well as CPU/Paging data
     - Review CHM traces
     - Review OS logs ( /var/log/messages for Linux )
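
When working through the logs in this order it often helps to line up events from different sources on one timeline around the eviction time. The sketch below assumes the events have already been extracted by hand or by a separate parser into (timestamp, source, message) tuples, because timestamp formats differ between the logs. The helper names and the sample events are hypothetical and do not reproduce real log messages.

```python
import heapq
from datetime import datetime


def merge_timeline(*sources):
    """Merge per-log event lists (each already sorted by time) into one timeline.

    Every event is a (timestamp, source, message) tuple.
    """
    return list(heapq.merge(*sources, key=lambda event: event[0]))


def around(events, when, seconds=60):
    """Keep only the events within +/- seconds of the given time."""
    return [e for e in events if abs((e[0] - when).total_seconds()) <= seconds]


# Made-up sample events, placeholders for lines taken from the real logs.
cluster_alert = [
    (datetime(2024, 1, 1, 12, 0, 10), "cluster alert", "event A"),
    (datetime(2024, 1, 1, 12, 0, 50), "cluster alert", "event C"),
]
lmon_trace = [
    (datetime(2024, 1, 1, 12, 0, 30), "LMON trace", "event B"),
    (datetime(2024, 1, 1, 12, 5, 0), "LMON trace", "event D"),
]

timeline = merge_timeline(cluster_alert, lmon_trace)
eviction_time = datetime(2024, 1, 1, 12, 0, 30)  # assumed eviction time
for ts, source, message in around(timeline, eviction_time, seconds=30):
    print(ts.isoformat(), source, message)
```

Passing the per-log lists in the review order given above keeps events with identical timestamps in that same order, since heapq.merge is stable across its inputs.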
