Overview
- The ocssd.bin process is multithreaded and monitors both the Disk Heartbeat (DHB) and the Network Heartbeat (NHB)
- If there is no Network Heartbeat for more than 30 seconds, the clssnmvKillBlockThread thread kills the local CRS
- Best practice is to use the private interface for the Cluster Interconnect, but this can be overridden
- Other key processes
  - LMS, LMD - if they can't communicate with their counterparts they will initiate a reconfiguration
  - LMON - most important for Instance Eviction - implements IMR and drives reconfiguration

Node Evictions
- Top reasons
  - Communication errors (network related)
  - Memory starvation (paging / swapping)
  - CPU problems (scheduler problems / CPU load)
- Other reasons
  - Node membership change due to a split-brain issue
- Instance Eviction related bugs:
  - Bug 16876500 - GI HAIP AGENT DROPS A ROUTE FREQUENTLY AND THAT LEADS TO THE INSTANCE EVICTION
  - Bug 14385860 - SOL.SPARC64 : CLSRSC-257: CLUSTER TIME SYNCHRONIZATION SERVICE START IN EXCLUSIV

Cluster Reconfiguration - CGS Cluster Group Service
- CGS (Cluster Group Service) tracks which instances are members of a cluster
- CGS validates all members and updates the control file periodically
- A failure leads to an Instance Membership Reconfiguration (IMR)
- CGS is responsible for GMS (Group Membership Synchronization layer) and IMR

Important Logs and Trace files
- Alert logs from all instances (Cluster alert.log, RDBMS alert.log, ASM alert.log)
- ocssd.log files from all instances (note: older files are named ocssd.l01, .. )
- LMON, LMSn, LMD0 traces from all instances
- Any other traces mentioned in any alert.log
- lmhb traces (LMHB monitors the LMON, LMD, and LMSn processes to ensure they are running normally without blocking or spinning)
- CHM and OSWatcher logs from the eviction time
- OS message logs from all nodes (/var/log/messages for Linux)

Review traces in the following order
- Cluster alert logs from all instances
- Database and ASM alert logs
- LMON traces from all instances
- Any trace mentioned in alert.log (especially LMS and LMD, often also LMHB and DIA*)
- For communication-related evictions
  - Review OSWatcher netstat and prvtnet data and CPU/paging statistics
  - Review CHM traces
- Review OS logs (/var/log/messages for Linux)
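
The review order above can be applied mechanically once the files from all nodes have been copied into one place. The following is a minimal Python sketch, not an Oracle tool: the function name review_order, the RULES table and all file name patterns (alert<host>.log for the cluster alert log, alert_<SID>.log for database and ASM, <SID>_lmon_<pid>.trc for traces, *_netstat_*.dat for OSWatcher) are assumptions about typical naming and will need adjusting per site. CHM data is not matched because its file naming is not covered here; such files, like ocssd.log, end up in the "unclassified" bucket.

```python
import re
from pathlib import PurePosixPath

# Review steps in the order given in the notes. The patterns are
# ASSUMED file naming conventions, matched against the lower-cased
# base name; adjust them to the files actually collected.
RULES = [
    (1, "cluster alert log", re.compile(r"^alert[^_].*\.log$")),
    (2, "database/ASM alert log", re.compile(r"^alert_.*\.log$")),
    (3, "LMON trace", re.compile(r"_lmon_\d+\.trc$")),
    (4, "LMS/LMD/LMHB/DIA trace",
     re.compile(r"_(lms\w|lmd\d|lmhb|dia\w)_\d+\.trc$")),
    (5, "OSWatcher data",
     re.compile(r"_(netstat|prvtnet|vmstat|top)_.*\.dat$")),
    (6, "OS message log", re.compile(r"^messages")),
]
UNCLASSIFIED = 99  # anything no rule matched is reviewed last


def review_order(paths):
    """Return (step, label, path) tuples sorted by review step."""
    ranked = []
    for path in paths:
        name = PurePosixPath(path).name.lower()
        for step, label, pattern in RULES:
            if pattern.search(name):
                ranked.append((step, label, path))
                break
        else:
            # No rule matched this file name.
            ranked.append((UNCLASSIFIED, "unclassified", path))
    return sorted(ranked)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    collected = [
        "/var/log/messages",
        "orcl1_lmon_1234.trc",
        "alert_orcl1.log",
        "alertnode1.log",
    ]
    for step, label, path in review_order(collected):
        print(step, label, path)
```

Files from several nodes sort together within the same step, which matches the advice to read each kind of log from all instances before moving on to the next kind.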