Troubleshooting Clusterware startup problems with DTRACE

Before setting up DTRACE you may want to run through the steps below

Clean up the special socket files in /var/tmp/.oracle

Either reboot your OS, or clean up the socket files and restart the CRS stack :
[root@hract21 Desktop]# crsctl stop crs -f
[root@hract21 Desktop]# rm -rf /var/tmp/.oracle/*
[root@hract21 Desktop]# crsctl start crs
 CRS-4123: Oracle High Availability Services has been started.

Note: A complete OS reboot may be needed to fix hanging processes waiting in DISKWAIT state.
      If possible always try to do an OS reboot. 
      An OS reboot will always clean up /var/tmp/.oracle/*

Quickly verify your OS with a simple sh script : chk_os.sh

#!/bin/bash 
NS=ns1.example.com
HOSTNAME1=hract21.example.com
HOSTNAME2=hract22.example.com
PRIV_IP1=192.168.2.121
PRIV_IP2=192.168.2.122
PUBLIC_IF=eth1
PRIVATE_IF=eth2

echo ""
echo "Disk Space : "
df

echo ""
echo "Major Clusterware Executable Protections : "
ls -l $GRID_HOME/bin/ohasd*
ls -l $GRID_HOME/bin/orarootagent*
ls -l $GRID_HOME/bin/oraagent*
ls -l $GRID_HOME/bin/mdnsd*
ls -l $GRID_HOME/bin/evmd*
ls -l $GRID_HOME/bin/gpnpd*
ls -l $GRID_HOME/bin/evmlogger*
ls -l $GRID_HOME/bin/osysmond.*
ls -l $GRID_HOME/bin/gipcd*
ls -l $GRID_HOME/bin/cssdmonitor*
ls -l $GRID_HOME/bin/cssdagent*
ls -l $GRID_HOME/bin/ocssd*
ls -l $GRID_HOME/bin/octssd*
ls -l $GRID_HOME/bin/crsd
ls -l $GRID_HOME/bin/crsd.bin
ls -l $GRID_HOME/bin/tnslsnr


echo ""
echo "Ping Nameserver: "
ping -c 2  $NS 

echo ""
echo "Test your PUBLIC interface and your nameserver setup"
nslookup $HOSTNAME
ping -I $PUBLIC_IF -c 2   $HOSTNAME1
ping -I $PUBLIC_IF -c 2   $HOSTNAME2
 
ping -I $PRIVATE_IF -c 2   $PRIV_IP1 
ping -I $PRIVATE_IF -c 2   $PRIV_IP2

echo ""
echo "Verify protections for HOSTNAME.pid files should be : 644"
find $GRID_HOME -name hract21.pid  -exec ls -l {} \; 

echo ""
echo "Service iptables and avahi-daemon should not run - avahi-daemon uses CW port 5353 "
service iptables status
ps -elf | grep avahi | grep -v grep

echo ""
echo "Ports :53 :5353 :42422 :8888 should not be used by NON-Clusterware processes "
echo "  - OC4J reports : tcp   0 0 ::ffff:127.0.0.1:8888  :::*  LISTEN   501 67433979  2580/java"           
netstat -taupen | egrep ":53 |:5353 |:42424 |:8888 "

echo ""
echo "Compare profile.xml the IP Address of PUBLIC and PRIVATE Interfaces "
echo " - Devices should report UP BROADCAST RUNNING MULTICAST "
echo " - Double check NETWORK addresses matches profile.xml settings   "
echo ""
$GRID_HOME/bin/gpnptool get 2>/dev/null  |  xmllint --format - | egrep 'CSS-Profile|ASM-Profile|Network id'
echo ""
ifconfig $PUBLIC_IF | egrep 'eth|inet addr|MTU'
echo ""
ifconfig $PRIVATE_IF | egrep 'eth|inet addr|MTU'

echo "Checking ASM disk status for disk named /dev/asm ...  - you may need to changes this "
ls -l  /dev/asm*

echo ""
echo "Verify ASM disk "
su - grid -c "ssh $HOSTNAME2 ocrcheck"
su - grid -c "ssh $HOSTNAME2  asmcmd lsdsk -k"
echo ""
su - grid -c "kfed read /dev/asmdisk1_10G | grep name"
echo ""
su - grid -c "kfed read /dev/asmdisk2_10G | grep name"
echo ""
su - grid -c "kfed read /dev/asmdisk3_10G | grep name"
echo ""
su - grid -c "kfed read /dev/asmdisk4_10G | grep name"
echo ""


Output:
..
Ports :53 :5353 :42424 :8888 should not be used by NON-Clusterware processes 
  - OC4J reports : tcp   0 0 ::ffff:127.0.0.1:8888  :::*  LISTEN   501 67433979  2580/java
udp        0      0 0.0.0.0:5353                0.0.0.0:*    501        54383580   28618/mdnsd.bin     
udp        0      0 0.0.0.0:5353                0.0.0.0:*    501        54383565   28618/mdnsd.bin     
udp        0      0 0.0.0.0:5353                0.0.0.0:*    501        54383564   28618/mdnsd.bin     
udp        0      0 0.0.0.0:5353                0.0.0.0:*    501        54383563   28618/mdnsd.bin     
udp        0      0 192.168.2.255:42424         0.0.0.0:*    0          54429417   28502/ohasd.bin     
udp        0      0 230.0.1.0:42424             0.0.0.0:*    0          54429416   28502/ohasd.bin     
udp        0      0 224.0.0.251:42424           0.0.0.0:*    0          54429415   28502/ohasd.bin     
udp        0      0 192.168.2.255:42424         0.0.0.0:*    501        54412444   28827/ocssd.bin     
udp        0      0 230.0.1.0:42424             0.0.0.0:*    501        54412443   28827/ocssd.bin     
udp        0      0 224.0.0.251:42424           0.0.0.0:*    501        54412442   28827/ocssd.bin     
udp        0      0 192.168.2.255:42424         0.0.0.0:*    501        54406273   28742/gipcd.bin     
udp        0      0 230.0.1.0:42424             0.0.0.0:*    501        54406272   28742/gipcd.bin     
udp        0      0 224.0.0.251:42424           0.0.0.0:*    501        54406271   28742/gipcd.bin     
udp        0      0 192.168.5.58:53             0.0.0.0:*    0          67400781   2472/gnsd.bin 
tcp        0      0 ::ffff:127.0.0.1:8888        LISTEN      501        67433979   2580/java  
--> mdnsd.bin is using port 5353
    ohasd.bin, ocssd.bin, gipcd.bin are using port 42424
    oc4j is using port 8888
    GNS is using port 53

Compare the IP addresses of the PUBLIC and PRIVATE interfaces against profile.xml 
 - Devices should report UP BROADCAST RUNNING MULTICAST 
 - Double check that the NETWORK addresses match the profile.xml settings   
    <gpnp:HostNetwork id="gen" HostName="*">
      <gpnp:Network id="net1" IP="192.168.5.0" Adapter="eth1" Use="public"/>
      <gpnp:Network id="net2" IP="192.168.2.0" Adapter="eth2" Use="asm,cluster_interconnect"/>
  <orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/>
  <orcl:ASM-Profile id="asm" DiscoveryString="/dev/asm*" SPFile="+DATA/ract2/ASMPARAMETERFILE/registry.253.870352347" Mode="remote"/>

eth1      Link encap:Ethernet  HWaddr 08:00:27:7D:8E:49  
          inet addr:192.168.5.121  Bcast:192.168.5.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth2      Link encap:Ethernet  HWaddr 08:00:27:4E:C9:BF  
          inet addr:192.168.2.121  Bcast:192.168.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

  --> IP="192.168.5.0" Adapter="eth1" should match --> eth1 : inet addr:192.168.5.121  Bcast:192.168.5.255  Mask:255.255.255.0 
      IP="192.168.2.0" Adapter="eth2" should match --> eth2 : inet addr:192.168.2.121  Bcast:192.168.2.255  Mask:255.255.255.0

Run Cluvfy Debug script : cluv.sh

#!/bin/bash 
#
#  cluvfy commands to debug a partially up and running CRS stack 
# 
#  Node1  hract21:  CRS stack is not starting
#  Node2  hract22:  CRS stack is fully up and running 
#
#  Note : This script assumes we are logged in on the failing node : hract21
#
Node1=hract21
Node2=hract22

~/CLUVFY/bin/cluvfy -version
~/CLUVFY/bin/cluvfy comp  nodecon  -n $Node1,$Node2 -i  eth1
~/CLUVFY/bin/cluvfy comp  nodecon  -n $Node1,$Node2 -i  eth2
~/CLUVFY/bin/cluvfy comp nodereach -n $Node1,$Node2 
~/CLUVFY/bin/cluvfy comp sys -p crs
#
# Some of the cluvfy checks need a working CRS stack :
# 
# When clusterware is not up and running, cluvfy comp healthcheck may report the errors below - these errors are expected 
#    Cluster Manager Integrity     FAILED     This test checks the integrity of cluster manager across the cluster nodes.
#    Cluster Integrity         FAILED     This test checks the integrity of the cluster.
#    OCR Integrity             FAILED     This test checks the integrity of OCR across the cluster nodes.
#    CRS Integrity             FAILED     This test checks the integrity of Oracle Clusterware stack across the cluster nodes.
#
ssh $Node2 ~/CLUVFY/bin/cluvfy comp healthcheck -collect cluster -html
ssh $Node2 ~/CLUVFY/bin/cluvfy comp software -n  $Node1,$Node2 
# Testing multicast 
ssh $Node2 ~/CLUVFY/bin/cluvfy stage -post hwos -n $Node1
ssh $Node2 ~/CLUVFY/bin/cluvfy stage -pre crsinst -n $Node1 -networks eth1:192.168.5.0:PUBLIC/eth2:192.168.2.0:cluster_interconnect -verbose

Review Clusterware Tracefiles

Review the following traces on the failed Node :
   /var/log/messages
   /u01/app/grid/diag/crs/hract21/crs/trace/alert.log

Search the CW tracefiles [ these commands scan entries for a 10-minute time period ]
# grep "2015-02-16 09:2" *  | egrep "no exe permission|failed to start process"
# grep "2015-02-16 10:3" *  | egrep " fatal | terminating "
# grep "2015-02-17 11:4" *  | grep gipcmodNetworkProcessBind
# grep "2015-02-17 14:1" *  | grep gipcmodNetworkResolve

Debug Clusterware startup problems with DTRACE

Why use DTRACE ?

  • Using DTRACE scripts can save you many hours of debugging
  • DTRACE scripts can be used to capture intermittent errors
  • DTRACE can easily be configured to track very specific things like:   [ How many bytes were written by the 5th I/O for file XXXX after it was newly created ]
  • DTRACE can provide very detailed information about system call usage and timing
  • DTRACE is the next level of debugging on Linux - much better than strace
  • Note :  Not all errors reported by DTRACE are significant or prevent CW from booting
  • Best practice is to test the startup on a working node to figure out what is relevant and what is not

If you change a lot of things for testing, and you have many RAC instances running
  where lots of people are testing, DTRACE will become your best friend !
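
A quick taste of what this looks like - the one-liner below (a minimal sketch, not the check_rac.d script discussed later) prints every failed open() call on the box together with the process name and the negative errno value; stop it with CTRL-C:

# dtrace -n 'syscall::open:return /arg0 < 0/ { printf("%s: open() failed with error %d", execname, arg0); }'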

 

Versions used and OS error codes handled by the DTRACE script

  • OEL 6.6 with Linux hract21.example.com 3.8.13-55.1.2.el6uek.x86_64
  • Oracle 12.1.0.2
Installed Dtrace packages:
[root@hract21 Desktop]# rpm -qa | grep dtrace
dtrace-utils-0.4.5-2.el6.x86_64
dtrace-utils-devel-0.4.5-2.el6.x86_64
libdtrace-ctf-0.4.1-1.x86_64
dtrace-modules-3.8.13-44.1.1.el6uek-0.4.3-4.el6.x86_64
libdtrace-ctf-devel-0.4.1-1.x86_64
dtrace-modules-headers-0.4.3-4.el6.x86_64
dtrace-modules-3.8.13-55.1.2.el6uek-0.4.3-4.el6.x86_64

Error Codes tracked by dtrace script: 
/usr/include/asm-generic/errno-base.h

#define    ENOENT        2    /* No such file or directory */
#define    ENXIO         6    /* No such device or address */
#define    EAGAIN        11    /* Try again */
#define    EACCES        13    /* Permission denied */

/usr/include/asm-generic/errno.h
#define    EADDRINUSE       98    /* Address already in use */
#define    EADDRNOTAVAIL    99    /* Cannot assign requested address */
#define    ECONNREFUSED    111    /* Connection refused */
#define    EHOSTUNREACH    113    /* No route to host */
#define    EINPROGRESS     115    /* Operation now in progress */
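
When a DTRACE probe later reports a negative return code, the matching symbolic name can be looked up directly in these header files. For example, an error of -13 in the DTRACE output maps to EACCES / Permission denied:

# grep -w 13 /usr/include/asm-generic/errno-base.h /usr/include/asm-generic/errno.h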

First DTRACE steps

  • For DTRACE installation and first steps please read the following article
  • If you have DTRACE already installed, run the commands below and the quick sanity test shown after the provider listing
# modprobe systrace 
Now check the provider
[root@oel65 Dtrace]# dtrace -l 
   ID   PROVIDER            MODULE                          FUNCTION NAME
    1     dtrace                                                     BEGIN
    2     dtrace                                                     END
    3     dtrace                                                     ERROR
    4    syscall           vmlinux                              read entry
    5    syscall           vmlinux                              read return
..
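
If the syscall provider shows up as above, a quick sanity test (just a verification step, not part of check_rac.d) is to trace every program start on the box for a few seconds before moving on to the bigger script; stop it with CTRL-C:

# dtrace -n 'syscall::execve:entry { printf("%s starts %s", execname, copyinstr(arg0)); }'

You should see entries for shells, cron jobs etc., which proves the syscall probes actually fire.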

 

Debug CW startup problems with DTRACE  – Different Case studies

During the testing we will rate 3 different debugging approaches: DTRACE, Cluvfy and reading the tracefiles.
Each approach gets rated with up to 5 stars ( ***** ) depending on how useful it was.
Case I   : Any execve() errors for all executables like:  crsd.bin
DTRACE        : ***
CLUVFY        : *****  [ most useful tool ]
Reading Traces: *
Comment       : Debugging steps are the same for all CW executables
Case II  : Open errors for files HOSTNAME.pid like hract21.pid
DTRACE        : ***** [ most useful tool ]
CLUVFY        :
Reading Traces: *
Comment       : Debugging steps are the same for all CRS components
Case III : Open errors for special socket files like : /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock
DTRACE        : *****  [ most useful tool ]
CLUVFY        :
Reading Traces: **
Comment       : Debugging steps are the same for all CRS components
Case IV  : GPNPD  doesn't start - mismatch between profile.xml and the PRIVATE INTERFACE address
DTRACE        :
CLUVFY        : *****  [ most useful tool ]
Reading Traces: **
Case V   : GIPCD  doesn't start - mismatch between profile.xml and the PUBLIC  INTERFACE address
DTRACE        : ****
CLUVFY        : *****  [ most useful tool ]
Reading Traces: **
Case VI  : CSSD doesn't start as Port 42424 is already in use
DTRACE        : *****  [ most useful tool ]
CLUVFY        :
Reading Traces: *
Case VII : CSSD, GIPCD don't start as Port 5353 is already in use
DTRACE        : *****  [ most useful tool ]
CLUVFY        :
Reading Traces: *
Case VIII: GIPCD, GPNPD, CSSD not starting as the Nameserver is not reachable
          [ ECONNREFUSED    111 -> Connection refused ]
DTRACE        : ***
CLUVFY        : *****  [ most useful tool ]
Reading Traces: *
    • Note : CLUVFY is not always the best choice to debug CW startup problems.
    • DTRACE is an easy-to-use utility and any DBA should use it in their production environment, as it can reduce the time needed to solve errors a lot !!

Prepare DTRACE script check_rac.d  and run it first time

You only need to change the grid location and the pid_file in the BEGIN section:
grid_loc = "/u01/app/121/grid";    <<< Change this
pid_file="hract21.pid"             <<< Change this - replace hract21 with your hostname

As not all OS errors are fatal, run the DTRACE script on a working node first [ the output below includes comments ]:
[root@hract21 DTRACE]# dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 21 probes
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
---> You can ignore the     ENXIO 6    - No such device or address - during opening  /var/tmp/.oracle/npohasd
0     89                   connect:return - Exec: gipcd.bin - PID: 26471  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0     89                   connect:return - Exec: gipcd.bin - PID: 26489  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0     89                   connect:return - Exec: gipcd.bin - PID: 26505  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
---> gipcd.bin successfully connects to our nameserver
0     93                    sendto:return - Exec: mdnsd.bin - PID: 26428  sendto() failed with error : -32 - fd : 46
0     93                    sendto:return - Exec: crsd.bin - PID: 26874  sendto() failed with error : -32 - fd : 210
0    103                      bind:return - Exec: ons - PID: 27044  bind() failed with error : -98 - fd : 9 - IP: 0.0.0.0 - Port: 6200
0     89                   connect:return - Exec: gipcd.bin - PID: 26571  connect() failed with error : -113 - fd : 191 - IP: 192.168.5.123 - Port: 43956
0     89                   connect:return - Exec: gipcd.bin - PID: 26571  connect() failed with error : -113 - fd : 122 - IP: 192.168.5.123 - Port: 43956
0     89                   connect:return - Exec: gipcd.bin - PID: 26571  connect() failed with error : -113 - fd : 122 - IP: 192.168.5.123 - Port: 43956
0     89                   connect:return - Exec: gipcd.bin - PID: 26571  connect() failed with error : -113 - fd : 122 - IP: 192.168.5.123 - Port: 43956
0     89                   connect:return - Exec: gipcd.bin - PID: 26571  connect() failed with error : -113 - fd : 120 - IP: 192.168.5.123 - Port: 43956
0     89                   connect:return - Exec: gipcd.bin - PID: 26571  connect() failed with error : -113 - fd : 121 - IP: 192.168.5.123 - Port: 43956
0     89                   connect:return - Exec: gipcd.bin - PID: 26571  connect() failed with error : -113 - fd : 122 - IP: 192.168.5.123 - Port: 43956
0     89                   connect:return - Exec: gipcd.bin - PID: 26571  connect() failed with error : -113 - fd : 123 - IP: 192.168.5.123 - Port: 43956
--->  gipcd.bin fails to connect to node hract23 with error: EHOSTUNREACH 113 - No route to host - as this node was down
0    123                    execve:return - Exec: oradism - execve()  /u01/app/121/grid/bin/oradism -  Lower CLUSTERWARE stack successfully started - ret code:  0 - EXITING !
---> As soon as CW comes up and ASM runs the executable oradism, we stop the DTRACE script by calling exit() !

Case I: CRSD daemon doesn’t start due to permission problem ( crsd/crsd.bin)

Force that  error and monitor Clusterware Resource status after startup:

The command below prints the status of the lower CRS stack every 2 seconds.
This allows you to track which component keeps restarting and finally fails (a crsctl-based alternative is shown after the output below).
# watch crsi
*****  Local Resources: *****
Resource NAME               INST   TARGET      STATE        SERVER          STATE_DETAILS
--------------------------- ----   ---------- ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE    ONLINE       hract21         STABLE
ora.cluster_interconnect.haip  1   ONLINE    ONLINE       hract21         STABLE
ora.crf                        1   ONLINE    ONLINE       hract21         STABLE
ora.crsd                       1   ONLINE    OFFLINE      -               STABLE
ora.cssd                       1   ONLINE    ONLINE       hract21         STABLE
ora.cssdmonitor                1   ONLINE    ONLINE       hract21         STABLE
ora.ctssd                      1   ONLINE    ONLINE       hract21         OBSERVER,STABLE
ora.diskmon                    1   OFFLINE   OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE    ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE    INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE    ONLINE       hract21         STABLE
ora.gpnpd                      1   ONLINE    ONLINE       hract21         STABLE
ora.mdnsd                      1   ONLINE    ONLINE       hract21         STABLE
ora.storage                    1   ONLINE    ONLINE       hract21         STABLE
--> CRSD doesn't start
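
Note: crsi used above is not a standard Oracle utility but a local helper/alias. If no such wrapper is available, a rough equivalent (an assumption - the column layout will differ slightly) is to watch the init resources directly with crsctl:

# watch -n 2 "$GRID_HOME/bin/crsctl stat res -t -init"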

CaseI-1: Shell script /u01/app/121/grid/bin/crsd with wrong protection
# chmod 000 /u01/app/121/grid/bin/crsd

CLUVFY:
Note: as our CRS stack is not fully running on hract21, we have to run cluvfy from hract22
[grid@hract21 CLUVFY]$ ssh hract22  ~/CLUVFY/bin/cluvfy comp software -n hract21
Verifying software
Check: Software
Component: crs
Node Name: hract21
   /u01/app/121/grid/bin/crsd..."Permissions" did not match reference
3162 files verified
Software check failed

TRACEFILE review :
Use Grep command:
# grep "2015-02-16 09:2" *  | egrep "no exe permission|failed to start process"

alert.log:
2015-02-16 09:25:05.681 [ORAROOTAGENT(16807)]CRS-5013: Agent "ORAROOTAGENT" failed to start process "/u01/app/121/grid/bin/crsd"
for action "start": details at "(:CLSN00008:)" in "/u01/app/grid/diag/crs/hract21/crs/trace/ohasd_orarootagent_root.trc"
2015-02-16 09:25:06.706 [ORAROOTAGENT(16807)]CRS-5013: Agent "ORAROOTAGENT" failed to start process "/u01/app/121/grid/bin/crsd"
for action "start": details at "(:CLSN00008:)" in "/u01/app/grid/diag/crs/hract21/crs/trace/ohasd_orarootagent_root.trc"

/u01/app/grid/diag/crs/hract21/crs/trace/ohasd_orarootagent_root.trc:
2015-02-16 09:25:06.704671 :CLSDYNAM:3169421056: [ora.crsd]{0:0:2} [start] (:CLSN00008:)Utils:execCmd scls_process_spawn() failed 1
2015-02-16 09:25:06.704859 :CLSDYNAM:3169421056: [ora.crsd]{0:0:2} [start] (:CLSN00008:) category: -1, operation: fail, loc: canexec2,
OS error: 0, other: no exe permission, file [/u01/app/121/grid/bin/crsd]
2015-02-16 09:25:06.705358 :CLSDYNAM:3169421056: [ora.crsd]{0:0:2} [start] clsnUtils::error Exception type=2 string=
CRS-5013: Agent "ORAROOTAGENT" failed to start process "/u01/app/121/grid/bin/crsd" for action "start":
details at "(:CLSN00008:)" in "/u01/app/grid/diag/crs/hract21/crs/trace/ohasd_orarootagent_root.trc"

DTRACE :
DTRACE is not helpful here !
--> Using dtrace will not help as the OS error is 0:
   the CW software itself reads the protections of the crsd shell script and stops the clusterware startup if they are not sufficient.

FIX :
Change permissions and restart CRS
[root@hract21 Desktop]#  ls -l    /u01/app/121/grid/bin/crsd
---------- 1 root oinstall 9419 Jan 30 12:26 /u01/app/121/grid/bin/crsd
[root@hract21 Desktop]# chmod 755  /u01/app/121/grid/bin/crsd
[root@hract21 Desktop]# ls -l    /u01/app/121/grid/bin/crsd
-rwxr-xr-x 1 root oinstall 9419 Jan 30 12:26 /u01/app/121/grid/bin/crsd


CaseI-2: Executable /u01/app/121/grid/bin/crsd.bin with wrong protection
#  ls -l  /u01/app/121/grid/bin/crsd.bin
-rwxr----x 1 root oinstall 369207571 Jun 30  2014 /u01/app/121/grid/bin/crsd.bin
# chmod 000 /u01/app/121/grid/bin/crsd.bin

CLUVFY :
Note: as our CRS stack is not fully running on hract21, we have to run cluvfy from hract22
[grid@hract21 CLUVFY]$ ssh hract22  ~/CLUVFY/bin/cluvfy comp software -n hract21
Verifying software
Check: Software
Component: crs
Node Name: hract21
    /u01/app/121/grid/bin/crsd.bin..."Permissions" did not match reference
3162 files verified
Software check failed

TRACEFILE review :
Grep command:
$  grep "2015-02-16 10:3" *  | egrep " fatal | terminating "

alert.log
2015-02-16 10:30:04.075 [OCSSD(29033)]CRS-1713: CSSD daemon is started in hub mode
2015-02-16 10:30:04.271 [OCSSD(29033)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc
2015-02-16 10:30:05.297 [OCSSD(29033)]CRS-1652: Starting clean up of CRSD resources.
2015-02-16 10:30:05.307 [OCSSD(29033)]CRS-1653: The clean up of the CRSD resources failed.
Mon Feb 16 10:30:05 2015
Errors in file /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc  (incident=2897):
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/grid/diag/crs/hract21/crs/incident/incdir_2897/ocssd_i2897.trc

2015-02-16 10:30:05.475 [OCSSD(29033)]CRS-8503: Oracle Clusterware OCSSD process with operating system process ID 29033 experienced fatal signal or exception code 6
Sweep [inc][2897]: completed

--> The above-mentioned tracefiles don't help - so what is the real problem here ?

DTRACE SCRIPT :
/*
Trace all failed execve() calls and stop tracing after oradism has been started, by calling exit()
*/
syscall::execve:entry
{
self->path = copyinstr(arg0);
}

/*
Trace all failed execve() calls on the system.
This allows us to track failed calls for OS utils like awk, sed, ...
*/
syscall::execve:return
/arg0<0 && execname != "crsctl.bin"/
{
printf("- Exec: %s - execve()    %s failed with error: %d", execname, self->path, arg0 );
self->path = 0;
}

/*
Track successful CW start - stop tracing after oradism was started by calling exit()
*/

syscall::execve:return
/execname == "oradism"/
{
printf("- Exec: %s - execve()  %s -  Lower CLUSTERWARE stack successfully started - ret code:  %d - EXITING !", execname, self->path, arg0 );
self->path = 0;
exit(0);
}

DTRACE OUTPUT :
Using our dtrace script will quickly identify the root cause  [ crsd.bin failed due to Permission error ]
[root@hract21 DTRACE]# dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 21 probes
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0     89                   connect:return - Exec: mdnsd.bin - PID: 17355  connect() failed with error : -101 - fd : 39 - IP: 17.17.17.17 - Port: 256
0     89                   connect:return - Exec: gipcd.bin - PID: 17400  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0     89                   connect:return - Exec: gipcd.bin - PID: 17412  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0     89                   connect:return - Exec: gipcd.bin - PID: 17429  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0    123                    execve:return - Exec: crsd - execve()    /u01/app/121/grid/bin/crsd.bin failed with error: -13

FIX :
Change permissions and restart CRS
[root@hract21 Desktop]#  ls -l    /u01/app/121/grid/bin/crsd.bin
---------- 1 root oinstall 369207571 Jun 30  2014 /u01/app/121/grid/bin/crsd.bin
[root@hract21 Desktop]# chmod 741 /u01/app/121/grid/bin/crsd.bin
[root@hract21 Desktop]# ls -l    /u01/app/121/grid/bin/crsd.bin
-rwxr----x 1 root oinstall 369207571 Jun 30  2014 /u01/app/121/grid/bin/crsd.bin

Case II : GIPCD daemon doesn’t start as HOSTNAME.pid file is not readable

Clusterware 12.1.0.2 uses the following HOSTNAME.pid files for reporting the PID.
If CW can't write to such a PID file, the related CW component may not start (a quick permission check is sketched after the list).
/u01/app/121/grid/ohasd/init/hract21.pid
/u01/app/121/grid/osysmond/init/hract21.pid
/u01/app/121/grid/gpnp/init/hract21.pid
/u01/app/121/grid/gipc/init/hract21.pid
/u01/app/121/grid/log/hract21/gpnpd/hract21.pid
/u01/app/121/grid/ctss/init/hract21.pid
/u01/app/121/grid/gnsd/init/hract21.pid
/u01/app/121/grid/crs/init/hract21.pid
/u01/app/121/grid/crf/admin/run/crflogd/lhract21.pid
/u01/app/121/grid/crf/admin/run/crfmond/shract21.pid
/u01/app/121/grid/evm/init/hract21.pid
/u01/app/121/grid/mdns/init/hract21.pid
/u01/app/121/grid/ologgerd/init/hract21.pid
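
A quick way to spot a pid file with broken permissions on the local node (a small sketch - it assumes GRID_HOME is set and that the files follow the naming shown above):

# Flag every *<hostname>.pid file below GRID_HOME whose mode is not 644
for f in $(find $GRID_HOME -name "*$(hostname -s).pid" 2>/dev/null); do
   perm=$(stat -c '%a' $f)
   [ "$perm" != "644" ] && echo "WARNING: $f has mode $perm - expected 644"
done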

Force that error and monitor Clusterware Resource status after startup:
[root@hract21 DTRACE]# ls -l /u01/app/121/grid/gipc/init/hract21.pid
-rw-r--r-- 1 grid oinstall 6 Feb 16 10:30 /u01/app/121/grid/gipc/init/hract21.pid
[root@hract21 DTRACE]# chmod 000  /u01/app/121/grid/gipc/init/hract21.pid

*****  Local Resources: *****
Resource NAME               INST   TARGET       STATE        SERVER          STATE_DETAILS
--------------------------- ----   ------------ ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE       OFFLINE      -               STABLE
ora.cluster_interconnect.haip  1   ONLINE       OFFLINE      -               STABLE
ora.crf                        1   ONLINE       OFFLINE      -               STABLE
ora.crsd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssdmonitor                1   OFFLINE      OFFLINE      -               STABLE
ora.ctssd                      1   ONLINE       OFFLINE      -               STABLE
ora.diskmon                    1   OFFLINE      OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE       ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE       INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE       OFFLINE      hract21         STARTING
ora.gpnpd                      1   ONLINE       INTERMEDIATE hract21         STABLE
ora.mdnsd                      1   ONLINE       ONLINE       hract21         STABLE
ora.storage                    1   ONLINE      OFFLINE      -               STABLE
--> GIPCD doesn't start

CLUVFY:
Found no cluvfy command to detect this error

TRACEFILE review :
alert.log :
Mon Feb 16 12:09:03 2015
Errors in file /u01/app/grid/diag/crs/hract21/crs/trace/gipcd.trc  (incident=2921):
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/grid/diag/crs/hract21/crs/incident/incdir_2921/gipcd_i2921.trc
2015-02-16 12:09:03.181 [GIPCD(14763)]CRS-8503: Oracle Clusterware GIPCD process with operating system process ID 14763
experienced fatal signal or exception code 6 - Sweep [inc][2921]: completed
--> Got no further indication that the file permissions on /u01/app/121/grid/gipc/init/hract21.pid are the root cause

DTRACE SCRIPT:
syscall::open:entry
{
self->path = copyinstr(arg0);
}

syscall::open:return
/arg0<0 && execname!= "crsctl.bin" && substr( self->path,0,grid_len)==  grid_loc &&  strstr(self->path, pid_file ) == pid_file  /
{
printf("- Exec: %s - open() %s failed with error: %d - scan_dir:  %s - PID-File : %s ", execname, self->path, arg0, substr( self->path,0,grid_len), pid_file );
}

DTRACE OUTPUT :
DTrace helps us to find that problem very quickly :
[root@hract21 DTRACE]#  dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 21 probes
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: oraagent.bin - open() /u01/app/121/grid/gipc/init/hract21.pid failed with error: -13 - scan_dir:  /u01/app/121/grid - PID-File : hract21.pid
0     89                   connect:return - Exec: mdnsd.bin - PID: 19658  connect() failed with error : -101 - fd : 39 - IP: 17.17.17.17 - Port: 256
0     89                   connect:return - Exec: gipcd.bin - PID: 19702  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0      9                      open:return - Exec: gipcd.bin - open() /u01/app/121/grid/gipc/init/hract21.pid failed with error: -13 - scan_dir:  /u01/app/121/grid - PID-File : hract21.pid
0      9                      open:return - Exec: gipcd.bin - open() /u01/app/121/grid/gipc/init/hract21.pid failed with error: -13 - scan_dir:  /u01/app/121/grid - PID-File : hract21.pid

FIX :
Change permission and reboot Clusterware :
[root@hract21 DTRACE]# chmod 644 /u01/app/121/grid/gipc/init/hract21.pid
[root@hract21 DTRACE]# ls -l /u01/app/121/grid/gipc/init/hract21.pid
-rw-r--r-- 1 grid oinstall 5 Feb 16 11:37 /u01/app/121/grid/gipc/init/hract21.pid

 

Case III: GPnP fails due to unreadable CW “special” socket files located in /var/tmp/.oracle/

Typically this directory contains a number of "special" socket files that are used by local clients to
connect via the IPC protocol (sqlnet) to various Oracle processes including the TNS listener, the CSS,
CRS & EVM daemons or even  database or ASM instances
Force that  error and monitor Clusterware Resource status after startup:

[root@hract21 DTRACE]# chmod 000  /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock
[root@hract21 DTRACE]#  ls -l  /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock
---------- 1 grid oinstall 0 Feb 15 08:18 /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock

*****  Local Resources: *****
Resource NAME               INST   TARGET       STATE        SERVER          STATE_DETAILS
--------------------------- ----   ------------ ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE       OFFLINE      -               STABLE
ora.cluster_interconnect.haip  1   ONLINE       OFFLINE      -               STABLE
ora.crf                        1   ONLINE       OFFLINE      -               STABLE
ora.crsd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssdmonitor                1   OFFLINE      OFFLINE      -               STABLE
ora.ctssd                      1   ONLINE       OFFLINE      -               STABLE
ora.diskmon                    1   OFFLINE      OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE       ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE       INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE       OFFLINE      -               STABLE
ora.gpnpd                      1   ONLINE       OFFLINE      hract21         STARTING
ora.mdnsd                      1   ONLINE       ONLINE       hract21         STABLE
ora.storage                    1   ONLINE       OFFLINE      -               STABLE
--> GPnPD doesn't start !

CLUVFY:
Found no cluvfy command to detect this error

TRACEFILE review :
alert.log :
2015-02-16 12:37:45.345 [GPNPD(21849)]CRS-8500: Oracle Clusterware GPNPD process is starting with operating system process ID 21849
2015-02-16 12:37:45.388 [GPNPD(21849)]CRS-8501: Oracle Clusterware GPNPD process with operating system process ID 21849 is ending with return value 48
2015-02-16 12:47:45.281 [ORAAGENT(21818)]CRS-5818: Aborted command 'start' for resource 'ora.gpnpd'. Details at (:CRSAGF00113:) {0:0:2}
in /u01/app/grid/diag/crs/hract21/crs/trace/ohasd_oraagent_grid.trc.

ohasd_oraagent_grid.trc:
2015-02-16 12:47:45.837353 :    AGFW:308582144: {0:0:2} Agent sending reply for: RESOURCE_START[ora.gpnpd 1 1] ID 4098:270
2015-02-16 12:47:45.838926 :CLSDYNAM:310683392: [ora.gpnpd]{0:0:2} [start] (:CLSN00107:) clsn_agent::start }
2015-02-16 12:47:49.288144 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [start] got lock
2015-02-16 12:47:49.288165 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [start] tryActionLock }
2015-02-16 12:47:49.288187 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [start] abort  }
2015-02-16 12:47:49.288205 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [start] (:CLSN00110:) clsn_agent::abort }
2015-02-16 12:47:49.288230 :    AGFW:304379648: {0:0:2} Command: start for resource: ora.gpnpd 1 1 completed with status: TIMEDOUT
2015-02-16 12:47:49.288313 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] Utils::getOracleHomeAttrib getEnvVar oracle_home:/u01/app/121/grid
2015-02-16 12:47:49.288361 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] Utils::getOracleHomeAttrib oracle_home:/u01/app/121/grid
2015-02-16 12:47:49.288672 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] PID 20189 from /u01/app/121/grid/gpnp/init/hract21.pid
2015-02-16 12:47:49.288708 :  CLSDMC:304379648: Connecting to ipc://hract21_DBG_GPNPD
2015-02-16 12:47:49.289427 :  CLSDMC:304379648: Error: gipcWait for gipcConnect - ret_gipcreqinfo=gipcretConnectionRefused, type_gipcreqinfo=gipcreqtypeConnect
2015-02-16 12:47:49.290221 :    AGFW:308582144: {0:0:2} Agent sending reply for: RESOURCE_START[ora.gpnpd 1 1] ID 4098:270
2015-02-16 12:47:49.292126 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] ClsdmClient::sendMessage clsdmc_send error rmsg:0 ecode:-7 errbuf:
2015-02-16 12:47:49.292208 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] Calling PID check for daemon
2015-02-16 12:47:49.292269 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] Process id 20189 translated to
2015-02-16 12:47:49.292384 :  CLSDMC:304379648: Connecting to ipc://hract21_DBG_GPNPD
2015-02-16 12:47:49.292604 :  CLSDMC:304379648: Error: gipcWait for gipcConnect - ret_gipcreqinfo=gipcretConnectionRefused, type_gipcreqinfo=gipcreqtypeConnect
2015-02-16 12:47:49.292658 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] ClsdmClient::sendMessage clsdmc_send error rmsg:0 ecode:-7 errbuf:
2015-02-16 12:47:49.293759 :    AGFW:308582144: {0:0:2} ora.gpnpd 1 1 state changed from: STARTING to: OFFLINE
2015-02-16 12:47:49.293871 :    AGFW:308582144: {0:0:2} Agent sending last reply for: RESOURCE_START[ora.gpnpd 1 1] ID 40
--> Got no further indication that the file permissions on /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock are the root cause !

DTRACE SCRIPT :
syscall::open:entry
{
self->path = copyinstr(arg0);
}

syscall::open:return
/arg0<0 && substr( self->path,0,var_tmp_len) == var_tmp_loc &&  substr(execname,0,4)  != "asm_"  /
{
printf("- Exec: %s - open() %s failed with error: %d - scan_dir:  %s ", execname, self->path, arg0, substr( self->path,0,var_tmp_len)   );
}

DTRACE OUTPUT :
Again DTrace helps us to find that problem very quickly :
[root@hract21 DTRACE]# !dt
dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 21 probes
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0     89                   connect:return - Exec: mdnsd.bin - PID: 20575  connect() failed with error : -101 - fd : 39 - IP: 17.17.17.17 - Port: 256
0      9                      open:return - Exec: gpnpd.bin - open() /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock failed with error: -13 - scan_dir:  /var/tmp/.oracle

FIX :
Change permission and reboot Clusterware :
[root@hract21 DTRACE]# chmod 644  /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock
[root@hract21 DTRACE]# ls -l  /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock
-rw-r--r-- 1 grid oinstall 0 Feb 15 08:18 /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock

Case IV:  GPNPD doesn’t start – mismatch between profile.xml and the PRIVATE INTERFACE address

Potential problem:

  • PRIVATE interface was changed without changing profile.xml
Monitor Clusterware Resource status after startup:
*****  Local Resources: *****
Resource NAME               INST   TARGET       STATE        SERVER          STATE_DETAILS
--------------------------- ----   ------------ ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE       OFFLINE      -               STABLE
ora.cluster_interconnect.haip  1   ONLINE    OFFLINE      -               STABLE
ora.crf                        1   ONLINE    ONLINE       hract21         STABLE
ora.crsd                       1   ONLINE    OFFLINE      -               STABLE
ora.cssd                       1   ONLINE    OFFLINE      hract21         STARTING
ora.cssdmonitor                1   ONLINE     ONLINE       hract21         STABLE
ora.ctssd                      1   ONLINE    OFFLINE      -               STABLE
ora.diskmon                    1   ONLINE     OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE    ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE    INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE    ONLINE       hract21         STABLE
ora.gpnpd                      1   ONLINE    INTERMEDIATE hract21         STABLE
ora.mdnsd                      1   ONLINE    ONLINE       hract21         STABLE
ora.storage                    1   ONLINE    OFFLINE      -               STABLE
--> GPnPD daemon does not start

CLUVFY:
Cluvfy fails with PRVG-11050 error
[grid@hract21 CLUVFY]$  ssh hract22   ~/CLUVFY/bin/cluvfy stage -post crsinst -n hract21,hract22
Performing post-checks for cluster services setup
Checking node reachability...
Node reachability check passed from node "hract22"
Checking user equivalence...
User equivalence check passed for user "grid"
Checking node connectivity...
Checking hosts config file...
Verification of the hosts config file successful
ERROR:
PRVG-11050 : No matching interfaces "eth2" for subnet "192.168.2.0" on nodes "hract21"

TRACEFILE review :
alert.log:
2015-02-17 09:42:27.823 [OCSSD(15855)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc
2015-02-17 09:42:27.824 [OCSSD(15855)]CRS-1603: CSSD on node hract21 shutdown by user.
2015-02-17 09:42:27.823 [CSSDAGENT(15844)]CRS-5818: Aborted command 'start' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:0:2} in /u01/app/grid/diag/crs/hract21/crs/trace/ohasd_cssdagent_root.trc.
Tue Feb 17 09:42:32 2015
Errors in file /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc  (incident=2977):
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/grid/diag/crs/hract21/crs/incident/incdir_2977/ocssd_i2977.trc

2015-02-17 09:42:33.019 [OCSSD(15855)]CRS-8503: Oracle Clusterware OCSSD process with operating system process ID 15855 experienced fatal signal or exception code 6
Sweep [inc][2977]: completed
2015-02-17 09:42:38.005 [OHASD(11954)]CRS-2757: Command 'Start' timed out waiting for response from the resource 'ora.cssd'. Details at (:CRSPE00163:) {0:0:2} in /u01/app/grid/diag/crs/hract21/crs/trace/ohasd.trc.

ocssd.trc:
2015-02-17 09:42:32.451021 :    CSSD:2417551104: 
   clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228, wrtcnt, 963949, LATS 92477974, lastSeqNo 963946, uniqueness 1424074596, timestamp 1424162551/21220694
2015-02-17 09:42:32.451113 :    CSSD:2422281984: 
   clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228, wrtcnt, 963950, LATS 92477974, lastSeqNo 963947, uniqueness 1424074596, timestamp 1424162552/21220904
Trace file /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc
Oracle Database 12c Clusterware Release 12.1.0.2.0 - Production Copyright 1996, 2014 Oracle. All rights reserved.
DDE: Flood control is not active
CLSB:2467473152: Oracle Clusterware infrastructure error in OCSSD (OS PID 15855): Fatal signal 6 has occurred in program ocssd thread 2467473152; nested signal count is 1
Incident 2977 created, dump file: /u01/app/grid/diag/crs/hract21/crs/incident/incdir_2977/ocssd_i2977.trc
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
2015-02-17 09:42:33.108629 :    CSSD:2450904832: clssscWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 1000 with cvtimewait status 4294967186
2015-02-17 09:42:33.451785 :    CSSD:2417551104: clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228, wrtcnt, 963952, LATS 92478974, lastSeqNo 963949, uniqueness 1424074596, timestamp 1424162552/21221694
2015-02-17 09:42:33.451933 :    CSSD:2422281984: clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228, wrtcnt, 963953, LATS 92478974, lastSeqNo 963950, uniqueness 1424074596, timestamp 1424162553/21221904
--> Here we know that we have a networking problem

DTRACE OUTPUT:
- In this case DTRACE will not help.
Oracle retrieves the IP addresses via ioctl() and compares them to profile.xml, as the trace below shows:
32373 <... ioctl resumed> 200, {{"lo", {AF_INET, inet_addr("127.0.0.1")}}, {"eth0", {AF_INET, inet_addr("192.168.1.7")}},
{"eth1", {AF_INET, inet_addr("192.168.5.121")}}, {"eth2", {AF_INET, inet_addr("192.168.7.121")}},
{"eth3", {AF_INET, inet_addr("192.168.3.121")}}}}) = 0

Investigate & Fix :
Check profile.xml
[root@hract21 network-scripts]#   $GRID_HOME/bin/gpnptool get 2>/dev/null  |  xmllint --format - | egrep 'CSS-Profile|ASM-Profile|Network id'
<gpnp:HostNetwork id="gen" HostName="*">
<gpnp:Network id="net1" IP="192.168.5.0" Adapter="eth1" Use="public"/>
<gpnp:Network id="net2" IP="192.168.2.0" Adapter="eth2" Use="asm,cluster_interconnect"/>
<orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/>
<orcl:ASM-Profile id="asm" DiscoveryString="/dev/asm*" SPFile="+DATA/ract2/ASMPARAMETERFILE/registry.253.870352347" Mode="remote"/>
-> eth2 is our CI network interface - with 192.168.2.0 as the related NETWORK address

[grid@hract21 trace]$ ping -I eth2 192.168.2.122
Warning: cannot bind to specified iface, falling back: Operation not permitted
PING 192.168.2.122 (192.168.2.122) from 192.168.1.7 eth2: 56(84) bytes of data
--> This tells us we have a problem with our CI !

[root@hract21 network-scripts]# ifconfig eth2
eth2      Link encap:Ethernet  HWaddr 08:00:27:4E:C9:BF
inet addr:192.168.7.121  Bcast:192.168.7.255  Mask:255.255.255.0
inet6 addr: fe80::a00:27ff:fe4e:c9bf/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
--> eth2 is up and running but configured with the wrong network address
Fix :
  Change address for eth2  back to  inet addr:192.168.2.121 and restart network and CW

 

Case V  :  GIPCD  doesn’t start – mismatch between profile.xml and the PUBLIC  INTERFACE address

Potential problem:

  • /etc/hosts and nslookup not in sync
  • PUBLIC interface was changed without changing profile.xml
  • DNS returned a wrong host address
Force that  error and monitor Clusterware Resource status after startup:
*****  Local Resources: *****
Resource NAME               INST   TARGET       STATE        SERVER          STATE_DETAILS
--------------------------- ----   ------------ ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE       OFFLINE      -               STABLE
ora.cluster_interconnect.haip  1   ONLINE       OFFLINE      -               STABLE
ora.crf                        1   ONLINE       ONLINE       hract21         STABLE
ora.crsd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssdmonitor                1   ONLINE       ONLINE       hract21         STABLE
ora.ctssd                      1   ONLINE       OFFLINE      -               STABLE
ora.diskmon                    1   ONLINE       OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE       ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE       INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE       OFFLINE      -               STABLE
ora.gpnpd                      1   ONLINE       ONLINE       hract21         STABLE
ora.mdnsd                      1   ONLINE       ONLINE       hract21         STABLE
ora.storage                    1   ONLINE       OFFLINE      -               STABLE
--> GIPCD doesn't start

CLUVFY:
[grid@hract21 CLUVFY]$ cluvfy  comp nodecon -n hract21,hract22
Verifying node connectivity
ERROR:
PRVF-6006 : unable to reach the IP addresses "hract21,hract22" from the local node
PRKC-1071 : Nodes "hract21,hract22" did not respond to ping in "3" seconds,
PRKN-1035 : Host "hract21" is unreachable
PRKN-1035 : Host "hract22" is unreachable
Verification cannot proceed
Verification of node connectivity was unsuccessful on all the specified nodes.

TRACEFILE review :
gipcd.trc:
2015-02-17 11:48:39.300878 :GIPCXCPT:3369244416:  gipcmodNetworkProcessBind: slos op  :  sgipcnTcpBind
2015-02-17 11:48:39.300880 :GIPCXCPT:3369244416:  gipcmodNetworkProcessBind: slos dep :  Cannot assign requested address (99)
2015-02-17 11:48:39.300882 :GIPCXCPT:3369244416:  gipcmodNetworkProcessBind: slos loc :  bind
2015-02-17 11:48:39.300884 :GIPCXCPT:3369244416:  gipcmodNetworkProcessBind: slos info:  addr '192.168.7.121:0'
2015-02-17 11:48:39.300920 :GIPCXCPT:3369244416:  gipcBindF [gipcInternalEndpoint : gipcInternal.c : 468]: EXCEPTION[ ret gipcretAddressNotAvailable (39) ]
failed to bind endp 0x7fb6a4027990 [0000000000000306] { gipcEndpoint : localAddr 'tcp://192.168.7.121', remoteAddr '', numPend 0,
numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj (nil), sendp 0x7fb6a4033bd0
status 13flags 0x20008000, flags-2 0x0, usrFlags 0x20020 }, addr 0x7fb6a4033070 [000000000000030d] { gipcAddress :
name 'tcp://hract21.example.com', objFlags 0x0, addrFlags 0x4 }, flags 0x20020
2015-02-17 11:48:39.300928 :GIPCXCPT:3369244416:  gipcInternalEndpoint: failed to bind address to endpoint name 'tcp://hract21.example.com',
ret gipcretAddressNotAvailable (39)

Grep Command :
[grid@hract21 trace]$  grep "2015-02-17 11:4" * | egrep 'gipcmodNetworkProcessBind'
gipcd.trc:2015-02-17 11:48:38.129278 :GIPCXCPT:2967607040:  gipcmodNetworkProcessBind: slos op  :  sgipcnTcpBind
gipcd.trc:2015-02-17 11:48:38.129280 :GIPCXCPT:2967607040:  gipcmodNetworkProcessBind: slos dep :  Cannot assign requested address (99)
gipcd.trc:2015-02-17 11:48:38.129281 :GIPCXCPT:2967607040:  gipcmodNetworkProcessBind: slos loc :  bind
gipcd.trc:2015-02-17 11:48:38.129283 :GIPCXCPT:2967607040:  gipcmodNetworkProcessBind: slos info:  addr '192.168.7.121:0'
--> The grep command is quite useful !

DTRACE SCRIPT :
/*
Generic DTRACE script tracking IP-Address and ports for  bind() system calls:
*/
syscall::bind:entry
{
self->fd = arg0;
self->sockaddr =  arg1;
sockaddrp  =(struct sockaddr *)copyin(self->sockaddr, sizeof(struct sockaddr));
s = (char * )sockaddrp;
self->port =  ( unsigned short )(*(s+3)) + ( unsigned short ) ((*(s+2)*256));
self->ip1=*(s+4);
self->ip2=*(s+5);
self->ip3=*(s+6);
self->ip4=*(s+7);
}

/*
Generic DTRACE script tracking failed bind() system calls:
*/
syscall::bind:return
/arg0<0 && execname != "crsctl.bin"/
{
printf("- Exec: %s - PID: %d  bind() failed with error : %d - fd : %d - IP: %d.%d.%d.%d - Port: %d " , execname, pid, arg0, self->fd,
self->ip1, self->ip2, self->ip3, self->ip4,    self->port  );
}

DTRACE OUTPUT :
[root@hract21 DTRACE]# dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 21 probes
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0     89                   connect:return - Exec: mdnsd.bin - PID: 26518  connect() failed with error : -101 - fd : 39 - IP: 17.17.17.17 - Port: 256
0    103                      bind:return - Exec: gipcd.bin - PID: 26658  bind() failed with error : -99 - fd : 87 - IP: 192.168.7.121 - Port: 0
0    103                      bind:return - Exec: gipcd.bin - PID: 26696  bind() failed with error : -99 - fd : 87 - IP: 192.168.7.121 - Port: 0
0    103                      bind:return - Exec: gipcd.bin - PID: 26722  bind() failed with error : -99 - fd : 87 - IP: 192.168.7.121 - Port: 0
0    103                      bind:return - Exec: gipcd.bin - PID: 26740  bind() failed with error : -99 - fd : 87 - IP: 192.168.7.121 - Port: 0
0    103                      bind:return - Exec: gipcd.bin - PID: 26757  bind() failed with error : -99 - fd : 87 - IP: 192.168.7.121 - Port: 0

Investigate & Fix :
Check profile.xml
[root@hract21 network-scripts]#   $GRID_HOME/bin/gpnptool get 2>/dev/null  |  xmllint --format - | egrep 'CSS-Profile|ASM-Profile|Network id'
<gpnp:HostNetwork id="gen" HostName="*">
<gpnp:Network id="net1" IP="192.168.5.0" Adapter="eth1" Use="public"/>
<gpnp:Network id="net2" IP="192.168.2.0" Adapter="eth2" Use="asm,cluster_interconnect"/>
<orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/>
<orcl:ASM-Profile id="asm" DiscoveryString="/dev/asm*" SPFile="+DATA/ract2/ASMPARAMETERFILE/registry.253.870352347" Mode="remote"/>
-> eth1 is our PUBLIC network interface - with 192.168.5.0 as the related NETWORK address

[root@hract21 Desktop]# ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 08:00:27:7D:8E:49
inet addr:192.168.5.121  Bcast:192.168.5.255  Mask:255.255.255.0
inet6 addr: fe80::a00:27ff:fe7d:8e49/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
--> ifconfig looks good but why is gipcd.bin picking up  192.168.7.121 ?
[root@hract21 Desktop]# ping hract21
PING hract21 (192.168.7.121) 56(84) bytes of data.
--> ping uses wrong address too and hangs

[root@hract21 Desktop]# grep hract21 /etc/hosts
192.168.7.121 hract21 hract21.example.com

FIX --> Modify hostname entry in /etc/hosts

 

Case VI : CSSD doesn’t start as Port 42424 is already in use

Force that  error and monitor Clusterware Resource status after startup:
[root@hract21 JAVA]#   java JavaUDPServer 42424
Listening on UDP Port: 42424
..
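
The JavaUDPServer program is just a convenient way to occupy the port - any UDP listener will do. If the Java class is not at hand, something like the following works as well (an assumption, not the author's tool; the exact flags depend on your netcat flavor):

# nc -u -l 42424 &        ( traditional netcat may need:  nc -u -l -p 42424 & )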
*****  Local Resources: *****
Resource NAME               INST   TARGET    STATE        SERVER          STATE_DETAILS
--------------------------- ----   ------------ ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE    OFFLINE      -               STABLE
ora.cluster_interconnect.haip  1   ONLINE    OFFLINE      -               STABLE
ora.crf                        1   ONLINE    ONLINE       hract21         STABLE
ora.crsd                       1   ONLINE    OFFLINE      -               STABLE
ora.cssd                       1   ONLINE    OFFLINE      hract21         STARTING
ora.cssdmonitor                1   ONLINE    ONLINE       hract21         STABLE
ora.ctssd                      1   ONLINE    OFFLINE      -               STABLE
ora.diskmon                    1   ONLINE    OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE    ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE    INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE    ONLINE       hract21         STABLE
ora.gpnpd                      1   ONLINE    INTERMEDIATE hract21         STABLE
ora.mdnsd                      1   ONLINE    ONLINE       hract21         STABLE
ora.storage                    1   ONLINE    OFFLINE      -               STABLE
--> GPNPD, CSSD daemons don't start

CLUVFY:
Note: Was not able to identify the problem with the following cluvfy command:
[grid@hract21 CLUVFY]$ ssh hract22 cluvfy stage -pre crsinst -n hract21 -networks eth1:192.168.5.0:PUBLIC/eth2:192.168.2.0:cluster_interconnect -verbose

TRACEFILE review :
ocssd.trc:
2015-02-17 12:23:48.485840 :GIPCXCPT:884037376: gipcmodNetworkProcessBind: failed to bind endp 0x7f4f28237090 [0000000000002bcf] { gipcEndpoint : localAddr 'mcast://224.0.0.251:42424/192.168.2.121', remoteAddr '', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj (nil), sendp 0x7f4f282391b0 status 13flags 0x20000000, flags-2 0x0, usrFlags 0xc000 }, addr 0x7f4f28237b50 [0000000000002bd1] { gipcAddress : name 'mcast://224.0.0.251:42424/192.168.2.121', objFlags 0x0, addrFlags 0x5 }
2015-02-17 12:23:48.485840 :GIPCXCPT:884037376: gipcmodNetworkProcessBind: slos op  :  sgipcnMctBind
2015-02-17 12:23:48.485840 :GIPCXCPT:884037376: gipcmodNetworkProcessBind: slos dep :  Address already in use (98)
2015-02-17 12:23:48.485840 :GIPCXCPT:884037376: gipcmodNetworkProcessBind: slos loc :  bind
2015-02-17 12:23:48.485840 :GIPCXCPT:884037376: gipcmodNetworkProcessBind: slos info:  Invalid argument
GREP command :
$ grep "2015-02-17 12:" * | egrep 'gipcmodNetworkProcessBind'


DTRACE SCRIPT :
syscall::bind:entry
{
self->fd = arg0;
self->sockaddr =  arg1;
sockaddrp  =(struct sockaddr *)copyin(self->sockaddr, sizeof(struct sockaddr));
s = (char * )sockaddrp;
self->port =  ( unsigned short )(*(s+3)) + ( unsigned short ) ((*(s+2)*256));
self->ip1=*(s+4);
self->ip2=*(s+5);
self->ip3=*(s+6);
self->ip4=*(s+7);
}

/*
Generic DTRACE script tracking failed bind() system calls:
*/
syscall::bind:return
/arg0<0 && execname != "crsctl.bin"/
{
printf("- Exec: %s - PID: %d  bind() failed with error : %d - fd : %d - IP: %d.%d.%d.%d - Port: %d " , execname, pid, arg0, self->fd,
self->ip1, self->ip2, self->ip3, self->ip4,    self->port  );
}

DTRACE OUTPUT :
[root@hract21 DTRACE]# dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 19 probes
dtrace: buffer size lowered to 1m
dtrace: dynamic variable size lowered to 1m
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0    103                      bind:return - Exec: ocssd.bin - PID: 13064  bind() failed with error : -98 - fd : 211 - IP: 224.0.0.251 - Port: 42424
..
0     89                   connect:return - Exec: gipcd.bin - PID: 12957  connect() - fd : 27 - IP: 192.168.5.50 - Port: 53
0     89                   connect:return - Exec: gipcd.bin - PID: 12957  connect() - fd : 87 - IP: 192.168.5.50 - Port: 53
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 76 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 114 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 124 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 125 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: ocssd.bin - PID: 13064  bind() failed with error : -98 - fd : 133 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 126 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: ocssd.bin - PID: 13064  bind() failed with error : -98 - fd : 136 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 150 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: ocssd.bin - PID: 13064  bind() failed with error : -98 - fd : 137 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 151 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: ocssd.bin - PID: 13064  bind() failed with error : -98 - fd : 138 - IP: 224.0.0.251 - Port: 42424

Investigate & Fix :
[root@hract21 Desktop]#  netstat -taupen | egrep ":53 |:5353 |:42424"
udp        0      0 0.0.0.0:5353      501        35360021   12828/mdnsd.bin
udp        0      0 0.0.0.0:5353      501        35360016   12828/mdnsd.bin
udp        0      0 0.0.0.0:5353      501        35360015   12828/mdnsd.bin
udp        0      0 0.0.0.0:5353      501        35360014   12828/mdnsd.bin
udp   264064      0 0.0.0.0:42424       0          35356639   12631/java
--> The clusterware port 42424 is used by a java program with PID 12631

FIX -->  Kill that process:
# kill -9 12631
After that the Clusterware stack should start immediately.
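
If lsof or fuser is installed you can double-check which process really owns the port before killing it:

[root@hract21 Desktop]# lsof -i UDP:42424
[root@hract21 Desktop]# fuser -v -n udp 42424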

 

Case VII :  MDNSD doesn’t start as Port 5353 is already in use

Status of the lower Clusterware stack resources after startup:

*****  Local Resources: *****
Resource NAME               INST   TARGET       STATE        SERVER          STATE_DETAILS
--------------------------- ----   ------------ ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE       OFFLINE      -               STABLE
ora.cluster_interconnect.haip  1   ONLINE       OFFLINE      -               STABLE
ora.crf                        1   ONLINE       ONLINE       hract21         STABLE
ora.crsd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssdmonitor                1   ONLINE       ONLINE       hract21         STABLE
ora.ctssd                      1   ONLINE       OFFLINE      -               STABLE
ora.diskmon                    1   ONLINE       OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE       ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE       INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE       OFFLINE      -               STABLE
ora.gpnpd                      1   ONLINE       ONLINE       hract21         STABLE
ora.mdnsd                      1   ONLINE       INTERMEDIATE hract21         STABLE
ora.storage                    1   ONLINE       OFFLINE      -               STABLE
--> MDNSD doesn't start 
GREP command:
[grid@hract21 trace]$    grep "2015-02-17 12:4" * | egrep 'Address already in use'
mdnsd.trc:
2015-02-17 12:43:26.211079 :  CLSDMT:2281699072: PID for the Process [19764], connkey 9
2015-02-17 12:43:27.193282 :    MDNS:2353129024:  mdnsd interface eth0 (0x2 AF=2 f=0x1043 mcast=-1) 
                                192.168.1.7 mask 255.255.255.0 FAILED. Error 98 (Address already in use)
2015-02-17 12:43:27.194932 :    MDNS:2353129024:  mdnsd interface eth1 (0x3 AF=2 f=0x1043 mcast=-1) 192.168.5.121 mask 255.255.255.0 FAILED. Error 98 (Address already in use)
2015-02-17 12:43:27.194986 :    MDNS:2353129024:  mdnsd interface eth2 (0x4 AF=2 f=0x1043 mcast=-1) 
                                192.168.2.121 mask 255.255.255.0 FAILED. Error 98 (Address already in use)
2015-02-17 12:43:27.198670 :    MDNS:2353129024:  mdnsd interface eth3 (0x5 AF=2 f=0x1043 mcast=-1) 
                                192.168.3.121 mask 255.255.255.0 FAILED. Error 98 (Address already in use)
2015-02-17 12:43:27.198723 :    MDNS:2353129024:  mdnsd interface lo (0x1 AF=2 f=0x49 mcast=-1) 127.0.0.1 mask 255.0.0.0 FAILED. Error 98 (Address already in use)
2015-02-17 12:43:27.198726 :    MDNS:2353129024:  Error! No valid netowrk interfaces found to setup mDNS.
2015-02-17 12:43:27.198729 :    MDNS:2353129024:  Oracle mDNSResponder ver. mDNSResponder-1076 (Jun 30 2014 19:39:45) , init_rv=-65537
2015-02-17 12:43:27.198818 :    MDNS:2353129024:  stopping

CLUVFY :
The following cluvfy command doesn't detect the problem:
[grid@hract21 CLUVFY]$  ssh hract22 cluvfy stage -pre crsinst -n hract21,hract22 -networks eth1:192.168.5.0:PUBLIC/eth2:192.168.2.0:cluster_interconnect

DTRACE SCRIPT :
syscall::bind:entry
{
self->fd = arg0;
self->sockaddr =  arg1;
sockaddrp  =(struct sockaddr *)copyin(self->sockaddr, sizeof(struct sockaddr));
s = (char * )sockaddrp;
self->port =  ( unsigned short )(*(s+3)) + ( unsigned short ) ((*(s+2)*256));
self->ip1=*(s+4);
self->ip2=*(s+5);
self->ip3=*(s+6);
self->ip4=*(s+7);
}

/*
Generic DTRACE script tracking failed bind() system calls:
*/
syscall::bind:return
/arg0<0 && execname != "crsctl.bin"/
{
printf("- Exec: %s - PID: %d  bind() failed with error : %d - fd : %d - IP: %d.%d.%d.%d - Port: %d " , execname, pid, arg0, self->fd,
self->ip1, self->ip2, self->ip3, self->ip4,    self->port  );
}

DTRACE OUTPUT :
[root@hract21 DTRACE]# dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 19 probes
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0     93                    sendto:return - Exec: ohasd.bin - PID: 17321  sendto() failed with error : -32 - fd : 173
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0    103                      bind:return - Exec: mdnsd.bin - PID: 18943  bind() failed with error : -98 - fd : 33 - IP: 0.0.0.0 - Port: 5353
0    103                      bind:return - Exec: mdnsd.bin - PID: 18943  bind() failed with error : -98 - fd : 34 - IP: 0.0.0.0 - Port: 5353
0    103                      bind:return - Exec: mdnsd.bin - PID: 18943  bind() failed with error : -98 - fd : 34 - IP: 0.0.0.0 - Port: 5353
0    103                      bind:return - Exec: mdnsd.bin - PID: 18943  bind() failed with error : -98 - fd : 34 - IP: 0.0.0.0 - Port: 5353
0    103                      bind:return - Exec: mdnsd.bin - PID: 18943  bind() failed with error : -98 - fd : 34 - IP: 0.0.0.0 - Port: 5353
0    103                      bind:return - Exec: mdnsd.bin - PID: 18943  bind() failed with error : -98 - fd : 34 - IP: 0.0.0.0 - Port: 5353

Investigate & Fix :
[root@hract21 network-scripts]#  netstat -taupen | egrep ":53 |:5353 |:42424"
udp        0      0 0.0.0.0:5353          0  36230279   18804/ohasd.bin
udp        0      0 230.0.1.0:42424       0  36230278   18804/ohasd.bin
udp        0      0 224.0.0.251:42424     0  35356639   12631/java
--> The clusterware port 5353 is used by a java program with PID 17263

FIX -->  Kill that process and restart the Clusterware stack:
# kill -9 17263
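
To catch such a port conflict before starting the stack you may run a small pre-start check. The helper below is just a sketch ( the name chk_cw_ports.sh and the port list are assumptions - adjust them to your setup ):

#!/bin/bash
# chk_cw_ports.sh : report any process already bound to the well-known Clusterware ports
# Run this before : crsctl start crs
for p in 53 5353 42424 ; do
    echo "== Port $p =="
    netstat -taupen | grep ":$p "
done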

Case VIII: GIPCD, GPNPD, CSSD not starting as the Nameserver is not reachable

  • OS Error [ ECONNREFUSED    111 -> Connection refused ]
Force that error and monitor the Clusterware resource status after startup:
On the nameserver run:
[root@ns1 ~]# service named stop
Stopping named: .                                          [  OK  ]
*****  Local Resources: *****
Resource NAME               INST   TARGET    STATE        SERVER          STATE_DETAILS
--------------------------- ----   ------------ ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE    OFFLINE      -               STABLE
ora.cluster_interconnect.haip  1   ONLINE    OFFLINE      -               STABLE
ora.crf                        1   ONLINE    ONLINE       hract21         STABLE
ora.crsd                       1   ONLINE    OFFLINE      -               STABLE
ora.cssd                       1   ONLINE    OFFLINE      hract21         STARTING
ora.cssdmonitor                1   ONLINE    ONLINE       hract21         STABLE
ora.ctssd                      1   ONLINE    OFFLINE      -               STABLE
ora.diskmon                    1   ONLINE    OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE    ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE    INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE    OFFLINE      -               STABLE
ora.gpnpd                      1   ONLINE    INTERMEDIATE hract21         STABLE
ora.mdnsd                      1   ONLINE    ONLINE       hract21         STABLE
ora.storage                    1   ONLINE    OFFLINE      -               STABLE
--> GIPCD, GPNPD, CSSD are not starting

CLUVFY :
[grid@hract21 trace]$ ~/CLUVFY/bin/cluvfy -version
PRVF-0002 : Could not retrieve local nodename

TRACEFILE review :
Grep the trace files for resolver errors [ OS function:  getaddrinfo() ]
[grid@hract21 trace]$ grep "2015-02-17 14:1" * | grep gipcmodNetworkResolve
gipcd.trc:
2015-02-17 14:13:36.137197 :GIPCXCPT:2309576448:  gipcInternalEndpoint: failed to bind address to endpoint name 'tcp://hract21.example.com', ret gipcretFail (1)
2015-02-17 14:13:41.141266 :GIPCXCPT:2309576448:  gipcmodNetworkResolve: failed to create new address for osName 'hract21.example.com', name 'tcp://hract21.example.com'
2015-02-17 14:13:41.141285 :GIPCXCPT:2309576448:  gipcmodNetworkResolve: slos op  :  sgipcnPopulateAddrInfo
2015-02-17 14:13:41.141289 :GIPCXCPT:2309576448:  gipcmodNetworkResolve: slos dep :  Connection refused (111)
2015-02-17 14:13:41.141293 :GIPCXCPT:2309576448:  gipcmodNetworkResolve: slos loc :  getaddrinfo(
2015-02-17 14:13:41.141297 :GIPCXCPT:2309576448:  gipcmodNetworkResolve: slos info:  server not available,try again
2015-02-17 14:13:41.141342 :GIPCXCPT:2309576448:  gipcResolveF [gipcInternalBind : gipcInternal.c : 537]: EXCEPTION[ ret gipcretFail (1) ]  
          failed to resolve address 0x7fd764033be0 [0000000000000310] 
          { gipcAddress : name 'tcp://hract21.example.com', objFlags 0x0, addrFlags 0x8 }, flags 0x4000
2015-02-17 14:13:41.141365 :GIPCXCPT:2309576448:  gipcBindF [gipcInternalEndpoint : gipcInternal.c : 468]: EXCEPTION[ ret gipcretFail (1) ]  failed to bind endp 0x7fd764033070 [000000000000030e] { gipcEndpoint : localAddr 'tcp://hract21.example.com', remoteAddr '', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj (nil), sendp (nil) status 13flags 0x40008000, flags-2 0x0, usrFlags 0x240a0 }, addr 0x7fd764034890 [0000000000000315] { gipcAddress : name 'tcp://hract21.example.com', objFlags 0x0, addrFlags 0x8 }, flags 0x200a0

DTRACE SCRIPT helper:
Use strace to get an idea of how to write a working DTRACE script.
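For example you may attach strace to the already running gipcd.bin process. This is only a sketch - the output file name and the PID placeholder are assumptions, the exact invocation used for the trace below is not shown:

[root@hract21 DTRACE]# strace -f -o /tmp/gipcd.strace -p <pid of gipcd.bin>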
22752 connect(27, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.5.50")}, 16 <unfinished ...>
22750 <... ioctl resumed> , {ifr_name="eth2", ifr_broadaddr={AF_INET, inet_addr("192.168.2.255")}}) = 0
22752 <... connect resumed> )           = 0
22750 ioctl(28, SIOCGIFFLAGS <unfinished ...>
22752 poll([{fd=27, events=POLLOUT}], 1, 0 <unfinished ...>
22750 <... ioctl resumed> , {ifr_name="eth3", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 0
22752 <... poll resumed> )              = 1 ([{fd=27, revents=POLLOUT}])
22750 ioctl(28, SIOCGIFADDR <unfinished ...>
22752 sendto(27, "\320X\1\0\0\1\0\0\0\0\0\0\7hract21\7example\3com"..., 37, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
22750 <... ioctl resumed> , {ifr_name="eth3", ifr_addr={AF_INET, inet_addr("192.168.3.121")}}) = 0
22752 <... sendto resumed> )            = 37
22750 ioctl(28, SIOCGIFNETMASK <unfinished ...>
22752 poll([{fd=27, events=POLLIN|POLLOUT}], 1, 5000 <unfinished ...>
22750 <... ioctl resumed> , {ifr_name="eth3", ifr_netmask={AF_INET, inet_addr("255.255.255.0")}}) = 0
22752 <... poll resumed> )              = 1 ([{fd=27, revents=POLLOUT}])
22750 ioctl(28, SIOCGIFBRDADDR <unfinished ...>
22752 sendto(27, "\16\227\1\0\0\1\0\0\0\0\0\0\7hract21\7example\3com"..., 37, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
22750 <... ioctl resumed> , {ifr_name="eth3", ifr_broadaddr={AF_INET, inet_addr("192.168.3.255")}}) = 0
22752 <... sendto resumed> )            = -1 ECONNREFUSED (Connection refused)

--> The connect() call on fd=27 works - parameter 2 of our connect() call holds the IP address.
    The following sendto() call ( sendto(27,.. ) fails with error ECONNREFUSED.
    To select the right sendto() call you need to use the PID ( 22752 ) and the file descriptor fd=27 ( sendto(27, .. )

Requirements for the DTRACE script:
- Collect the IP address from the preceding connect() call ( we need to trace all connect() calls )
- Trace the sendto() call for errors like ECONNREFUSED
- Use the file descriptor ( fd=27 ) to tie the connect() call to the sendto() call
- Always attach strace to the gipcd process first to verify whether your Oracle version
  executes the same system calls in the same order

DTRACE SCRIPT :
/*
connect:entry records the fd and decodes the peer sockaddr_in ( same byte layout
as in the bind:entry probe above ). Note: ns_ip_port ( the nameserver port, 53 )
must be defined elsewhere in check_rac.d - it is not shown in this excerpt.
*/
syscall::connect:entry
{
self->fd = arg0;
self->sockaddr = arg1;
sockaddrp = (struct sockaddr *)copyin(self->sockaddr, sizeof(struct sockaddr));
s = (char *)sockaddrp;
self->port = ( unsigned short )(*(s+3)) + ( unsigned short )((*(s+2)*256));
self->ip1 = *(s+4);
self->ip2 = *(s+5);
self->ip3 = *(s+6);
self->ip4 = *(s+7);
}

/* Report connect() calls to the nameserver port and any failed sendto() */
syscall::connect:return
/self->port == ns_ip_port && execname != "crsctl.bin" /
{
printf("- Exec: %s - PID: %d  connect() - fd : %d - IP: %d.%d.%d.%d - Port: %d " , execname, pid, self->fd,
self->ip1, self->ip2, self->ip3, self->ip4, self->port  );
}

syscall::sendto:entry
/execname != "crsctl.bin" /
{
self->fds = arg0;
}

syscall::sendto:return
/arg0<0 &&  execname != "crsctl.bin"  /
{
printf("- Exec: %s - PID: %d  sendto() failed with error : %d - fd : %d " , execname, pid, arg0, self->fds );
}

DTRACE OUTPUT :
[root@hract21 DTRACE]# !dt
dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 21 probes
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0     93                    sendto:return - Exec: orarootagent.bi - PID: 29204  sendto() failed with error : -111 - fd : 15
0     93                    sendto:return - Exec: oraagent.bin - PID: 29308  sendto() failed with error : -111 - fd : 15
0     93                    sendto:return - Exec: oraagent.bin - PID: 29308  sendto() failed with error : -111 - fd : 15
0     89                   connect:return - Exec: gipcd.bin - PID: 29363  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0     93                    sendto:return - Exec: gipcd.bin - PID: 29363  sendto() failed with error : -111 - fd : 27
0     89                   connect:return - Exec: gipcd.bin - PID: 29363  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0     93                    sendto:return - Exec: mdnsd.bin - PID: 29320  sendto() failed with error : -111 - fd : 7
0     93                    sendto:return - Exec: gpnpd.bin - PID: 29343  sendto() failed with error : -111 - fd : 15
--> In this sample gipcd.bin is failing to communicate with the nameserver.
    The failed system call is sendto() - error ECONNREFUSED (111) - Connection refused - following a
    successful connect() system call.
    Note: The file descriptor fd=27 shows that the connect() and sendto() system calls operate on the same socket.

Investigate & Fix
[root@hract21 network-scripts]# ping ns1.example.com
ping: unknown host ns1.example.com

--> Fix : Restart your nameserver and check the nameserver IP address/port
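
Once named is running again you can quickly verify name resolution directly against the nameserver. The IP 192.168.5.50 is taken from the DTrace connect() output above - adjust it to your setup:

[root@hract21 network-scripts]# dig @192.168.5.50 hract21.example.com +short
[root@hract21 network-scripts]# nslookup hract21.example.com 192.168.5.50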

 
