Troubleshooting Clusterware startup problems with DTRACE

Case II : GIPCD daemon doesn't start because the HOSTNAME.pid file is not readable

Clusterware 12.1.0.2 uses the following HOSTNAME.pid files to record daemon PIDs.
If Clusterware can't read or write one of these PID files, the related component may not start:
/u01/app/121/grid/ohasd/init/hract21.pid
/u01/app/121/grid/osysmond/init/hract21.pid
/u01/app/121/grid/gpnp/init/hract21.pid
/u01/app/121/grid/gipc/init/hract21.pid
/u01/app/121/grid/log/hract21/gpnpd/hract21.pid
/u01/app/121/grid/ctss/init/hract21.pid
/u01/app/121/grid/gnsd/init/hract21.pid
/u01/app/121/grid/crs/init/hract21.pid
/u01/app/121/grid/crf/admin/run/crflogd/lhract21.pid
/u01/app/121/grid/crf/admin/run/crfmond/shract21.pid
/u01/app/121/grid/evm/init/hract21.pid
/u01/app/121/grid/mdns/init/hract21.pid
/u01/app/121/grid/ologgerd/init/hract21.pid
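A quick way to catch this class of problem before starting the stack is to print the mode and owner of each PID file so an unreadable one stands out. A minimal sketch (the helper name and the temp-file demo are illustrative, not from the original post; `stat -c` is GNU/Linux stat):

```shell
# check_pid_mode: print octal mode, owner and name of a PID file.
# A mode of 0 (or anything the grid owner can't read) explains a startup hang.
check_pid_mode() {
    stat -c '%a %U %n' "$1"
}

# Demo on a throwaway file instead of the real GRID_HOME paths:
tmp=$(mktemp)
chmod 000 "$tmp"
check_pid_mode "$tmp"     # mode column shows 0
chmod 644 "$tmp"
check_pid_mode "$tmp"     # mode column shows 644
rm -f "$tmp"
```

On the cluster node itself you would run the helper against each file in the list above.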

Create the error and monitor the Clusterware resource status after startup:
[root@hract21 DTRACE]# ls -l /u01/app/121/grid/gipc/init/hract21.pid
-rw-r--r-- 1 grid oinstall 6 Feb 16 10:30 /u01/app/121/grid/gipc/init/hract21.pid
[root@hract21 DTRACE]# chmod 000  /u01/app/121/grid/gipc/init/hract21.pid

*****  Local Resources: *****
Resource NAME               INST   TARGET       STATE        SERVER          STATE_DETAILS
--------------------------- ----   ------------ ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE       OFFLINE      -               STABLE
ora.cluster_interconnect.haip  1   ONLINE       OFFLINE      -               STABLE
ora.crf                        1   ONLINE       OFFLINE      -               STABLE
ora.crsd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssdmonitor                1   OFFLINE      OFFLINE      -               STABLE
ora.ctssd                      1   ONLINE       OFFLINE      -               STABLE
ora.diskmon                    1   OFFLINE      OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE       ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE       INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE       OFFLINE      hract21         STARTING
ora.gpnpd                      1   ONLINE       INTERMEDIATE hract21         STABLE
ora.mdnsd                      1   ONLINE       ONLINE       hract21         STABLE
ora.storage                    1   ONLINE       OFFLINE      -               STABLE
--> GIPCD doesn't start
CLUVFY:
cluvfy provides no check that detects this error.

TRACEFILE review:
alert.log :
Mon Feb 16 12:09:03 2015
Errors in file /u01/app/grid/diag/crs/hract21/crs/trace/gipcd.trc  (incident=2921):
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/grid/diag/crs/hract21/crs/incident/incdir_2921/gipcd_i2921.trc
2015-02-16 12:09:03.181 [GIPCD(14763)]CRS-8503: Oracle Clusterware GIPCD process with operating system process ID 14763
experienced fatal signal or exception code 6 - Sweep [inc][2921]: completed
--> No further indication that the file permissions on /u01/app/121/grid/gipc/init/hract21.pid are the root cause

DTRACE SCRIPT:
Only the open() probes of check_rac.d are shown here; the full script also traces connect() (probe 89 in the output below). The BEGIN clause is reconstructed from the BEGIN line in the trace output; it initializes the string variables the predicate relies on:

BEGIN
{
    grid_loc = "/u01/app/121/grid";   /* GRID_HOME to watch */
    grid_len = strlen(grid_loc);
    pid_file = "hract21.pid";
}

syscall::open:entry
{
    self->path = copyinstr(arg0);
}

/* Report failed open() calls on our PID file below GRID_HOME,
   ignoring the probing done by crsctl.bin itself */
syscall::open:return
/arg0 < 0 && execname != "crsctl.bin" &&
 substr(self->path, 0, grid_len) == grid_loc &&
 strstr(self->path, pid_file) == pid_file/
{
    printf("- Exec: %s - open() %s failed with error: %d - scan_dir: %s - PID-File : %s",
           execname, self->path, arg0, substr(self->path, 0, grid_len), pid_file);
}
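The predicate on open:return reads: fire only for failed opens (arg0 < 0), skip crsctl.bin's own probing, and require the path to start with GRID_HOME and end in the PID file name. Roughly the same filter, sketched in shell for illustration (the `matches` helper is hypothetical):

```shell
grid_loc=/u01/app/121/grid
pid_file=hract21.pid

# Mimic the D predicate: path must start with $grid_loc and end in $pid_file.
matches() {
    case "$1" in
        "$grid_loc"/*"$pid_file") echo match ;;
        *)                        echo skip  ;;
    esac
}

matches /u01/app/121/grid/gipc/init/hract21.pid   # match
matches /var/tmp/.oracle/npohasd                  # skip
```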

DTRACE OUTPUT:
DTrace pinpoints the problem quickly:
[root@hract21 DTRACE]#  dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 21 probes
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: oraagent.bin - open() /u01/app/121/grid/gipc/init/hract21.pid failed with error: -13 - scan_dir:  /u01/app/121/grid - PID-File : hract21.pid
0     89                   connect:return - Exec: mdnsd.bin - PID: 19658  connect() failed with error : -101 - fd : 39 - IP: 17.17.17.17 - Port: 256
0     89                   connect:return - Exec: gipcd.bin - PID: 19702  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0      9                      open:return - Exec: gipcd.bin - open() /u01/app/121/grid/gipc/init/hract21.pid failed with error: -13 - scan_dir:  /u01/app/121/grid - PID-File : hract21.pid
0      9                      open:return - Exec: gipcd.bin - open() /u01/app/121/grid/gipc/init/hract21.pid failed with error: -13 - scan_dir:  /u01/app/121/grid - PID-File : hract21.pid
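open() returns -errno in these probes, so the -13 on the PID file and the -6 on the npohasd pipe map to EACCES and ENXIO. A quick decode (assuming a Linux box with python3 available):

```shell
# Translate the errno values from the DTrace output into messages.
python3 -c 'import os; print(os.strerror(13))'   # EACCES -> the unreadable PID file
python3 -c 'import os; print(os.strerror(6))'    # ENXIO  -> the npohasd named pipe
```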

FIX:
Restore the permissions and restart Clusterware:
[root@hract21 DTRACE]# chmod 644 /u01/app/121/grid/gipc/init/hract21.pid
[root@hract21 DTRACE]# ls -l /u01/app/121/grid/gipc/init/hract21.pid
-rw-r--r-- 1 grid oinstall 5 Feb 16 11:37 /u01/app/121/grid/gipc/init/hract21.pid
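The restart itself is not shown above; on 12c the usual sequence is a crsctl stop/start of the stack as root. Sketched with echo so it is safe to paste anywhere; drop the echo on a real node:

```shell
# Restart sequence (run as root; GRID_HOME as used throughout this post).
GRID_HOME=/u01/app/121/grid
echo "$GRID_HOME/bin/crsctl stop crs -f"
echo "$GRID_HOME/bin/crsctl start crs"
```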
