Troubleshooting Clusterware startup problems with DTRACE

Case I: CRSD daemon doesn’t start due to permission problem ( crsd/crsd.bin)

Force that  error and monitor Clusterware Resource status after startup:

The command below prints the status of the lower CRS stack every 2 seconds
This will allow you to track which compomet is starting again and finally failing 
# watch crsi
*****  Local Resources: *****
Resource NAME               INST   TARGET      STATE        SERVER          STATE_DETAILS
--------------------------- ----   ---------- ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE    ONLINE       hract21         STABLE
ora.cluster_interconnect.haip  1   ONLINE    ONLINE       hract21         STABLE
ora.crf                        1   ONLINE    ONLINE       hract21         STABLE
ora.crsd                       1   ONLINE    OFFLINE      -               STABLE
ora.cssd                       1   ONLINE    ONLINE       hract21         STABLE
ora.cssdmonitor                1   ONLINE    ONLINE       hract21         STABLE
ora.ctssd                      1   ONLINE    ONLINE       hract21         OBSERVER,STABLE
ora.diskmon                    1   OFFLINE   OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE    ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE    INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE    ONLINE       hract21         STABLE
ora.gpnpd                      1   ONLINE    ONLINE       hract21         STABLE
ora.mdnsd                      1   ONLINE    ONLINE       hract21         STABLE
ora.storage                    1   ONLINE    ONLINE       hract21         STABLE
--> CRSD doesn't start

CaseI-1: Shell script /u01/app/121/grid/bin/crsd with wrong protection
# chmod 000 /u01/app/121/grid/bin/crsd

CLUVFY:
Note as our CRS stack is not fully running on hract21 we have to start cluvfy from hract22
[grid@hract21 CLUVFY]$ ssh hract22  ~/CLUVFY/bin/cluvfy comp software -n hract21
Verifying software
Check: Software
Component: crs
Node Name: hract21
   /u01/app/121/grid/bin/crsd..."Permissions" did not match reference
3162 files verified
Software check failed

TRACFILE review :
Use Grep command:
# grep "2015-02-16 09:2" *  | egrep "no exe permission|failed to start process"

alert.log:
2015-02-16 09:25:05.681 [ORAROOTAGENT(16807)]CRS-5013: Agent "ORAROOTAGENT" failed to start process "/u01/app/121/grid/bin/crsd"
for action "start": details at "(:CLSN00008:)" in "/u01/app/grid/diag/crs/hract21/crs/trace/ohasd_orarootagent_root.trc"
2015-02-16 09:25:06.706 [ORAROOTAGENT(16807)]CRS-5013: Agent "ORAROOTAGENT" failed to start process "/u01/app/121/grid/bin/crsd"
for action "start": details at "(:CLSN00008:)" in "/u01/app/grid/diag/crs/hract21/crs/trace/ohasd_orarootagent_root.trc"

/u01/app/grid/diag/crs/hract21/crs/trace/ohasd_orarootagent_root.trc:
2015-02-16 09:25:06.704671 :CLSDYNAM:3169421056: [ora.crsd]{0:0:2} [start] (:CLSN00008:)Utils:execCmd scls_process_spawn() failed 1
2015-02-16 09:25:06.704859 :CLSDYNAM:3169421056: [ora.crsd]{0:0:2} [start] (:CLSN00008:) category: -1, operation: fail, loc: canexec2,
OS error: 0, other: no exe permission, file [/u01/app/121/grid/bin/crsd]
2015-02-16 09:25:06.705358 :CLSDYNAM:3169421056: [ora.crsd]{0:0:2} [start] clsnUtils::error Exception type=2 string=
CRS-5013: Agent "ORAROOTAGENT" failed to start process "/u01/app/121/grid/bin/crsd" for action "start":
details at "(:CLSN00008:)" in "/u01/app/grid/diag/crs/hract21/crs/trace/ohasd_orarootagent_root.trc"

DTRACE :
DTRACE is not helpful here !
--> Using dtrace will not help as  OS error: 0
   CW software reads the protections of crsd shell script and stops clusterware startup if protections are not sufficient.

FIX :
Change permsission and restart CRS
[root@hract21 Desktop]#  ls -l    /u01/app/121/grid/bin/crsd
---------- 1 root oinstall 9419 Jan 30 12:26 /u01/app/121/grid/bin/crsd
[root@hract21 Desktop]# chmod 755  /u01/app/121/grid/bin/crsd
[root@hract21 Desktop]# ls -l    /u01/app/121/grid/bin/crsd
rwxr-xr-x 1 root oinstall 9419 Jan 30 12:26 /u01/app/121/grid/bin/crsd


CaseI-2: Executable /u01/app/121/grid/bin/crsd.bin with wrong protection
#  ls -l  /u01/app/121/grid/bin/crsd.bin
-rwxr----x 1 root oinstall 369207571 Jun 30  2014 /u01/app/121/grid/bin/crsd.bin
# chmod 000 /u01/app/121/grid/bin/crsd.bin

CLUVFY :
Note as our CRS stack is not fully running on hract21 we have to start cluvfy from hract22
[grid@hract21 CLUVFY]$ ssh hract22  ~/CLUVFY/bin/cluvfy comp software -n hract21
Verifying software
Check: Software
Component: crs
Node Name: hract21
    /u01/app/121/grid/bin/crsd.bin..."Permissions" did not match reference
3162 files verified
Software check failed

TRACFILE review :
Grep command:
$  grep "2015-02-16 10:3" *  | egrep " fatal | terminating "

alert.log
2015-02-16 10:30:04.075 [OCSSD(29033)]CRS-1713: CSSD daemon is started in hub mode
2015-02-16 10:30:04.271 [OCSSD(29033)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc
2015-02-16 10:30:05.297 [OCSSD(29033)]CRS-1652: Starting clean up of CRSD resources.
2015-02-16 10:30:05.307 [OCSSD(29033)]CRS-1653: The clean up of the CRSD resources failed.
Mon Feb 16 10:30:05 2015
Errors in file /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc  (incident=2897):
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/grid/diag/crs/hract21/crs/incident/incdir_2897/ocssd_i2897.trc

2015-02-16 10:30:05.475 [OCSSD(29033)]CRS-8503: Oracle Clusterware OCSSD process with operating system process ID 29033 experienced fatal signal or exception code 6
Sweep [inc][2897]: completed

--> The above mentioned tracefiles doesn't help - so what is the real problem here ?

DTRACE SCRIPT :
/*
Tracing all failed  execve calls and stops tracing after oradism was started by calling exit()
*/
syscall::execve:entry
{
self->path = copyinstr(arg0);
}

/*
Trace all failed execve call on  the system
This allows us to track failed calls for OS utils like awk, sed .....
*/
syscall::execve:return
/arg0<0 && execname != "crsctl.bin"/
{
printf("- Exec: %s - execve()    %s failed with error: %d", execname, self->path, arg0 );
self->path = 0;
}

/*
Track successful CW start - stop tracing after oradism was started by calling exit()
*/

syscall::execve:return
/execname == "oradism"/
{
printf("- Exec: %s - execve()  %s -  Lower CLUSTERWARE stack successfully started - ret code:  %d - EXITING !", execname, self->path, arg0 );
self->path = 0;
exit(0);
}

DTRACE OUTPUT :
Using our dtrace script will quickly identify the root cause  [ crsd.bin failed due to Permission error ]
[root@hract21 DTRACE]# dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 21 probes
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0     89                   connect:return - Exec: mdnsd.bin - PID: 17355  connect() failed with error : -101 - fd : 39 - IP: 17.17.17.17 - Port: 256
0     89                   connect:return - Exec: gipcd.bin - PID: 17400  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0     89                   connect:return - Exec: gipcd.bin - PID: 17412  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0     89                   connect:return - Exec: gipcd.bin - PID: 17429  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0    123                    execve:return - Exec: crsd - execve()    /u01/app/121/grid/bin/crsd.bin failed with error: -13

FIX :
Change permsission and restart CRS
[root@hract21 Desktop]#  ls -l    /u01/app/121/grid/bin/crsd.bin
---------- 1 root oinstall 369207571 Jun 30  2014 /u01/app/121/grid/bin/crsd.bin
[root@hract21 Desktop]# chmod 741 /u01/app/121/grid/bin/crsd.bin
[root@hract21 Desktop]# ls -l    /u01/app/121/grid/bin/crsd.bin
[root@hract21 network-scripts]# ls -l    /u01/app/121/grid/bin/crsd.bin
-rwxr----x 1 root oinstall 369207571 Jun 30  2014 /u01/app/121/grid/bin/crsd.bin

2 thoughts on “Troubleshooting Clusterware startup problems with DTRACE”

Leave a Reply

Your email address will not be published. Required fields are marked *