Troubleshooting Clusterware startup problems with DTRACE

Case VI : CSSD doesn’t start as Port 42424 is already in use

Force that  error and monitor Clusterware Resource status after startup:
[root@hract21 JAVA]#   java JavaUDPServer 42424
Listening on UDP Port: 42424
..
****  Local Resources: *****
Resource NAME               INST   TARGET    STATE        SERVER          STATE_DETAILS
--------------------------- ----   ------------ ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE    OFFLINE      -               STABLE
ora.cluster_interconnect.haip  1   ONLINE    OFFLINE      -               STABLE
ora.crf                        1   ONLINE    ONLINE       hract21         STABLE
ora.crsd                       1   ONLINE    OFFLINE      -               STABLE
ora.cssd                       1   ONLINE    OFFLINE      hract21         STARTING
ora.cssdmonitor                1   ONLINE    ONLINE       hract21         STABLE
ora.ctssd                      1   ONLINE    OFFLINE      -               STABLE
ora.diskmon                    1   ONLINE    OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE    ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE    INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE    ONLINE       hract21         STABLE
ora.gpnpd                      1   ONLINE    INTERMEDIATE hract21         STABLE
ora.mdnsd                      1   ONLINE    ONLINE       hract21         STABLE
ora.storage                    1   ONLINE    OFFLINE      -               STABLE
--> GPNPD, CSSD daemon doen't start

CLUVFY:
Note: Was not able to to identify the problem with following cluvfy command:
[grid@hract21 CLUVFY]$ ssh hract22 cluvfy stage -pre crsinst -n hract21 -networks eth1:192.168.5.0:PUBLIC/eth2:192.168.2.0:cluster_interconnect -verbose

TRACEFILE review :
ocssd.trc:
2015-02-17 12:23:48.485840 :GIPCXCPT:884037376: gipcmodNetworkProcessBind: failed to bind endp 0x7f4f28237090 [0000000000002bcf] { gipcEndpoint : localAddr 'mcast://224.0.0.251:42424/192.168.2.121', remoteAddr '', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj (nil), sendp 0x7f4f282391b0 status 13flags 0x20000000, flags-2 0x0, usrFlags 0xc000 }, addr 0x7f4f28237b50 [0000000000002bd1] { gipcAddress : name 'mcast://224.0.0.251:42424/192.168.2.121', objFlags 0x0, addrFlags 0x5 }
2015-02-17 12:23:48.485840 :GIPCXCPT:884037376: gipcmodNetworkProcessBind: slos op  :  sgipcnMctBind
2015-02-17 12:23:48.485840 :GIPCXCPT:884037376: gipcmodNetworkProcessBind: slos dep :  Address already in use (98)
2015-02-17 12:23:48.485840 :GIPCXCPT:884037376: gipcmodNetworkProcessBind: slos loc :  bind
2015-02-17 12:23:48.485840 :GIPCXCPT:884037376: gipcmodNetworkProcessBind: slos info:  Invalid argument
GREP command :
$ grep "2015-02-17 12:" * | egrep 'gipcmodNetworkProcessBind'


DTRACE SCRIPT :
syscall::bind:entry
{
self->fd = arg0;
self->sockaddr =  arg1;
sockaddrp  =(struct sockaddr *)copyin(self->sockaddr, sizeof(struct sockaddr));
s = (char * )sockaddrp;
self->port =  ( unsigned short )(*(s+3)) + ( unsigned short ) ((*(s+2)*256));
self->ip1=*(s+4);
self->ip2=*(s+5);
self->ip3=*(s+6);
self->ip4=*(s+7);
}

/*
Generic DTRACE script tracking failed bind() system calls:
*/
syscall::bind:return
/arg0<0 && execname != "crsctl.bin"/
{
printf("- Exec: %s - PID: %d  bind() failed with error : %d - fd : %d - IP: %d.%d.%d.%d - Port: %d " , execname, pid, arg0, self->fd,
self->ip1, self->ip2, self->ip3, self->ip4,    self->port  );
}

DTRACE OUTPUT :
[root@hract21 DTRACE]# dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 19 probes
dtrace: buffer size lowered to 1m
dtrace: dynamic variable size lowered to 1m
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0    103                      bind:return - Exec: ocssd.bin - PID: 13064  bind() failed with error : -98 - fd : 211 - IP: 224.0.0.251 - Port: 42424
..
0     89                   connect:return - Exec: gipcd.bin - PID: 12957  connect() - fd : 27 - IP: 192.168.5.50 - Port: 53
0     89                   connect:return - Exec: gipcd.bin - PID: 12957  connect() - fd : 87 - IP: 192.168.5.50 - Port: 53
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 76 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 114 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 124 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 125 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: ocssd.bin - PID: 13064  bind() failed with error : -98 - fd : 133 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 126 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: ocssd.bin - PID: 13064  bind() failed with error : -98 - fd : 136 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 150 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: ocssd.bin - PID: 13064  bind() failed with error : -98 - fd : 137 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 151 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: ocssd.bin - PID: 13064  bind() failed with error : -98 - fd : 138 - IP: 224.0.0.251 - Port: 42424

Investigate & Fix :
[root@hract21 Desktop]#  netstat -taupen | egrep ":53 |:5353 |:42424"
udp        0      0 0.0.0.0:5353      501        35360021   12828/mdnsd.bin
udp        0      0 0.0.0.0:5353      501        35360016   12828/mdnsd.bin
udp        0      0 0.0.0.0:5353      501        35360015   12828/mdnsd.bin
udp        0      0 0.0.0.0:5353      501        35360014   12828/mdnsd.bin
udp   264064      0 0.0.0.0:42424       0          35356639   12631/java
--> The clusterware port 42422 is used by a java program with PID 12631

FIX -->  kill that process with
# kill -9 12631
After that Clusterware should start immediate

2 thoughts on “Troubleshooting Clusterware startup problems with DTRACE”

Leave a Reply

Your email address will not be published. Required fields are marked *