Troubleshooting Clusterware startup problems with DTRACE

Case V: GIPCD doesn’t start – mismatch between profile.xml and the PUBLIC INTERFACE address

Potential causes:

  • /etc/hosts and DNS (nslookup) are not in sync
  • The PUBLIC interface was changed without updating profile.xml
  • DNS returned a wrong host address

Force the error and monitor the Clusterware resource status after startup:
*****  Local Resources: *****
Resource NAME               INST   TARGET       STATE        SERVER          STATE_DETAILS
--------------------------- ----   ------------ ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE       OFFLINE      -               STABLE
ora.cluster_interconnect.haip  1   ONLINE       OFFLINE      -               STABLE
ora.crf                        1   ONLINE       ONLINE       hract21         STABLE
ora.crsd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssdmonitor                1   ONLINE       ONLINE       hract21         STABLE
ora.ctssd                      1   ONLINE       OFFLINE      -               STABLE
ora.diskmon                    1   ONLINE       OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE       ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE       INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE       OFFLINE      -               STABLE
ora.gpnpd                      1   ONLINE       ONLINE       hract21         STABLE
ora.mdnsd                      1   ONLINE       ONLINE       hract21         STABLE
ora.storage                    1   ONLINE       OFFLINE      -               STABLE
--> GIPCD doesn't start
CLUVFY:
[grid@hract21 CLUVFY]$ cluvfy  comp nodecon -n hract21,hract22
Verifying node connectivity
ERROR:
PRVF-6006 : unable to reach the IP addresses "hract21,hract22" from the local node
PRKC-1071 : Nodes "hract21,hract22" did not respond to ping in "3" seconds,
PRKN-1035 : Host "hract21" is unreachable
PRKN-1035 : Host "hract22" is unreachable
Verification cannot proceed
Verification of node connectivity was unsuccessful on all the specified nodes.

TRACEFILE review :
gipcd.trc:
2015-02-17 11:48:39.300878 :GIPCXCPT:3369244416:  gipcmodNetworkProcessBind: slos op  :  sgipcnTcpBind
2015-02-17 11:48:39.300880 :GIPCXCPT:3369244416:  gipcmodNetworkProcessBind: slos dep :  Cannot assign requested address (99)
2015-02-17 11:48:39.300882 :GIPCXCPT:3369244416:  gipcmodNetworkProcessBind: slos loc :  bind
2015-02-17 11:48:39.300884 :GIPCXCPT:3369244416:  gipcmodNetworkProcessBind: slos info:  addr '192.168.7.121:0'
2015-02-17 11:48:39.300920 :GIPCXCPT:3369244416:  gipcBindF [gipcInternalEndpoint : gipcInternal.c : 468]: EXCEPTION[ ret gipcretAddressNotAvailable (39) ]
failed to bind endp 0x7fb6a4027990 [0000000000000306] { gipcEndpoint : localAddr 'tcp://192.168.7.121', remoteAddr '', numPend 0,
numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj (nil), sendp 0x7fb6a4033bd0
status 13flags 0x20008000, flags-2 0x0, usrFlags 0x20020 }, addr 0x7fb6a4033070 [000000000000030d] { gipcAddress :
name 'tcp://hract21.example.com', objFlags 0x0, addrFlags 0x4 }, flags 0x20020
2015-02-17 11:48:39.300928 :GIPCXCPT:3369244416:  gipcInternalEndpoint: failed to bind address to endpoint name 'tcp://hract21.example.com',
ret gipcretAddressNotAvailable (39)

Grep Command :
[grid@hract21 trace]$  grep "2015-02-17 11:4" * | egrep 'gipcmodNetworkProcessBind'
gipcd.trc:2015-02-17 11:48:38.129278 :GIPCXCPT:2967607040:  gipcmodNetworkProcessBind: slos op  :  sgipcnTcpBind
gipcd.trc:2015-02-17 11:48:38.129280 :GIPCXCPT:2967607040:  gipcmodNetworkProcessBind: slos dep :  Cannot assign requested address (99)
gipcd.trc:2015-02-17 11:48:38.129281 :GIPCXCPT:2967607040:  gipcmodNetworkProcessBind: slos loc :  bind
gipcd.trc:2015-02-17 11:48:38.129283 :GIPCXCPT:2967607040:  gipcmodNetworkProcessBind: slos info:  addr '192.168.7.121:0'
--> The grep command is quite useful!
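
When the trace directory holds many restart attempts, the grep output above can be condensed to one line per address. The following is only a sketch: summarize_bind_addrs is a hypothetical helper name, and it assumes the trace lines keep the "slos info:  addr '...'" layout shown in gipcd.trc.

```shell
# Hypothetical helper (not part of Clusterware): reads trace lines on stdin
# and prints "<address> <count>" for every address that failed to bind.
summarize_bind_addrs() {
    # Keep only the "slos info" lines written by gipcmodNetworkProcessBind,
    # extract the quoted address, then count identical addresses.
    grep 'gipcmodNetworkProcessBind: slos info' |
        sed -n "s/.*addr '\([^']*\)'.*/\1/p" |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

It could be fed with the same filter as before, for example: grep "2015-02-17 11:4" * | summarize_bind_addrs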

DTRACE SCRIPT :
/*
Generic DTRACE script tracking IP address and port for bind() system calls.
Assumes an AF_INET sockaddr: bytes 2-3 hold the port and bytes 4-7 the IPv4
address, both in network byte order.
*/
syscall::bind:entry
{
    self->fd = arg0;
    self->sockaddr = arg1;
    /* Copy the sockaddr from user space; unsigned char keeps octets above 127 positive. */
    this->s = (unsigned char *)copyin(self->sockaddr, sizeof(struct sockaddr));
    self->port = ((unsigned short)this->s[2] * 256) + (unsigned short)this->s[3];
    self->ip1 = this->s[4];
    self->ip2 = this->s[5];
    self->ip3 = this->s[6];
    self->ip4 = this->s[7];
}

/*
Generic DTRACE script reporting failed bind() system calls.
On failure arg0 holds the negative error number (for example -99).
The address and port were saved by the bind:entry clause of the same thread.
*/
syscall::bind:return
/arg0<0 && execname != "crsctl.bin"/
{
    printf("- Exec: %s - PID: %d  bind() failed with error : %d - fd : %d - IP: %d.%d.%d.%d - Port: %d " ,
        execname, pid, arg0, self->fd,
        self->ip1, self->ip2, self->ip3, self->ip4, self->port);
}

DTRACE OUTPUT :
[root@hract21 DTRACE]# dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 21 probes
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0     89                   connect:return - Exec: mdnsd.bin - PID: 26518  connect() failed with error : -101 - fd : 39 - IP: 17.17.17.17 - Port: 256
0    103                      bind:return - Exec: gipcd.bin - PID: 26658  bind() failed with error : -99 - fd : 87 - IP: 192.168.7.121 - Port: 0
0    103                      bind:return - Exec: gipcd.bin - PID: 26696  bind() failed with error : -99 - fd : 87 - IP: 192.168.7.121 - Port: 0
0    103                      bind:return - Exec: gipcd.bin - PID: 26722  bind() failed with error : -99 - fd : 87 - IP: 192.168.7.121 - Port: 0
0    103                      bind:return - Exec: gipcd.bin - PID: 26740  bind() failed with error : -99 - fd : 87 - IP: 192.168.7.121 - Port: 0
0    103                      bind:return - Exec: gipcd.bin - PID: 26757  bind() failed with error : -99 - fd : 87 - IP: 192.168.7.121 - Port: 0
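
The repeated bind:return lines with changing PIDs show gipcd.bin being restarted and failing each time on the same address. A minimal sketch for condensing such output follows; summarize_failures is a hypothetical helper, and it assumes the printf layout used by the script above ("Exec: ...", "error : ...", "IP: ..."). Lines without an IP field, such as the open() failures, are skipped.

```shell
# Hypothetical helper: reads DTrace output on stdin and prints
# "<executable> <error> <ip> <count>" for each failing address.
summarize_failures() {
    awk '/failed with error/ {
        exec = ""; ip = ""; err = ""
        for (i = 1; i <= NF; i++) {
            if ($i == "Exec:") exec = $(i + 1)
            if ($i == "IP:")   ip   = $(i + 1)
            # The bind/connect clauses print "error : <n>" with a separate colon
            if ($i == "error" && $(i + 1) == ":") err = $(i + 2)
        }
        if (ip != "") count[exec " " err " " ip]++
    }
    END { for (k in count) print k, count[k] }' | sort
}
```

Usage could be as simple as: dtrace -s check_rac.d | summarize_failures (the summary is printed once dtrace exits).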

Investigate & Fix :
Check profile.xml
[root@hract21 network-scripts]#   $GRID_HOME/bin/gpnptool get 2>/dev/null  |  xmllint --format - | egrep 'CSS-Profile|ASM-Profile|Network id'
<gpnp:HostNetwork id="gen" HostName="*">
<gpnp:Network id="net1" IP="192.168.5.0" Adapter="eth1" Use="public"/>
<gpnp:Network id="net2" IP="192.168.2.0" Adapter="eth2" Use="asm,cluster_interconnect"/>
<orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/>
<orcl:ASM-Profile id="asm" DiscoveryString="/dev/asm*" SPFile="+DATA/ract2/ASMPARAMETERFILE/registry.253.870352347" Mode="remote"/>
-> eth1 is our PUBLIC network interface - with 192.168.5.0 as the related NETWORK address

[root@hract21 Desktop]# ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 08:00:27:7D:8E:49
inet addr:192.168.5.121  Bcast:192.168.5.255  Mask:255.255.255.0
inet6 addr: fe80::a00:27ff:fe7d:8e49/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
--> ifconfig looks good but why is gipcd.bin picking up  192.168.7.121 ?
[root@hract21 Desktop]# ping hract21
PING hract21 (192.168.7.121) 56(84) bytes of data.
--> ping uses the wrong address too and hangs
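
Whether an address belongs to the public network from profile.xml can also be checked arithmetically. The sketch below is illustrative only: ip_to_int and in_network are hypothetical helper names, and the netmask 255.255.255.0 is taken from the ifconfig output for eth1.

```shell
# Hypothetical helper: convert a dotted-quad IPv4 address to a 32-bit integer.
# The function body runs in a subshell so the IFS change does not leak.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Hypothetical helper: in_network IP NETWORK NETMASK
# Succeeds (exit status 0) when IP masked with NETMASK equals NETWORK.
in_network() {
    [ $(( $(ip_to_int "$1") & $(ip_to_int "$3") )) -eq "$(ip_to_int "$2")" ]
}
```

With the values from this case, in_network 192.168.5.121 192.168.5.0 255.255.255.0 succeeds, while in_network 192.168.7.121 192.168.5.0 255.255.255.0 fails: the address that hract21 resolves to is outside the public network registered in profile.xml.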

[root@hract21 Desktop]# grep hract21 /etc/hosts
192.168.7.121 hract21 hract21.example.com

FIX --> Modify the hract21 entry in /etc/hosts so that it points to 192.168.5.121, the address configured on eth1
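
After editing the file, the mapping can be re-checked before restarting Clusterware. This is a sketch under assumptions: hosts_ip is a hypothetical helper that parses a hosts-format file directly, so it reflects only that file and not DNS.

```shell
# Hypothetical helper: hosts_ip FILE NAME
# Prints the first address mapped to NAME in a hosts-format file.
hosts_ip() {
    awk -v name="$2" '
        { sub(/#.*/, "") }    # drop comments; awk re-splits the fields
        {
            # Field 1 is the address, the remaining fields are host names
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; exit }
        }
    ' "$1"
}
```

For this case, hosts_ip /etc/hosts hract21 should print 192.168.5.121 once the entry is corrected. Running getent hosts hract21 additionally shows what the system resolver returns, which helps when /etc/hosts and DNS disagree.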
