Troubleshooting Clusterware startup problems with DTRACE

Case VIII: GIPCD, GPNPD and CSSD not starting because the nameserver is not reachable

  • OS Error [ ECONNREFUSED    111 -> Connection refused ]
Force the error and monitor the Clusterware resource status after startup.
On the nameserver run:
[root@ns1 ~]# service named stop
Stopping named: .                                          [  OK  ]
*****  Local Resources: *****
Resource NAME               INST   TARGET    STATE        SERVER          STATE_DETAILS
--------------------------- ----   ------------ ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE    OFFLINE      -               STABLE
ora.cluster_interconnect.haip  1   ONLINE    OFFLINE      -               STABLE
ora.crf                        1   ONLINE    ONLINE       hract21         STABLE
ora.crsd                       1   ONLINE    OFFLINE      -               STABLE
ora.cssd                       1   ONLINE    OFFLINE      hract21         STARTING
ora.cssdmonitor                1   ONLINE    ONLINE       hract21         STABLE
ora.ctssd                      1   ONLINE    OFFLINE      -               STABLE
ora.diskmon                    1   ONLINE    OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE    ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE    INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE    OFFLINE      -               STABLE
ora.gpnpd                      1   ONLINE    INTERMEDIATE hract21         STABLE
ora.mdnsd                      1   ONLINE    ONLINE       hract21         STABLE
ora.storage                    1   ONLINE    OFFLINE      -               STABLE
--> GIPCD, GPNPD and CSSD are not starting

CLUVFY :
[grid@hract21 trace]$ ~/CLUVFY/bin/cluvfy -version
PRVF-0002 : Could not retrieve local nodename

TRACEFILE review :
Grep trace files for any Resolve errors [ OS function:  getaddrinfo() ]
[grid@hract21 trace]$ grep "2015-02-17 14:1" * | grep gipcmodNetworkResolve
gipcd.trc:
2015-02-17 14:13:36.137197 :GIPCXCPT:2309576448:  gipcInternalEndpoint: failed to bind address to endpoint name 'tcp://hract21.example.com', ret gipcretFail (1)
2015-02-17 14:13:41.141266 :GIPCXCPT:2309576448:  gipcmodNetworkResolve: failed to create new address for osName 'hract21.example.com', name 'tcp://hract21.example.com'
2015-02-17 14:13:41.141285 :GIPCXCPT:2309576448:  gipcmodNetworkResolve: slos op  :  sgipcnPopulateAddrInfo
2015-02-17 14:13:41.141289 :GIPCXCPT:2309576448:  gipcmodNetworkResolve: slos dep :  Connection refused (111)
2015-02-17 14:13:41.141293 :GIPCXCPT:2309576448:  gipcmodNetworkResolve: slos loc :  getaddrinfo(
2015-02-17 14:13:41.141297 :GIPCXCPT:2309576448:  gipcmodNetworkResolve: slos info:  server not available,try again
2015-02-17 14:13:41.141342 :GIPCXCPT:2309576448:  gipcResolveF [gipcInternalBind : gipcInternal.c : 537]: EXCEPTION[ ret gipcretFail (1) ]  
          failed to resolve address 0x7fd764033be0 [0000000000000310] 
          { gipcAddress : name 'tcp://hract21.example.com', objFlags 0x0, addrFlags 0x8 }, flags 0x4000
2015-02-17 14:13:41.141365 :GIPCXCPT:2309576448:  gipcBindF [gipcInternalEndpoint : gipcInternal.c : 468]: EXCEPTION[ ret gipcretFail (1) ]  failed to bind endp 0x7fd764033070 [000000000000030e] { gipcEndpoint : localAddr 'tcp://hract21.example.com', remoteAddr '', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj (nil), sendp (nil) status 13flags 0x40008000, flags-2 0x0, usrFlags 0x240a0 }, addr 0x7fd764034890 [0000000000000315] { gipcAddress : name 'tcp://hract21.example.com', objFlags 0x0, addrFlags 0x8 }, flags 0x200a0
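
The four "slos" lines above carry the actual OS error, so it can help to pull them out of a large gipcd.trc automatically. The following Python helper is only a sketch: the regular expression is derived from the trace lines shown here and may need adjusting for other Clusterware versions, and collect_slos is a hypothetical name.

```python
import re

# Matches lines such as:
# 2015-02-17 14:13:41.141289 :GIPCXCPT:2309576448:  gipcmodNetworkResolve: slos dep :  Connection refused (111)
SLOS_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) :GIPCXCPT:\d+:\s+"
    r"gipcmodNetworkResolve: slos (?P<key>op|dep|loc|info)\s*:\s*(?P<value>.*)$"
)

def collect_slos(lines):
    """Group the 'slos op/dep/loc/info' lines into one dict per resolve failure."""
    failures, current = [], {}
    for line in lines:
        m = SLOS_RE.match(line.strip())
        if not m:
            continue
        key = m.group("key")
        # A new 'slos op' line starts the next failure record.
        if key == "op" and current:
            failures.append(current)
            current = {}
        current.setdefault("ts", m.group("ts"))
        current[key] = m.group("value").strip()
    if current:
        failures.append(current)
    return failures
```

Feeding it the lines of gipcd.trc returns one record per failed resolve, with the timestamp of its first "slos" line and the OS error text under the key "dep".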

DTRACE SCRIPT helper:
Use strace first to see which system calls fail; this gives an idea of how to write a working DTRACE script
22752 connect(27, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.5.50")}, 16 <unfinished ...>
22750 <... ioctl resumed> , {ifr_name="eth2", ifr_broadaddr={AF_INET, inet_addr("192.168.2.255")}}) = 0
22752 <... connect resumed> )           = 0
22750 ioctl(28, SIOCGIFFLAGS <unfinished ...>
22752 poll([{fd=27, events=POLLOUT}], 1, 0 <unfinished ...>
22750 <... ioctl resumed> , {ifr_name="eth3", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 0
22752 <... poll resumed> )              = 1 ([{fd=27, revents=POLLOUT}])
22750 ioctl(28, SIOCGIFADDR <unfinished ...>
22752 sendto(27, "\320X\1\0\0\1\0\0\0\0\0\0\7hract21\7example\3com"..., 37, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
22750 <... ioctl resumed> , {ifr_name="eth3", ifr_addr={AF_INET, inet_addr("192.168.3.121")}}) = 0
22752 <... sendto resumed> )            = 37
22750 ioctl(28, SIOCGIFNETMASK <unfinished ...>
22752 poll([{fd=27, events=POLLIN|POLLOUT}], 1, 5000 <unfinished ...>
22750 <... ioctl resumed> , {ifr_name="eth3", ifr_netmask={AF_INET, inet_addr("255.255.255.0")}}) = 0
22752 <... poll resumed> )              = 1 ([{fd=27, revents=POLLOUT}])
22750 ioctl(28, SIOCGIFBRDADDR <unfinished ...>
22752 sendto(27, "\16\227\1\0\0\1\0\0\0\0\0\0\7hract21\7example\3com"..., 37, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
22750 <... ioctl resumed> , {ifr_name="eth3", ifr_broadaddr={AF_INET, inet_addr("192.168.3.255")}}) = 0
22752 <... sendto resumed> )            = -1 ECONNREFUSED (Connection refused)

--> The connect() call on fd=27 works; parameter 2 of the connect() call holds the IP address and port ( 192.168.5.50, port 53 ).
    The first sendto(27, .. ) call still returns 37, but the following sendto(27, .. ) call fails with error ECONNREFUSED.
    To select the right sendto() call you need to use the PID ( 22752 ) and the file descriptor fd=27 ( sendto(27, .. ) ).
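
The buffer passed to sendto() can be decoded to confirm that this really is the DNS lookup for the local node name. The sketch below assumes the buffer is a standard DNS query (12-byte header followed by length-prefixed labels); decode_dns_qname is a hypothetical helper, and the payload is the prefix that strace printed, so the trailing bytes that strace cut off are missing.

```python
def decode_dns_qname(payload):
    """Return (transaction_id, qname) from the start of a DNS query datagram."""
    if len(payload) < 12:
        raise ValueError("shorter than a DNS header")
    txid = int.from_bytes(payload[:2], "big")
    labels, pos = [], 12          # the question section starts after the 12-byte header
    while pos < len(payload):
        length = payload[pos]
        if length == 0:           # a zero-length label ends the name
            break
        labels.append(payload[pos + 1:pos + 1 + length].decode("ascii"))
        pos += 1 + length
    return txid, ".".join(labels)

# Octal escapes copied from the first sendto() line of the strace output.
payload = b"\320X\1\0\0\1\0\0\0\0\0\0\7hract21\7example\3com"
print(decode_dns_qname(payload))
```

The decoded name is hract21.example.com, the same name that gipcmodNetworkResolve fails to resolve in gipcd.trc.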

Requirements for DTRACE script details
- Collect info about the IP address from a former connect() call ( we need to trace all connect calls )
- Trace the sendto call for errors like ECONNREFUSED
- Use the file descriptor fd ( fd=27 ) to tie the connect call to the sendto call
- Always attach strace to the gipcd process to check whether your Oracle version
  executes the same system calls in the same order
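
Before writing the DTRACE script, the same correlation can be tried offline on the strace output. This Python sketch assumes output in the format shown above, where the sendto() calls are split into an "unfinished" and a "resumed" line; the function name and regular expressions are illustrative only, and sendto() calls printed on a single line are not handled.

```python
import re

CONNECT_RE = re.compile(
    r'^(?P<pid>\d+) connect\((?P<fd>\d+), \{sa_family=AF_INET, '
    r'sin_port=htons\((?P<port>\d+)\), sin_addr=inet_addr\("(?P<ip>[\d.]+)"\)\}'
)
SENDTO_RE = re.compile(r'^(?P<pid>\d+) sendto\((?P<fd>\d+),')
RESUMED_RE = re.compile(
    r'^(?P<pid>\d+) <\.\.\. sendto resumed> ?\)\s+= (?P<ret>-?\d+)(?: (?P<err>E[A-Z]+))?'
)

def failed_sendto_targets(lines):
    """Return (pid, fd, errno_name, (ip, port)) for every failed sendto()."""
    peers = {}     # (pid, fd) -> (ip, port) taken from the last connect() on that fd
    pending = {}   # pid -> fd of a sendto() still waiting for its "resumed" line
    failures = []
    for line in lines:
        m = CONNECT_RE.match(line)
        if m:
            peers[(m["pid"], int(m["fd"]))] = (m["ip"], int(m["port"]))
            continue
        m = SENDTO_RE.match(line)
        if m:
            pending[m["pid"]] = int(m["fd"])
            continue
        m = RESUMED_RE.match(line)
        if m:
            fd = pending.pop(m["pid"], None)
            if m["err"]:
                failures.append((m["pid"], fd, m["err"], peers.get((m["pid"], fd))))
    return failures
```

The PID and the file descriptor together form the key, which is exactly what the DTRACE script has to remember between the connect() and the sendto() probes.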

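DTRACE SCRIPT :
/*
 * Excerpt of check_rac.d. self->fd, self->ip1 .. self->ip4 and self->port have to be
 * filled in a syscall::connect:entry clause and ns_ip_port ( 53 ) has to be defined
 * before these clauses run; that part of the script is not shown here.
 */

/* Report every connect() to the nameserver port */
syscall::connect:return
/self->port == ns_ip_port && execname != "crsctl.bin"/
{
    printf("- Exec: %s - PID: %d  connect() to Nameserver - fd : %d - IP: %d.%d.%d.%d - Port: %d ",
        execname, pid, self->fd, self->ip1, self->ip2, self->ip3, self->ip4, self->port);
}

/* Remember the file descriptor ( parameter 1 ) of the current sendto() call */
syscall::sendto:entry
/execname != "crsctl.bin"/
{
    self->fds = arg0;
}

/* On return arg0 holds the return value; a failure shows up as a negative errno ( -111 ) */
syscall::sendto:return
/arg0 < 0 && execname != "crsctl.bin"/
{
    printf("- Exec: %s - PID: %d  sendto() failed with error : %d - fd : %d ",
        execname, pid, arg0, self->fds);
}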

DTRACE output:
[root@hract21 DTRACE]# !dt
dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 21 probes
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0     93                    sendto:return - Exec: orarootagent.bi - PID: 29204  sendto() failed with error : -111 - fd : 15
0     93                    sendto:return - Exec: oraagent.bin - PID: 29308  sendto() failed with error : -111 - fd : 15
0     93                    sendto:return - Exec: oraagent.bin - PID: 29308  sendto() failed with error : -111 - fd : 15
0     89                   connect:return - Exec: gipcd.bin - PID: 29363  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0     93                    sendto:return - Exec: gipcd.bin - PID: 29363  sendto() failed with error : -111 - fd : 27
0     89                   connect:return - Exec: gipcd.bin - PID: 29363  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0     93                    sendto:return - Exec: mdnsd.bin - PID: 29320  sendto() failed with error : -111 - fd : 7
0     93                    sendto:return - Exec: gpnpd.bin - PID: 29343  sendto() failed with error : -111 - fd : 15
--> In this sample gipcd.bin fails to communicate with the nameserver.
    The failed system call is sendto() - Error ECONNREFUSED 111 - Connection refused - following a
    successful connect() system call.
    Note: File descriptor fd=27 shows that the connect() and sendto() system calls operate on the same socket.
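
The negative numbers in the DTRACE output are errno values with the sign flipped. A small lookup makes them readable; this sketch hard-codes only the two Linux errno values that appear in the output above, and describe_ret is a hypothetical helper name.

```python
# Linux errno values for the two return codes seen in the DTRACE output
# (assumption: Linux, where the return probe sees -errno on failure).
LINUX_ERRNO = {
    6: ("ENXIO", "No such device or address"),
    111: ("ECONNREFUSED", "Connection refused"),
}

def describe_ret(ret):
    """Translate a syscall return value into 'NAME (text)' or 'success'."""
    if ret >= 0:
        return "success"
    name, text = LINUX_ERRNO.get(-ret, ("errno %d" % -ret, "not in table"))
    return "%s (%s)" % (name, text)

print(describe_ret(-111))
print(describe_ret(-6))
```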

Investigate & Fix
[root@hract21 network-scripts]# ping ns1.example.com
ping: unknown host ns1.example.com

--> Fix : Restart your nameserver and check the nameserver IP address and port
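
Because ping cannot resolve names while the nameserver is down, it is more useful to probe port 53 by IP address. The Python sketch below assumes the DNS traffic seen in strace goes over UDP (connect() returned 0 although named was stopped, which fits a UDP socket where connect() only records the peer). probe_udp is a hypothetical helper and the one-byte payload is not a valid DNS query; it only serves to provoke the error.

```python
import socket

def probe_udp(ip, port=53, payload=b"\0", timeout=1.0):
    """Return 'refused', 'timeout' or 'answered' for a connected UDP socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.connect((ip, port))    # UDP connect() sends nothing, so it succeeds
        try:
            s.send(payload)      # first datagram; a closed port answers with ICMP port unreachable
            s.recv(512)          # the pending error is reported on the next socket call
        except ConnectionRefusedError:
            return "refused"
        except socket.timeout:
            return "timeout"
        return "answered"

if __name__ == "__main__":
    # 192.168.5.50 is the nameserver address taken from the connect() call above.
    print(probe_udp("192.168.5.50"))
```

A result of "refused" means the host is up but nothing listens on port 53, which matches a stopped named; "timeout" points to a network or firewall problem, or to a server that ignores the dummy payload.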
