Troubleshooting Clusterware startup problems with DTRACE

Case IV:  GPNPD doesn’t start – mismatch between profile.xml and the PRIVATE INTERFACE address

Potential problem:

  • The PRIVATE (cluster interconnect) interface address was changed without updating profile.xml
Monitor Clusterware Resource status after startup:
*****  Local Resources: *****
Resource NAME                  INST   TARGET       STATE        SERVER          STATE_DETAILS
------------------------------ ----   ------------ ------------ --------------- -----------------------------------------
ora.asm                           1   ONLINE       OFFLINE      -               STABLE
ora.cluster_interconnect.haip     1   ONLINE       OFFLINE      -               STABLE
ora.crf                           1   ONLINE       ONLINE       hract21         STABLE
ora.crsd                          1   ONLINE       OFFLINE      -               STABLE
ora.cssd                          1   ONLINE       OFFLINE      hract21         STARTING
ora.cssdmonitor                   1   ONLINE       ONLINE       hract21         STABLE
ora.ctssd                         1   ONLINE       OFFLINE      -               STABLE
ora.diskmon                       1   ONLINE       OFFLINE      -               STABLE
ora.drivers.acfs                  1   ONLINE       ONLINE       hract21         STABLE
ora.evmd                          1   ONLINE       INTERMEDIATE hract21         STABLE
ora.gipcd                         1   ONLINE       ONLINE       hract21         STABLE
ora.gpnpd                         1   ONLINE       INTERMEDIATE hract21         STABLE
ora.mdnsd                         1   ONLINE       ONLINE       hract21         STABLE
ora.storage                       1   ONLINE       OFFLINE      -               STABLE
--> The GPnPD daemon does not start cleanly (state INTERMEDIATE) and CSSD hangs in STARTING
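A listing like this is easier to read when it is reduced to the resources that should be ONLINE but are not. The following is a minimal Python sketch, assuming the whitespace-separated column layout shown above (NAME, INST, TARGET, STATE, SERVER, STATE_DETAILS). The sample rows are copied from the listing, and the function name not_online is only an illustration.

```python
# Subset of the resource listing above, used as sample input.
STATUS = """\
ora.asm                           1   ONLINE       OFFLINE      -               STABLE
ora.crf                           1   ONLINE       ONLINE       hract21         STABLE
ora.cssd                          1   ONLINE       OFFLINE      hract21         STARTING
ora.evmd                          1   ONLINE       INTERMEDIATE hract21         STABLE
ora.gpnpd                         1   ONLINE       INTERMEDIATE hract21         STABLE
"""


def not_online(text):
    """Return (name, state, details) for rows with TARGET=ONLINE but STATE!=ONLINE."""
    result = []
    for line in text.splitlines():
        fields = line.split()
        # Skip headers, separators and anything that is not a resource row.
        if len(fields) < 6 or not fields[0].startswith("ora."):
            continue
        name, _inst, target, state, _server = fields[:5]
        details = " ".join(fields[5:])
        if target == "ONLINE" and state != "ONLINE":
            result.append((name, state, details))
    return result


for name, state, details in not_online(STATUS):
    print(f"{name:<12} {state:<12} {details}")
```

With the sample rows, only ora.crf is filtered out; ora.asm, ora.cssd, ora.evmd and ora.gpnpd remain as candidates for further investigation.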

CLUVFY:
Cluvfy fails with PRVG-11050 error
[grid@hract21 CLUVFY]$  ssh hract22   ~/CLUVFY/bin/cluvfy stage -post crsinst -n hract21,hract22
Performing post-checks for cluster services setup
Checking node reachability...
Node reachability check passed from node "hract22"
Checking user equivalence...
User equivalence check passed for user "grid"
Checking node connectivity...
Checking hosts config file...
Verification of the hosts config file successful
ERROR:
PRVG-11050 : No matching interfaces "eth2" for subnet "192.168.2.0" on nodes "hract21"

TRACEFILE review :
alert.log:
2015-02-17 09:42:27.823 [OCSSD(15855)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc
2015-02-17 09:42:27.824 [OCSSD(15855)]CRS-1603: CSSD on node hract21 shutdown by user.
2015-02-17 09:42:27.823 [CSSDAGENT(15844)]CRS-5818: Aborted command 'start' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:0:2} in /u01/app/grid/diag/crs/hract21/crs/trace/ohasd_cssdagent_root.trc.
Tue Feb 17 09:42:32 2015
Errors in file /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc  (incident=2977):
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/grid/diag/crs/hract21/crs/incident/incdir_2977/ocssd_i2977.trc

2015-02-17 09:42:33.019 [OCSSD(15855)]CRS-8503: Oracle Clusterware OCSSD process with operating system process ID 15855 experienced fatal signal or exception code 6
Sweep [inc][2977]: completed
2015-02-17 09:42:38.005 [OHASD(11954)]CRS-2757: Command 'Start' timed out waiting for response from the resource 'ora.cssd'. Details at (:CRSPE00163:) {0:0:2} in /u01/app/grid/diag/crs/hract21/crs/trace/ohasd.trc.

ocssd.trc:
2015-02-17 09:42:32.451021 :    CSSD:2417551104: clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228, wrtcnt, 963949, LATS 92477974, lastSeqNo 963946, uniqueness 1424074596, timestamp 1424162551/21220694
2015-02-17 09:42:32.451113 :    CSSD:2422281984: clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228, wrtcnt, 963950, LATS 92477974, lastSeqNo 963947, uniqueness 1424074596, timestamp 1424162552/21220904
Trace file /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc
Oracle Database 12c Clusterware Release 12.1.0.2.0 - Production Copyright 1996, 2014 Oracle. All rights reserved.
DDE: Flood control is not active
CLSB:2467473152: Oracle Clusterware infrastructure error in OCSSD (OS PID 15855): Fatal signal 6 has occurred in program ocssd thread 2467473152; nested signal count is 1
Incident 2977 created, dump file: /u01/app/grid/diag/crs/hract21/crs/incident/incdir_2977/ocssd_i2977.trc
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
2015-02-17 09:42:33.108629 :    CSSD:2450904832: clssscWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 1000 with cvtimewait status 4294967186
2015-02-17 09:42:33.451785 :    CSSD:2417551104: clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228, wrtcnt, 963952, LATS 92478974, lastSeqNo 963949, uniqueness 1424074596, timestamp 1424162552/21221694
2015-02-17 09:42:33.451933 :    CSSD:2422281984: clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228, wrtcnt, 963953, LATS 92478974, lastSeqNo 963950, uniqueness 1424074596, timestamp 1424162553/21221904
--> A disk heartbeat without a network heartbeat tells us that we have a networking problem
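To see at a glance which remote nodes are affected, the "has a disk HB, but no network HB" messages can be counted per node. This is a rough Python sketch, assuming the message wording shown in the ocssd.trc excerpt above. The sample lines are shortened copies of that excerpt, and the function name missing_network_hb is hypothetical.

```python
import re

# Two lines from the ocssd.trc excerpt above, shortened after "no network HB".
TRACE = """\
2015-02-17 09:42:33.451785 :    CSSD:2417551104: clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228
2015-02-17 09:42:33.451933 :    CSSD:2422281984: clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228
"""

# Captures the node number and the node name of each affected remote node.
PATTERN = re.compile(r"node (\d+), ([^,\s]+), has a disk HB, but no network HB")


def missing_network_hb(text):
    """Count 'disk HB but no network HB' messages per node name."""
    counts = {}
    for match in PATTERN.finditer(text):
        node_name = match.group(2)
        counts[node_name] = counts.get(node_name, 0) + 1
    return counts


for node, count in missing_network_hb(TRACE).items():
    print(f"{node}: {count} message(s) without network heartbeat")
```

In a real session the text would be read from the trace file instead of the embedded sample.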

DTRACE OUTPUT:
- In this case DTRACE will not help.
Oracle retrieves the IP addresses via ioctl and compares them to profile.xml:
32373 <... ioctl resumed> 200, {{"lo", {AF_INET, inet_addr("127.0.0.1")}}, {"eth0", {AF_INET, inet_addr("192.168.1.7")}},
{"eth1", {AF_INET, inet_addr("192.168.5.121")}}, {"eth2", {AF_INET, inet_addr("192.168.7.121")}},
{"eth3", {AF_INET, inet_addr("192.168.3.121")}}}}) = 0

Investigate & Fix :
Check profile.xml
[root@hract21 network-scripts]#   $GRID_HOME/bin/gpnptool get 2>/dev/null  |  xmllint --format - | egrep 'CSS-Profile|ASM-Profile|Network id'
<gpnp:HostNetwork id="gen" HostName="*">
<gpnp:Network id="net1" IP="192.168.5.0" Adapter="eth1" Use="public"/>
<gpnp:Network id="net2" IP="192.168.2.0" Adapter="eth2" Use="asm,cluster_interconnect"/>
<orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/>
<orcl:ASM-Profile id="asm" DiscoveryString="/dev/asm*" SPFile="+DATA/ract2/ASMPARAMETERFILE/registry.253.870352347" Mode="remote"/>
--> eth2 is our cluster interconnect (CI) interface, with 192.168.2.0 as the related network address
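The same information can be pulled out of the filtered gpnptool output with a few lines of Python. This is only a sketch: it works on the egrep-filtered lines shown above (which are no longer well-formed XML, hence the regular expression instead of an XML parser), and the function names profile_networks and interconnect are made up for this example.

```python
import re

# The two gpnp:Network lines from the gpnptool output above.
PROFILE_LINES = '''<gpnp:Network id="net1" IP="192.168.5.0" Adapter="eth1" Use="public"/>
<gpnp:Network id="net2" IP="192.168.2.0" Adapter="eth2" Use="asm,cluster_interconnect"/>'''

# Matches name="value" attribute pairs.
ATTR = re.compile(r'(\w+)="([^"]*)"')


def profile_networks(text):
    """Return one attribute dict per gpnp:Network line; 'Use' becomes a list."""
    networks = []
    for line in text.splitlines():
        if "<gpnp:Network " not in line:
            continue
        attrs = dict(ATTR.findall(line))
        attrs["Use"] = attrs.get("Use", "").split(",")
        networks.append(attrs)
    return networks


def interconnect(networks):
    """Keep only the networks used as cluster interconnect."""
    return [n for n in networks if "cluster_interconnect" in n["Use"]]


for net in interconnect(profile_networks(PROFILE_LINES)):
    print(f'{net["Adapter"]} -> {net["IP"]}')
```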

[grid@hract21 trace]$ ping -I eth2 192.168.2.122
Warning: cannot bind to specified iface, falling back: Operation not permitted
PING 192.168.2.122 (192.168.2.122) from 192.168.1.7 eth2: 56(84) bytes of data
--> ping falls back to source address 192.168.1.7 (eth0) instead of an address on 192.168.2.0: we have a problem with our CI!

[root@hract21 network-scripts]# ifconfig eth2
eth2      Link encap:Ethernet  HWaddr 08:00:27:4E:C9:BF
inet addr:192.168.7.121  Bcast:192.168.7.255  Mask:255.255.255.0
inet6 addr: fe80::a00:27ff:fe4e:c9bf/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
--> eth2 is up and running but configured with the wrong network address (192.168.7.121 instead of an address in 192.168.2.0)
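The mismatch can also be checked mechanically by computing the network from the address and mask that ifconfig reports and comparing it with the subnet stored in profile.xml. The following Python sketch uses the standard ipaddress module. It assumes the older ifconfig output format shown above and that the profile subnet uses the same netmask as the interface; the function name configured_network is hypothetical.

```python
import ipaddress
import re

# Relevant lines from the ifconfig output above.
IFCONFIG = """\
eth2      Link encap:Ethernet  HWaddr 08:00:27:4E:C9:BF
inet addr:192.168.7.121  Bcast:192.168.7.255  Mask:255.255.255.0
"""

# Subnet for eth2 from profile.xml; the mask is an assumption taken from ifconfig.
PROFILE_SUBNET = "192.168.2.0/255.255.255.0"


def configured_network(ifconfig_text):
    """Derive the IPv4 network of an interface from its ifconfig output."""
    match = re.search(r"inet addr:(\S+)\s+Bcast:\S+\s+Mask:(\S+)", ifconfig_text)
    if match is None:
        raise ValueError("no IPv4 address found in ifconfig output")
    addr, mask = match.groups()
    return ipaddress.ip_interface(f"{addr}/{mask}").network


actual = configured_network(IFCONFIG)
expected = ipaddress.ip_network(PROFILE_SUBNET)
if actual != expected:
    print(f"MISMATCH: eth2 is on {actual}, profile.xml expects {expected}")
else:
    print(f"OK: eth2 is on {actual}")
```

Running the same check again after correcting the address is a quick way to confirm that the interface and profile.xml agree before restarting the stack.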
Fix:
  Change the address of eth2 back to 192.168.2.121, then restart the network and Clusterware (CW)
