Troubleshooting Clusterware startup problems with DTRACE

Before setting up DTRACE you may want to run through the steps below

Clean up the special socket files in /var/tmp/.oracle

Either reboot your OS, or clean up the socket files and restart the CRS stack :
[root@hract21 Desktop]# crsctl stop crs -f
[root@hract21 Desktop]# rm -rf /var/tmp/.oracle/*
[root@hract21 Desktop]# crsctl start crs
 CRS-4123: Oracle High Availability Services has been started.

Note: A complete OS reboot may be needed to fix hanging processes waiting in DISKWAIT state.
      If possible always try to do an OS reboot. 
      An OS reboot will always clean up /var/tmp/.oracle/*

Quickly verify your OS with a simple sh script : chk_os.sh

#!/bin/bash 
NS=ns1.example.com
HOSTNAME1=hract21.example.com
HOSTNAME2=hract22.example.com
PRIV_IP1=192.168.2.121
PRIV_IP2=192.168.2.122
PUBLIC_IF=eth1
PRIVATE_IF=eth2

echo ""
echo "Disk Space : "
df

echo ""
echo "Major Clusterware Executable Protections : "
ls -l $GRID_HOME/bin/ohasd*
ls -l $GRID_HOME/bin/orarootagent*
ls -l $GRID_HOME/bin/oraagent*
ls -l $GRID_HOME/bin/mdnsd*
ls -l $GRID_HOME/bin/evmd*
ls -l $GRID_HOME/bin/gpnpd*
ls -l $GRID_HOME/bin/evmlogger*
ls -l $GRID_HOME/bin/osysmond.*
ls -l $GRID_HOME/bin/gipcd*
ls -l $GRID_HOME/bin/cssdmonitor*
ls -l $GRID_HOME/bin/cssdagent*
ls -l $GRID_HOME/bin/ocssd*
ls -l $GRID_HOME/bin/octssd*
ls -l $GRID_HOME/bin/crsd
ls -l $GRID_HOME/bin/crsd.bin
ls -l $GRID_HOME/bin/tnslsnr


echo ""
echo "Ping Nameserver: "
ping -c 2  $NS 

echo ""
echo "Test your PUBLIC interface and your nameserver setup"
nslookup $HOSTNAME
ping -I $PUBLIC_IF -c 2   $HOSTNAME1
ping -I $PUBLIC_IF -c 2   $HOSTNAME2
 
ping -I $PRIVATE_IF -c 2   $PRIV_IP1 
ping -I $PRIVATE_IF -c 2   $PRIV_IP2

echo ""
echo "Verify protections for HOSTNAME.pid files should be : 644"
find $GRID_HOME -name hract21.pid  -exec ls -l {} \; 

echo ""
echo "Service iptables and avahi-daemon should not run - avahi-daemon uses CW port 5353 "
service iptables status
ps -elf | grep avahi | grep -v grep

echo ""
echo "Ports :53 :5353 :42422 :8888 should not be used by NON-Clusterware processes "
echo "  - OC4J reports : tcp   0 0 ::ffff:127.0.0.1:8888  :::*  LISTEN   501 67433979  2580/java"           
netstat -taupen | egrep ":53 |:5353 |:42424 |:8888 "

echo ""
echo "Compare profile.xml the IP Address of PUBLIC and PRIVATE Interfaces "
echo " - Devices should report UP BROADCAST RUNNING MULTICAST "
echo " - Double check NETWORK addresses matches profile.xml settings   "
echo ""
$GRID_HOME/bin/gpnptool get 2>/dev/null  |  xmllint --format - | egrep 'CSS-Profile|ASM-Profile|Network id'
echo ""
ifconfig $PUBLIC_IF | egrep 'eth|inet addr|MTU'
echo ""
ifconfig $PRIVATE_IF | egrep 'eth|inet addr|MTU'

echo "Checking ASM disk status for disk named /dev/asm ...  - you may need to changes this "
ls -l  /dev/asm*

echo ""
echo "Verify ASM disk "
su - grid -c "ssh $HOSTNAME2 ocrcheck"
su - grid -c "ssh $HOSTNAME2  asmcmd lsdsk -k"
echo ""
su - grid -c "kfed read /dev/asmdisk1_10G | grep name"
echo ""
su - grid -c "kfed read /dev/asmdisk2_10G | grep name"
echo ""
su - grid -c "kfed read /dev/asmdisk3_10G | grep name"
echo ""
su - grid -c "kfed read /dev/asmdisk4_10G | grep name"
echo ""


Output:
..
Ports :53 :5353 :42424 :8888 should not be used by NON-Clusterware processes 
  - OC4J reports : tcp   0 0 ::ffff:127.0.0.1:8888  :::*  LISTEN   501 67433979  2580/java
udp        0      0 0.0.0.0:5353                0.0.0.0:*    501        54383580   28618/mdnsd.bin     
udp        0      0 0.0.0.0:5353                0.0.0.0:*    501        54383565   28618/mdnsd.bin     
udp        0      0 0.0.0.0:5353                0.0.0.0:*    501        54383564   28618/mdnsd.bin     
udp        0      0 0.0.0.0:5353                0.0.0.0:*    501        54383563   28618/mdnsd.bin     
udp        0      0 192.168.2.255:42424         0.0.0.0:*    0          54429417   28502/ohasd.bin     
udp        0      0 230.0.1.0:42424             0.0.0.0:*    0          54429416   28502/ohasd.bin     
udp        0      0 224.0.0.251:42424           0.0.0.0:*    0          54429415   28502/ohasd.bin     
udp        0      0 192.168.2.255:42424         0.0.0.0:*    501        54412444   28827/ocssd.bin     
udp        0      0 230.0.1.0:42424             0.0.0.0:*    501        54412443   28827/ocssd.bin     
udp        0      0 224.0.0.251:42424           0.0.0.0:*    501        54412442   28827/ocssd.bin     
udp        0      0 192.168.2.255:42424         0.0.0.0:*    501        54406273   28742/gipcd.bin     
udp        0      0 230.0.1.0:42424             0.0.0.0:*    501        54406272   28742/gipcd.bin     
udp        0      0 224.0.0.251:42424           0.0.0.0:*    501        54406271   28742/gipcd.bin     
udp        0      0 192.168.5.58:53             0.0.0.0:*    0          67400781   2472/gnsd.bin 
tcp        0      0 ::ffff:127.0.0.1:8888        LISTEN      501        67433979   2580/java  
--> mdnsd.bin is using port 5353
    ohasd.bin, ocssd.bin, gipcd.bin are using port 42424
    oc4j is using port 8888
    GNS is using port 53

Compare the IP addresses of the PUBLIC and PRIVATE interfaces against profile.xml 
 - Devices should report UP BROADCAST RUNNING MULTICAST 
 - Double check that the NETWORK addresses match the profile.xml settings   
    <gpnp:HostNetwork id="gen" HostName="*">
      <gpnp:Network id="net1" IP="192.168.5.0" Adapter="eth1" Use="public"/>
      <gpnp:Network id="net2" IP="192.168.2.0" Adapter="eth2" Use="asm,cluster_interconnect"/>
  <orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/>
  <orcl:ASM-Profile id="asm" DiscoveryString="/dev/asm*" SPFile="+DATA/ract2/ASMPARAMETERFILE/registry.253.870352347" Mode="remote"/>

eth1      Link encap:Ethernet  HWaddr 08:00:27:7D:8E:49  
          inet addr:192.168.5.121  Bcast:192.168.5.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth2      Link encap:Ethernet  HWaddr 08:00:27:4E:C9:BF  
          inet addr:192.168.2.121  Bcast:192.168.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

  --> IP="192.168.5.0" Adapter="eth1" should match --> eth1 : inet addr:192.168.5.121  Bcast:192.168.5.255  Mask:255.255.255.0 
      IP="192.168.2.0" Adapter="eth2" should match --> eth2 : inet addr:192.168.2.121  Bcast:192.168.2.255  Mask:255.255.255.0

Run Cluvfy Debug script : cluv.sh

#!/bin/bash 
#
#  cluvfy commands to debug a partially up and running CRS stack 
# 
#  Node1  hract21:  CRS stack is not starting
#  Node2  hract22:  CRS stack is fully up and running 
#
#  Note : This script assumes we are logged in on the failing node : hract21
#
Node1=hract21
Node2=hract22

~/CLUVFY/bin/cluvfy -version
~/CLUVFY/bin/cluvfy comp  nodecon  -n $Node1,$Node2 -i  eth1
~/CLUVFY/bin/cluvfy comp  nodecon  -n $Node1,$Node2 -i  eth2
~/CLUVFY/bin/cluvfy comp nodereach -n $Node1,$Node2 
~/CLUVFY/bin/cluvfy comp sys -p crs
#
# Some of the cluvfy checks need a working CRS stack :
# 
# When clusterware is not up and running, cluvfy comp healthcheck may report the errors below - these errors are expected 
#    Cluster Manager Integrity     FAILED     This test checks the integrity of cluster manager across the cluster nodes.
#    Cluster Integrity         FAILED     This test checks the integrity of the cluster.
#    OCR Integrity             FAILED     This test checks the integrity of OCR across the cluster nodes.
#    CRS Integrity             FAILED     This test checks the integrity of Oracle Clusterware stack across the cluster nodes.
#
ssh $Node2 ~/CLUVFY/bin/cluvfy comp healthcheck -collect cluster -html
ssh $Node2 ~/CLUVFY/bin/cluvfy comp software -n  $Node1,$Node2 
# Testing multicast 
ssh $Node2 ~/CLUVFY/bin/cluvfy stage -post hwos -n $Node1
ssh $Node2 ~/CLUVFY/bin/cluvfy stage -pre crsinst -n $Node1 -networks eth1:192.168.5.0:PUBLIC/eth2:192.168.2.0:cluster_interconnect -verbose

Review Clusterware Tracefiles

Review the following traces on the failed Node :
   /var/log/messages
   /u01/app/grid/diag/crs/hract21/crs/trace/alert.log

Search the CW tracefiles [ these commands scan entries for a 10-minute time period ]
# grep "2015-02-16 09:2" *  | egrep "no exe permission|failed to start process"
# grep "2015-02-16 10:3" *  | egrep " fatal | terminating "
# grep "2015-02-17 11:4" *  | grep gipcmodNetworkProcessBind
# grep "2015-02-17 14:1" *  | grep gipcmodNetworkResolve

Debug Clusterware startup problems with DTRACE

Why use DTRACE ?

  • Using DTRACE scripts can save you many hours of debugging
  • DTRACE scripts can be used to capture intermittent errors
  • DTRACE can easily be configured to track very specific things like:   [ How many bytes were written by the 5th I/O for file XXXX after it was newly created ]
  • DTRACE can provide very detailed information about system call usage and timing
  • DTRACE is the next level of debugging on Linux - much better than strace
  • Note :  Not all errors reported by DTRACE are significant or prevent CW from booting
  • Best practice is to test the startup on a working node to figure out what is relevant and what is not

If you change a lot of things for testing, and you have many RAC instances running
  where lots of people are testing, DTRACE will become your best friend !
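
A quick taste of what this looks like - the one-liner below (a minimal sketch, not the check_rac.d script discussed later) prints every failed open() call on the box together with the process name and the negative errno value; stop it with CTRL-C:

# dtrace -n 'syscall::open:return /arg0 < 0/ { printf("%s: open() failed with error %d", execname, arg0); }'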

 

Versions used and OS error codes handled by the DTRACE script

  • OEL 6.6 with Linux hract21.example.com 3.8.13-55.1.2.el6uek.x86_64
  • Oracle 12.1.0.2
Installed Dtrace packages:
[root@hract21 Desktop]# rpm -qa | grep dtrace
dtrace-utils-0.4.5-2.el6.x86_64
dtrace-utils-devel-0.4.5-2.el6.x86_64
libdtrace-ctf-0.4.1-1.x86_64
dtrace-modules-3.8.13-44.1.1.el6uek-0.4.3-4.el6.x86_64
libdtrace-ctf-devel-0.4.1-1.x86_64
dtrace-modules-headers-0.4.3-4.el6.x86_64
dtrace-modules-3.8.13-55.1.2.el6uek-0.4.3-4.el6.x86_64

Error Codes tracked by dtrace script: 
/usr/include/asm-generic/errno-base.h

#define    ENOENT        2    /* No such file or directory */
#define    ENXIO         6    /* No such device or address */
#define    EAGAIN        11    /* Try again */
#define    EACCES        13    /* Permission denied */

/usr/include/asm-generic/errno.h
#define    EADDRINUSE       98    /* Address already in use */
#define    EADDRNOTAVAIL    99    /* Cannot assign requested address */
#define    ECONNREFUSED    111    /* Connection refused */
#define    EHOSTUNREACH    113    /* No route to host */
#define    EINPROGRESS     115    /* Operation now in progress */
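
When a DTRACE probe later reports a negative return code, the matching symbolic name can be looked up directly in these header files. For example, an error of -13 in the DTRACE output maps to EACCES / Permission denied:

# grep -w 13 /usr/include/asm-generic/errno-base.h /usr/include/asm-generic/errno.h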

First DTRACE steps

  • For DTRACE installation and first steps please read the following article
  • If you have DTRACE already installed, run the commands below and the quick sanity test shown after the provider listing
# modprobe systrace 
Now check the provider
[root@oel65 Dtrace]# dtrace -l 
   ID   PROVIDER            MODULE                          FUNCTION NAME
    1     dtrace                                                     BEGIN
    2     dtrace                                                     END
    3     dtrace                                                     ERROR
    4    syscall           vmlinux                              read entry
    5    syscall           vmlinux                              read return
..
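
If the syscall provider shows up as above, a quick sanity test (just a verification step, not part of check_rac.d) is to trace every program start on the box for a few seconds before moving on to the bigger script; stop it with CTRL-C:

# dtrace -n 'syscall::execve:entry { printf("%s starts %s", execname, copyinstr(arg0)); }'

You should see entries for shells, cron jobs etc., which proves the syscall probes actually fire.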

 

Debug CW startup problems with DTRACE  – Different Case studies

During the testing we will rate 3 different debugging approaches: DTRACE, Cluvfy and reading the tracefiles.
Each approach gets rated with up to 5 stars ( ***** ) depending on how useful it was.
Case I   : Any execve() errors for all executables like:  crsd.bin
DTRACE        : ***
CLUVFY        : *****  [ most useful tool ]
Reading Traces: *
Comment       : Debugging steps are the same for all CW executables
Case II  : Open errors for files HOSTNAME.pid like hract21.pid
DTRACE        : ***** [ most useful tool ]
CLUVFY        :
Reading Traces: *
Comment       : Debugging steps are the same for all CRS components
Case III : Open errors for special socket files like : /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock
DTRACE        : *****  [ most useful tool ]
CLUVFY        :
Reading Traces: **
Comment       : Debugging steps are the same for all CRS components
Case IV  : GPNPD  doesn't start - mismatch between profile.xml and the PRIVATE INTERFACE address
DTRACE        :
CLUVFY        : *****  [ most useful tool ]
Reading Traces: **
Case V   : GIPCD  doesn't start - mismatch between profile.xml and the PUBLIC  INTERFACE address
DTRACE        : ****
CLUVFY        : *****  [ most useful tool ]
Reading Traces: **
Case VI  : CSSD doesn't start as Port 42424 is already in use
DTRACE        : *****  [ most useful tool ]
CLUVFY        :
Reading Traces: *
Case VII : CSSD, GIPCD don't start as Port 5353 is already in use
DTRACE        : *****  [ most useful tool ]
CLUVFY        :
Reading Traces: *
Case VIII: GIPCD, GPNPD, CSSD not starting as the Nameserver is not reachable
          [ ECONNREFUSED    111 -> Connection refused ]
DTRACE        : ***
CLUVFY        : *****  [ most useful tool ]
Reading Traces: *
    • Note : CLUVFY is not always the best choice to debug CW startup problems.
    • DTRACE is an easy-to-use utility and any DBA should use it in their production environment, as it can reduce the time needed to solve errors a lot !!

Prepare DTRACE script check_rac.d  and run it first time

You only need to change the grid location and the pid_file in the BEGIN section:
grid_loc = "/u01/app/121/grid";    <<< Change this
pid_file="hract21.pid"             <<< Change this - replace hract21 with your hostname

As not all OS errors are fatal, run the DTRACE script on a working node first [ the output below includes comments ]:
[root@hract21 DTRACE]# dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 21 probes
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
---> You can ignore the     ENXIO 6    - No such device or address - during opening  /var/tmp/.oracle/npohasd
0     89                   connect:return - Exec: gipcd.bin - PID: 26471  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0     89                   connect:return - Exec: gipcd.bin - PID: 26489  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0     89                   connect:return - Exec: gipcd.bin - PID: 26505  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
---> gipcd.bin successfully connects to our nameserver
0     93                    sendto:return - Exec: mdnsd.bin - PID: 26428  sendto() failed with error : -32 - fd : 46
0     93                    sendto:return - Exec: crsd.bin - PID: 26874  sendto() failed with error : -32 - fd : 210
0    103                      bind:return - Exec: ons - PID: 27044  bind() failed with error : -98 - fd : 9 - IP: 0.0.0.0 - Port: 6200
0     89                   connect:return - Exec: gipcd.bin - PID: 26571  connect() failed with error : -113 - fd : 191 - IP: 192.168.5.123 - Port: 43956
0     89                   connect:return - Exec: gipcd.bin - PID: 26571  connect() failed with error : -113 - fd : 122 - IP: 192.168.5.123 - Port: 43956
0     89                   connect:return - Exec: gipcd.bin - PID: 26571  connect() failed with error : -113 - fd : 122 - IP: 192.168.5.123 - Port: 43956
0     89                   connect:return - Exec: gipcd.bin - PID: 26571  connect() failed with error : -113 - fd : 122 - IP: 192.168.5.123 - Port: 43956
0     89                   connect:return - Exec: gipcd.bin - PID: 26571  connect() failed with error : -113 - fd : 120 - IP: 192.168.5.123 - Port: 43956
0     89                   connect:return - Exec: gipcd.bin - PID: 26571  connect() failed with error : -113 - fd : 121 - IP: 192.168.5.123 - Port: 43956
0     89                   connect:return - Exec: gipcd.bin - PID: 26571  connect() failed with error : -113 - fd : 122 - IP: 192.168.5.123 - Port: 43956
0     89                   connect:return - Exec: gipcd.bin - PID: 26571  connect() failed with error : -113 - fd : 123 - IP: 192.168.5.123 - Port: 43956
--->  gipcd.bin fails to connect to node hract23 with error: EHOSTUNREACH 113 - No route to host - as this node was down
0    123                    execve:return - Exec: oradism - execve()  /u01/app/121/grid/bin/oradism -  Lower CLUSTERWARE stack successfully started - ret code:  0 - EXITING !
---> As soon as CW comes up and ASM runs the executable oradism, we stop the DTRACE script by calling exit() !

Case I: CRSD daemon doesn’t start due to permission problem ( crsd/crsd.bin)

Force that  error and monitor Clusterware Resource status after startup:

The command below prints the status of the lower CRS stack every 2 seconds.
This allows you to track which component keeps restarting and finally fails (a crsctl-based alternative is shown after the output below).
# watch crsi
*****  Local Resources: *****
Resource NAME               INST   TARGET      STATE        SERVER          STATE_DETAILS
--------------------------- ----   ---------- ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE    ONLINE       hract21         STABLE
ora.cluster_interconnect.haip  1   ONLINE    ONLINE       hract21         STABLE
ora.crf                        1   ONLINE    ONLINE       hract21         STABLE
ora.crsd                       1   ONLINE    OFFLINE      -               STABLE
ora.cssd                       1   ONLINE    ONLINE       hract21         STABLE
ora.cssdmonitor                1   ONLINE    ONLINE       hract21         STABLE
ora.ctssd                      1   ONLINE    ONLINE       hract21         OBSERVER,STABLE
ora.diskmon                    1   OFFLINE   OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE    ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE    INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE    ONLINE       hract21         STABLE
ora.gpnpd                      1   ONLINE    ONLINE       hract21         STABLE
ora.mdnsd                      1   ONLINE    ONLINE       hract21         STABLE
ora.storage                    1   ONLINE    ONLINE       hract21         STABLE
--> CRSD doesn't start
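
Note: crsi used above is not a standard Oracle utility but a local helper/alias. If no such wrapper is available, a rough equivalent (an assumption - the column layout will differ slightly) is to watch the init resources directly with crsctl:

# watch -n 2 "$GRID_HOME/bin/crsctl stat res -t -init"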

CaseI-1: Shell script /u01/app/121/grid/bin/crsd with wrong protection
# chmod 000 /u01/app/121/grid/bin/crsd

CLUVFY:
Note: as our CRS stack is not fully running on hract21, we have to run cluvfy from hract22
[grid@hract21 CLUVFY]$ ssh hract22  ~/CLUVFY/bin/cluvfy comp software -n hract21
Verifying software
Check: Software
Component: crs
Node Name: hract21
   /u01/app/121/grid/bin/crsd..."Permissions" did not match reference
3162 files verified
Software check failed

TRACEFILE review :
Use Grep command:
# grep "2015-02-16 09:2" *  | egrep "no exe permission|failed to start process"

alert.log:
2015-02-16 09:25:05.681 [ORAROOTAGENT(16807)]CRS-5013: Agent "ORAROOTAGENT" failed to start process "/u01/app/121/grid/bin/crsd"
for action "start": details at "(:CLSN00008:)" in "/u01/app/grid/diag/crs/hract21/crs/trace/ohasd_orarootagent_root.trc"
2015-02-16 09:25:06.706 [ORAROOTAGENT(16807)]CRS-5013: Agent "ORAROOTAGENT" failed to start process "/u01/app/121/grid/bin/crsd"
for action "start": details at "(:CLSN00008:)" in "/u01/app/grid/diag/crs/hract21/crs/trace/ohasd_orarootagent_root.trc"

/u01/app/grid/diag/crs/hract21/crs/trace/ohasd_orarootagent_root.trc:
2015-02-16 09:25:06.704671 :CLSDYNAM:3169421056: [ora.crsd]{0:0:2} [start] (:CLSN00008:)Utils:execCmd scls_process_spawn() failed 1
2015-02-16 09:25:06.704859 :CLSDYNAM:3169421056: [ora.crsd]{0:0:2} [start] (:CLSN00008:) category: -1, operation: fail, loc: canexec2,
OS error: 0, other: no exe permission, file [/u01/app/121/grid/bin/crsd]
2015-02-16 09:25:06.705358 :CLSDYNAM:3169421056: [ora.crsd]{0:0:2} [start] clsnUtils::error Exception type=2 string=
CRS-5013: Agent "ORAROOTAGENT" failed to start process "/u01/app/121/grid/bin/crsd" for action "start":
details at "(:CLSN00008:)" in "/u01/app/grid/diag/crs/hract21/crs/trace/ohasd_orarootagent_root.trc"

DTRACE :
DTRACE is not helpful here !
--> Using dtrace will not help as the OS error is 0:
   the CW software itself reads the protections of the crsd shell script and stops the clusterware startup if they are not sufficient.

FIX :
Change permissions and restart CRS
[root@hract21 Desktop]#  ls -l    /u01/app/121/grid/bin/crsd
---------- 1 root oinstall 9419 Jan 30 12:26 /u01/app/121/grid/bin/crsd
[root@hract21 Desktop]# chmod 755  /u01/app/121/grid/bin/crsd
[root@hract21 Desktop]# ls -l    /u01/app/121/grid/bin/crsd
-rwxr-xr-x 1 root oinstall 9419 Jan 30 12:26 /u01/app/121/grid/bin/crsd


CaseI-2: Executable /u01/app/121/grid/bin/crsd.bin with wrong protection
#  ls -l  /u01/app/121/grid/bin/crsd.bin
-rwxr----x 1 root oinstall 369207571 Jun 30  2014 /u01/app/121/grid/bin/crsd.bin
# chmod 000 /u01/app/121/grid/bin/crsd.bin

CLUVFY :
Note: as our CRS stack is not fully running on hract21, we have to run cluvfy from hract22
[grid@hract21 CLUVFY]$ ssh hract22  ~/CLUVFY/bin/cluvfy comp software -n hract21
Verifying software
Check: Software
Component: crs
Node Name: hract21
    /u01/app/121/grid/bin/crsd.bin..."Permissions" did not match reference
3162 files verified
Software check failed

TRACEFILE review :
Grep command:
$  grep "2015-02-16 10:3" *  | egrep " fatal | terminating "

alert.log
2015-02-16 10:30:04.075 [OCSSD(29033)]CRS-1713: CSSD daemon is started in hub mode
2015-02-16 10:30:04.271 [OCSSD(29033)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc
2015-02-16 10:30:05.297 [OCSSD(29033)]CRS-1652: Starting clean up of CRSD resources.
2015-02-16 10:30:05.307 [OCSSD(29033)]CRS-1653: The clean up of the CRSD resources failed.
Mon Feb 16 10:30:05 2015
Errors in file /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc  (incident=2897):
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/grid/diag/crs/hract21/crs/incident/incdir_2897/ocssd_i2897.trc

2015-02-16 10:30:05.475 [OCSSD(29033)]CRS-8503: Oracle Clusterware OCSSD process with operating system process ID 29033 experienced fatal signal or exception code 6
Sweep [inc][2897]: completed

--> The above-mentioned tracefiles don't help - so what is the real problem here ?

DTRACE SCRIPT :
/*
Trace all failed execve() calls and stop tracing after oradism has been started, by calling exit()
*/
syscall::execve:entry
{
self->path = copyinstr(arg0);
}

/*
Trace all failed execve() calls on the system.
This allows us to track failed calls for OS utils like awk, sed, ...
*/
syscall::execve:return
/arg0<0 && execname != "crsctl.bin"/
{
printf("- Exec: %s - execve()    %s failed with error: %d", execname, self->path, arg0 );
self->path = 0;
}

/*
Track successful CW start - stop tracing after oradism was started by calling exit()
*/

syscall::execve:return
/execname == "oradism"/
{
printf("- Exec: %s - execve()  %s -  Lower CLUSTERWARE stack successfully started - ret code:  %d - EXITING !", execname, self->path, arg0 );
self->path = 0;
exit(0);
}

DTRACE OUTPUT :
Using our dtrace script will quickly identify the root cause  [ crsd.bin failed due to Permission error ]
[root@hract21 DTRACE]# dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 21 probes
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0     89                   connect:return - Exec: mdnsd.bin - PID: 17355  connect() failed with error : -101 - fd : 39 - IP: 17.17.17.17 - Port: 256
0     89                   connect:return - Exec: gipcd.bin - PID: 17400  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0     89                   connect:return - Exec: gipcd.bin - PID: 17412  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0     89                   connect:return - Exec: gipcd.bin - PID: 17429  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0    123                    execve:return - Exec: crsd - execve()    /u01/app/121/grid/bin/crsd.bin failed with error: -13

FIX :
Change permissions and restart CRS
[root@hract21 Desktop]#  ls -l    /u01/app/121/grid/bin/crsd.bin
---------- 1 root oinstall 369207571 Jun 30  2014 /u01/app/121/grid/bin/crsd.bin
[root@hract21 Desktop]# chmod 741 /u01/app/121/grid/bin/crsd.bin
[root@hract21 Desktop]# ls -l    /u01/app/121/grid/bin/crsd.bin
-rwxr----x 1 root oinstall 369207571 Jun 30  2014 /u01/app/121/grid/bin/crsd.bin

Case II : GIPCD daemon doesn’t start as HOSTNAME.pid file is not readable

Clusterware 12.1.0.2 uses the following HOSTNAME.pid files for reporting the PID.
If CW can't write to such a PID file, the related CW component may not start (a quick permission check is sketched after the list).
/u01/app/121/grid/ohasd/init/hract21.pid
/u01/app/121/grid/osysmond/init/hract21.pid
/u01/app/121/grid/gpnp/init/hract21.pid
/u01/app/121/grid/gipc/init/hract21.pid
/u01/app/121/grid/log/hract21/gpnpd/hract21.pid
/u01/app/121/grid/ctss/init/hract21.pid
/u01/app/121/grid/gnsd/init/hract21.pid
/u01/app/121/grid/crs/init/hract21.pid
/u01/app/121/grid/crf/admin/run/crflogd/lhract21.pid
/u01/app/121/grid/crf/admin/run/crfmond/shract21.pid
/u01/app/121/grid/evm/init/hract21.pid
/u01/app/121/grid/mdns/init/hract21.pid
/u01/app/121/grid/ologgerd/init/hract21.pid
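
A quick way to spot a pid file with broken permissions on the local node (a small sketch - it assumes GRID_HOME is set and that the files follow the naming shown above):

# Flag every *<hostname>.pid file below GRID_HOME whose mode is not 644
for f in $(find $GRID_HOME -name "*$(hostname -s).pid" 2>/dev/null); do
   perm=$(stat -c '%a' $f)
   [ "$perm" != "644" ] && echo "WARNING: $f has mode $perm - expected 644"
done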

Force that error and monitor Clusterware Resource status after startup:
[root@hract21 DTRACE]# ls -l /u01/app/121/grid/gipc/init/hract21.pid
-rw-r--r-- 1 grid oinstall 6 Feb 16 10:30 /u01/app/121/grid/gipc/init/hract21.pid
[root@hract21 DTRACE]# chmod 000  /u01/app/121/grid/gipc/init/hract21.pid

*****  Local Resources: *****
Resource NAME               INST   TARGET       STATE        SERVER          STATE_DETAILS
--------------------------- ----   ------------ ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE       OFFLINE      -               STABLE
ora.cluster_interconnect.haip  1   ONLINE       OFFLINE      -               STABLE
ora.crf                        1   ONLINE       OFFLINE      -               STABLE
ora.crsd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssdmonitor                1   OFFLINE      OFFLINE      -               STABLE
ora.ctssd                      1   ONLINE       OFFLINE      -               STABLE
ora.diskmon                    1   OFFLINE      OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE       ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE       INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE       OFFLINE      hract21         STARTING
ora.gpnpd                      1   ONLINE       INTERMEDIATE hract21         STABLE
ora.mdnsd                      1   ONLINE       ONLINE       hract21         STABLE
ora.storage                    1   ONLINE      OFFLINE      -               STABLE
--> GIPCD doesn't start

CLUVFY:
Found no cluvfy command to detect this error

TRACEFILE review :
alert.log :
Mon Feb 16 12:09:03 2015
Errors in file /u01/app/grid/diag/crs/hract21/crs/trace/gipcd.trc  (incident=2921):
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/grid/diag/crs/hract21/crs/incident/incdir_2921/gipcd_i2921.trc
2015-02-16 12:09:03.181 [GIPCD(14763)]CRS-8503: Oracle Clusterware GIPCD process with operating system process ID 14763
experienced fatal signal or exception code 6 - Sweep [inc][2921]: completed
--> Got no further indication that the file permissions on /u01/app/121/grid/gipc/init/hract21.pid are the root cause

DTRACE SCRIPT:
syscall::open:entry
{
self->path = copyinstr(arg0);
}

syscall::open:return
/arg0<0 && execname!= "crsctl.bin" && substr( self->path,0,grid_len)==  grid_loc &&  strstr(self->path, pid_file ) == pid_file  /
{
printf("- Exec: %s - open() %s failed with error: %d - scan_dir:  %s - PID-File : %s ", execname, self->path, arg0, substr( self->path,0,grid_len), pid_file );
}

DTRACE OUTPUT :
DTrace helps us to find that problem very quickly :
[root@hract21 DTRACE]#  dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 21 probes
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: oraagent.bin - open() /u01/app/121/grid/gipc/init/hract21.pid failed with error: -13 - scan_dir:  /u01/app/121/grid - PID-File : hract21.pid
0     89                   connect:return - Exec: mdnsd.bin - PID: 19658  connect() failed with error : -101 - fd : 39 - IP: 17.17.17.17 - Port: 256
0     89                   connect:return - Exec: gipcd.bin - PID: 19702  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0      9                      open:return - Exec: gipcd.bin - open() /u01/app/121/grid/gipc/init/hract21.pid failed with error: -13 - scan_dir:  /u01/app/121/grid - PID-File : hract21.pid
0      9                      open:return - Exec: gipcd.bin - open() /u01/app/121/grid/gipc/init/hract21.pid failed with error: -13 - scan_dir:  /u01/app/121/grid - PID-File : hract21.pid

FIX :
Change permission and reboot Clusterware :
[root@hract21 DTRACE]# chmod 644 /u01/app/121/grid/gipc/init/hract21.pid
[root@hract21 DTRACE]# ls -l /u01/app/121/grid/gipc/init/hract21.pid
-rw-r--r-- 1 grid oinstall 5 Feb 16 11:37 /u01/app/121/grid/gipc/init/hract21.pid

 

Case III: GPnP fails due to unreadable CW “special” socket files located in /var/tmp/.oracle/

Typically this directory contains a number of "special" socket files that are used by local clients to
connect via the IPC protocol (sqlnet) to various Oracle processes including the TNS listener, the CSS,
CRS & EVM daemons or even  database or ASM instances
Force that  error and monitor Clusterware Resource status after startup:

[root@hract21 DTRACE]# chmod 000  /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock
[root@hract21 DTRACE]#  ls -l  /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock
---------- 1 grid oinstall 0 Feb 15 08:18 /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock

*****  Local Resources: *****
Resource NAME               INST   TARGET       STATE        SERVER          STATE_DETAILS
--------------------------- ----   ------------ ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE       OFFLINE      -               STABLE
ora.cluster_interconnect.haip  1   ONLINE       OFFLINE      -               STABLE
ora.crf                        1   ONLINE       OFFLINE      -               STABLE
ora.crsd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssdmonitor                1   OFFLINE      OFFLINE      -               STABLE
ora.ctssd                      1   ONLINE       OFFLINE      -               STABLE
ora.diskmon                    1   OFFLINE      OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE       ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE       INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE       OFFLINE      -               STABLE
ora.gpnpd                      1   ONLINE       OFFLINE      hract21         STARTING
ora.mdnsd                      1   ONLINE       ONLINE       hract21         STABLE
ora.storage                    1   ONLINE       OFFLINE      -               STABLE
--> GPnPD doesn't start !

CLUVFY:
Found no cluvfy command to detect this error

TRACEFILE review :
alert.log :
2015-02-16 12:37:45.345 [GPNPD(21849)]CRS-8500: Oracle Clusterware GPNPD process is starting with operating system process ID 21849
2015-02-16 12:37:45.388 [GPNPD(21849)]CRS-8501: Oracle Clusterware GPNPD process with operating system process ID 21849 is ending with return value 48
2015-02-16 12:47:45.281 [ORAAGENT(21818)]CRS-5818: Aborted command 'start' for resource 'ora.gpnpd'. Details at (:CRSAGF00113:) {0:0:2}
in /u01/app/grid/diag/crs/hract21/crs/trace/ohasd_oraagent_grid.trc.

ohasd_oraagent_grid.trc:
2015-02-16 12:47:45.837353 :    AGFW:308582144: {0:0:2} Agent sending reply for: RESOURCE_START[ora.gpnpd 1 1] ID 4098:270
2015-02-16 12:47:45.838926 :CLSDYNAM:310683392: [ora.gpnpd]{0:0:2} [start] (:CLSN00107:) clsn_agent::start }
2015-02-16 12:47:49.288144 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [start] got lock
2015-02-16 12:47:49.288165 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [start] tryActionLock }
2015-02-16 12:47:49.288187 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [start] abort  }
2015-02-16 12:47:49.288205 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [start] (:CLSN00110:) clsn_agent::abort }
2015-02-16 12:47:49.288230 :    AGFW:304379648: {0:0:2} Command: start for resource: ora.gpnpd 1 1 completed with status: TIMEDOUT
2015-02-16 12:47:49.288313 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] Utils::getOracleHomeAttrib getEnvVar oracle_home:/u01/app/121/grid
2015-02-16 12:47:49.288361 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] Utils::getOracleHomeAttrib oracle_home:/u01/app/121/grid
2015-02-16 12:47:49.288672 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] PID 20189 from /u01/app/121/grid/gpnp/init/hract21.pid
2015-02-16 12:47:49.288708 :  CLSDMC:304379648: Connecting to ipc://hract21_DBG_GPNPD
2015-02-16 12:47:49.289427 :  CLSDMC:304379648: Error: gipcWait for gipcConnect - ret_gipcreqinfo=gipcretConnectionRefused, type_gipcreqinfo=gipcreqtypeConnect
2015-02-16 12:47:49.290221 :    AGFW:308582144: {0:0:2} Agent sending reply for: RESOURCE_START[ora.gpnpd 1 1] ID 4098:270
2015-02-16 12:47:49.292126 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] ClsdmClient::sendMessage clsdmc_send error rmsg:0 ecode:-7 errbuf:
2015-02-16 12:47:49.292208 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] Calling PID check for daemon
2015-02-16 12:47:49.292269 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] Process id 20189 translated to
2015-02-16 12:47:49.292384 :  CLSDMC:304379648: Connecting to ipc://hract21_DBG_GPNPD
2015-02-16 12:47:49.292604 :  CLSDMC:304379648: Error: gipcWait for gipcConnect - ret_gipcreqinfo=gipcretConnectionRefused, type_gipcreqinfo=gipcreqtypeConnect
2015-02-16 12:47:49.292658 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] ClsdmClient::sendMessage clsdmc_send error rmsg:0 ecode:-7 errbuf:
2015-02-16 12:47:49.293759 :    AGFW:308582144: {0:0:2} ora.gpnpd 1 1 state changed from: STARTING to: OFFLINE
2015-02-16 12:47:49.293871 :    AGFW:308582144: {0:0:2} Agent sending last reply for: RESOURCE_START[ora.gpnpd 1 1] ID 40
--> Got no further indication that the file permissions on /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock are the root cause !

DTRACE SCRIPT :
syscall::open:entry
{
self->path = copyinstr(arg0);
}

syscall::open:return
/arg0<0 && substr( self->path,0,var_tmp_len) == var_tmp_loc &&  substr(execname,0,4)  != "asm_"  /
{
printf("- Exec: %s - open() %s failed with error: %d - scan_dir:  %s ", execname, self->path, arg0, substr( self->path,0,var_tmp_len)   );
}

DTRACE OUTPUT :
Again DTrace helps us to find that problem very quickly :
[root@hract21 DTRACE]# !dt
dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 21 probes
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0     89                   connect:return - Exec: mdnsd.bin - PID: 20575  connect() failed with error : -101 - fd : 39 - IP: 17.17.17.17 - Port: 256
0      9                      open:return - Exec: gpnpd.bin - open() /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock failed with error: -13 - scan_dir:  /var/tmp/.oracle

FIX :
Change permission and reboot Clusterware :
[root@hract21 DTRACE]# chmod 644  /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock
[root@hract21 DTRACE]# ls -l  /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock
-rw-r--r-- 1 grid oinstall 0 Feb 15 08:18 /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock

Case IV:  GPNPD doesn’t start – mismatch between profile.xml and the PRIVATE INTERFACE address

Potential problem:

  • PRIVATE interface was changed without changing profile.xml
Monitor Clusterware Resource status after startup:
*****  Local Resources: *****
Resource NAME               INST   TARGET       STATE        SERVER          STATE_DETAILS
--------------------------- ----   ------------ ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE       OFFLINE      -               STABLE
ora.cluster_interconnect.haip  1   ONLINE    OFFLINE      -               STABLE
ora.crf                        1   ONLINE    ONLINE       hract21         STABLE
ora.crsd                       1   ONLINE    OFFLINE      -               STABLE
ora.cssd                       1   ONLINE    OFFLINE      hract21         STARTING
ora.cssdmonitor                1   ONLINE     ONLINE       hract21         STABLE
ora.ctssd                      1   ONLINE    OFFLINE      -               STABLE
ora.diskmon                    1   ONLINE     OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE    ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE    INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE    ONLINE       hract21         STABLE
ora.gpnpd                      1   ONLINE    INTERMEDIATE hract21         STABLE
ora.mdnsd                      1   ONLINE    ONLINE       hract21         STABLE
ora.storage                    1   ONLINE    OFFLINE      -               STABLE
--> GPnPD daemon does not start

CLUVFY:
Cluvfy fails with PRVG-11050 error
[grid@hract21 CLUVFY]$  ssh hract22   ~/CLUVFY/bin/cluvfy stage -post crsinst -n hract21,hract22
Performing post-checks for cluster services setup
Checking node reachability...
Node reachability check passed from node "hract22"
Checking user equivalence...
User equivalence check passed for user "grid"
Checking node connectivity...
Checking hosts config file...
Verification of the hosts config file successful
ERROR:
PRVG-11050 : No matching interfaces "eth2" for subnet "192.168.2.0" on nodes "hract21"

TRACEFILE review :
alert.log:
2015-02-17 09:42:27.823 [OCSSD(15855)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc
2015-02-17 09:42:27.824 [OCSSD(15855)]CRS-1603: CSSD on node hract21 shutdown by user.
2015-02-17 09:42:27.823 [CSSDAGENT(15844)]CRS-5818: Aborted command 'start' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:0:2} in /u01/app/grid/diag/crs/hract21/crs/trace/ohasd_cssdagent_root.trc.
Tue Feb 17 09:42:32 2015
Errors in file /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc  (incident=2977):
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/grid/diag/crs/hract21/crs/incident/incdir_2977/ocssd_i2977.trc

2015-02-17 09:42:33.019 [OCSSD(15855)]CRS-8503: Oracle Clusterware OCSSD process with operating system process ID 15855 experienced fatal signal or exception code 6
Sweep [inc][2977]: completed
2015-02-17 09:42:38.005 [OHASD(11954)]CRS-2757: Command 'Start' timed out waiting for response from the resource 'ora.cssd'. Details at (:CRSPE00163:) {0:0:2} in /u01/app/grid/diag/crs/hract21/crs/trace/ohasd.trc.

ocssd.trc:
2015-02-17 09:42:32.451021 :    CSSD:2417551104: 
   clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228, wrtcnt, 963949, LATS 92477974, lastSeqNo 963946, uniqueness 1424074596, timestamp 1424162551/21220694
2015-02-17 09:42:32.451113 :    CSSD:2422281984: 
   clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228, wrtcnt, 963950, LATS 92477974, lastSeqNo 963947, uniqueness 1424074596, timestamp 1424162552/21220904
Trace file /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc
Oracle Database 12c Clusterware Release 12.1.0.2.0 - Production Copyright 1996, 2014 Oracle. All rights reserved.
DDE: Flood control is not active
CLSB:2467473152: Oracle Clusterware infrastructure error in OCSSD (OS PID 15855): Fatal signal 6 has occurred in program ocssd thread 2467473152; nested signal count is 1
Incident 2977 created, dump file: /u01/app/grid/diag/crs/hract21/crs/incident/incdir_2977/ocssd_i2977.trc
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
2015-02-17 09:42:33.108629 :    CSSD:2450904832: clssscWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 1000 with cvtimewait status 4294967186
2015-02-17 09:42:33.451785 :    CSSD:2417551104: clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228, wrtcnt, 963952, LATS 92478974, lastSeqNo 963949, uniqueness 1424074596, timestamp 1424162552/21221694
2015-02-17 09:42:33.451933 :    CSSD:2422281984: clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228, wrtcnt, 963953, LATS 92478974, lastSeqNo 963950, uniqueness 1424074596, timestamp 1424162553/21221904
--> Here we know that we have a networking problem

DTRACE OUTPUT:
- In this case DTRACE will not help.
Oracle retrieves the IP addresses via ioctl() and compares them to profile.xml, as the trace below shows:
32373 <... ioctl resumed> 200, {{"lo", {AF_INET, inet_addr("127.0.0.1")}}, {"eth0", {AF_INET, inet_addr("192.168.1.7")}},
{"eth1", {AF_INET, inet_addr("192.168.5.121")}}, {"eth2", {AF_INET, inet_addr("192.168.7.121")}},
{"eth3", {AF_INET, inet_addr("192.168.3.121")}}}}) = 0

Investigate & Fix :
Check profile.xml
[root@hract21 network-scripts]#   $GRID_HOME/bin/gpnptool get 2>/dev/null  |  xmllint --format - | egrep 'CSS-Profile|ASM-Profile|Network id'
<gpnp:HostNetwork id="gen" HostName="*">
<gpnp:Network id="net1" IP="192.168.5.0" Adapter="eth1" Use="public"/>
<gpnp:Network id="net2" IP="192.168.2.0" Adapter="eth2" Use="asm,cluster_interconnect"/>
<orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/>
<orcl:ASM-Profile id="asm" DiscoveryString="/dev/asm*" SPFile="+DATA/ract2/ASMPARAMETERFILE/registry.253.870352347" Mode="remote"/>
-> eth2 is our CI network interface - with 192.168.2.0 as the related NETWORK address

[grid@hract21 trace]$ ping -I eth2 192.168.2.122
Warning: cannot bind to specified iface, falling back: Operation not permitted
PING 192.168.2.122 (192.168.2.122) from 192.168.1.7 eth2: 56(84) bytes of data
--> This tells us we have a problem with our CI !

[root@hract21 network-scripts]# ifconfig eth2
eth2      Link encap:Ethernet  HWaddr 08:00:27:4E:C9:BF
inet addr:192.168.7.121  Bcast:192.168.7.255  Mask:255.255.255.0
inet6 addr: fe80::a00:27ff:fe4e:c9bf/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
--> eth2 is up and running but configured with the wrong network address
Fix :
  Change address for eth2  back to  inet addr:192.168.2.121 and restart network and CW

 

Case V  :  GIPCD  doesn’t start – mismatch between profile.xml and the PUBLIC  INTERFACE address

Potential problem:

  • /etc/hosts and nslookup not in sync
  • PUBLIC interface was changed without changing profile.xml
  • DNS returned a wrong host address
Force that  error and monitor Clusterware Resource status after startup:
*****  Local Resources: *****
Resource NAME               INST   TARGET       STATE        SERVER          STATE_DETAILS
--------------------------- ----   ------------ ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE       OFFLINE      -               STABLE
ora.cluster_interconnect.haip  1   ONLINE       OFFLINE      -               STABLE
ora.crf                        1   ONLINE       ONLINE       hract21         STABLE
ora.crsd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssdmonitor                1   ONLINE       ONLINE       hract21         STABLE
ora.ctssd                      1   ONLINE       OFFLINE      -               STABLE
ora.diskmon                    1   ONLINE       OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE       ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE       INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE       OFFLINE      -               STABLE
ora.gpnpd                      1   ONLINE       ONLINE       hract21         STABLE
ora.mdnsd                      1   ONLINE       ONLINE       hract21         STABLE
ora.storage                    1   ONLINE       OFFLINE      -               STABLE
--> GIPCD doesn't start

CLUVFY:
[grid@hract21 CLUVFY]$ cluvfy  comp nodecon -n hract21,hract22
Verifying node connectivity
ERROR:
PRVF-6006 : unable to reach the IP addresses "hract21,hract22" from the local node
PRKC-1071 : Nodes "hract21,hract22" did not respond to ping in "3" seconds,
PRKN-1035 : Host "hract21" is unreachable
PRKN-1035 : Host "hract22" is unreachable
Verification cannot proceed
Verification of node connectivity was unsuccessful on all the specified nodes.

TRACEFILE review :
gipcd.trc:
2015-02-17 11:48:39.300878 :GIPCXCPT:3369244416:  gipcmodNetworkProcessBind: slos op  :  sgipcnTcpBind
2015-02-17 11:48:39.300880 :GIPCXCPT:3369244416:  gipcmodNetworkProcessBind: slos dep :  Cannot assign requested address (99)
2015-02-17 11:48:39.300882 :GIPCXCPT:3369244416:  gipcmodNetworkProcessBind: slos loc :  bind
2015-02-17 11:48:39.300884 :GIPCXCPT:3369244416:  gipcmodNetworkProcessBind: slos info:  addr '192.168.7.121:0'
2015-02-17 11:48:39.300920 :GIPCXCPT:3369244416:  gipcBindF [gipcInternalEndpoint : gipcInternal.c : 468]: EXCEPTION[ ret gipcretAddressNotAvailable (39) ]
failed to bind endp 0x7fb6a4027990 [0000000000000306] { gipcEndpoint : localAddr 'tcp://192.168.7.121', remoteAddr '', numPend 0,
numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj (nil), sendp 0x7fb6a4033bd0
status 13flags 0x20008000, flags-2 0x0, usrFlags 0x20020 }, addr 0x7fb6a4033070 [000000000000030d] { gipcAddress :
name 'tcp://hract21.example.com', objFlags 0x0, addrFlags 0x4 }, flags 0x20020
2015-02-17 11:48:39.300928 :GIPCXCPT:3369244416:  gipcInternalEndpoint: failed to bind address to endpoint name 'tcp://hract21.example.com',
ret gipcretAddressNotAvailable (39)

Grep Command :
[grid@hract21 trace]$  grep "2015-02-17 11:4" * | egrep 'gipcmodNetworkProcessBind'
gipcd.trc:2015-02-17 11:48:38.129278 :GIPCXCPT:2967607040:  gipcmodNetworkProcessBind: slos op  :  sgipcnTcpBind
gipcd.trc:2015-02-17 11:48:38.129280 :GIPCXCPT:2967607040:  gipcmodNetworkProcessBind: slos dep :  Cannot assign requested address (99)
gipcd.trc:2015-02-17 11:48:38.129281 :GIPCXCPT:2967607040:  gipcmodNetworkProcessBind: slos loc :  bind
gipcd.trc:2015-02-17 11:48:38.129283 :GIPCXCPT:2967607040:  gipcmodNetworkProcessBind: slos info:  addr '192.168.7.121:0'
--> The grep command is quite useful !

DTRACE SCRIPT :
/*
Generic DTRACE script tracking IP-Address and ports for  bind() system calls:
*/
syscall::bind:entry
{
self->fd = arg0;
self->sockaddr =  arg1;
sockaddrp  =(struct sockaddr *)copyin(self->sockaddr, sizeof(struct sockaddr));
s = (char * )sockaddrp;
self->port =  ( unsigned short )(*(s+3)) + ( unsigned short ) ((*(s+2)*256));
self->ip1=*(s+4);
self->ip2=*(s+5);
self->ip3=*(s+6);
self->ip4=*(s+7);
}

/*
Generic DTRACE script tracking failed bind() system calls:
*/
syscall::bind:return
/arg0<0 && execname != "crsctl.bin"/
{
printf("- Exec: %s - PID: %d  bind() failed with error : %d - fd : %d - IP: %d.%d.%d.%d - Port: %d " , execname, pid, arg0, self->fd,
self->ip1, self->ip2, self->ip3, self->ip4,    self->port  );
}

DTRACE OUTPUT :
[root@hract21 DTRACE]# dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 21 probes
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0     89                   connect:return - Exec: mdnsd.bin - PID: 26518  connect() failed with error : -101 - fd : 39 - IP: 17.17.17.17 - Port: 256
0    103                      bind:return - Exec: gipcd.bin - PID: 26658  bind() failed with error : -99 - fd : 87 - IP: 192.168.7.121 - Port: 0
0    103                      bind:return - Exec: gipcd.bin - PID: 26696  bind() failed with error : -99 - fd : 87 - IP: 192.168.7.121 - Port: 0
0    103                      bind:return - Exec: gipcd.bin - PID: 26722  bind() failed with error : -99 - fd : 87 - IP: 192.168.7.121 - Port: 0
0    103                      bind:return - Exec: gipcd.bin - PID: 26740  bind() failed with error : -99 - fd : 87 - IP: 192.168.7.121 - Port: 0
0    103                      bind:return - Exec: gipcd.bin - PID: 26757  bind() failed with error : -99 - fd : 87 - IP: 192.168.7.121 - Port: 0

Investigate & Fix :
Check profile.xml
[root@hract21 network-scripts]#   $GRID_HOME/bin/gpnptool get 2>/dev/null  |  xmllint --format - | egrep 'CSS-Profile|ASM-Profile|Network id'
<gpnp:HostNetwork id="gen" HostName="*">
<gpnp:Network id="net1" IP="192.168.5.0" Adapter="eth1" Use="public"/>
<gpnp:Network id="net2" IP="192.168.2.0" Adapter="eth2" Use="asm,cluster_interconnect"/>
<orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/>
<orcl:ASM-Profile id="asm" DiscoveryString="/dev/asm*" SPFile="+DATA/ract2/ASMPARAMETERFILE/registry.253.870352347" Mode="remote"/>
-> eth1 is our PUBLIC network interface - with 192.168.5.0 as the related NETWORK address

[root@hract21 Desktop]# ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 08:00:27:7D:8E:49
inet addr:192.168.5.121  Bcast:192.168.5.255  Mask:255.255.255.0
inet6 addr: fe80::a00:27ff:fe7d:8e49/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
--> ifconfig looks good but why is gipcd.bin picking up  192.168.7.121 ?
[root@hract21 Desktop]# ping hract21
PING hract21 (192.168.7.121) 56(84) bytes of data.
--> ping uses wrong address too and hangs

[root@hract21 Desktop]# grep hract21 /etc/hosts
192.168.7.121 hract21 hract21.example.com

FIX --> Modify hostname entry in /etc/hosts

 

Case VI : CSSD doesn’t start as Port 42424 is already in use

Force that  error and monitor Clusterware Resource status after startup:
[root@hract21 JAVA]#   java JavaUDPServer 42424
Listening on UDP Port: 42424
..
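
The JavaUDPServer program is just a convenient way to occupy the port - any UDP listener will do. If the Java class is not at hand, something like the following works as well (an assumption, not the author's tool; the exact flags depend on your netcat flavor):

# nc -u -l 42424 &        ( traditional netcat may need:  nc -u -l -p 42424 & )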
*****  Local Resources: *****
Resource NAME               INST   TARGET    STATE        SERVER          STATE_DETAILS
--------------------------- ----   ------------ ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE    OFFLINE      -               STABLE
ora.cluster_interconnect.haip  1   ONLINE    OFFLINE      -               STABLE
ora.crf                        1   ONLINE    ONLINE       hract21         STABLE
ora.crsd                       1   ONLINE    OFFLINE      -               STABLE
ora.cssd                       1   ONLINE    OFFLINE      hract21         STARTING
ora.cssdmonitor                1   ONLINE    ONLINE       hract21         STABLE
ora.ctssd                      1   ONLINE    OFFLINE      -               STABLE
ora.diskmon                    1   ONLINE    OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE    ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE    INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE    ONLINE       hract21         STABLE
ora.gpnpd                      1   ONLINE    INTERMEDIATE hract21         STABLE
ora.mdnsd                      1   ONLINE    ONLINE       hract21         STABLE
ora.storage                    1   ONLINE    OFFLINE      -               STABLE
--> GPNPD, CSSD daemons don't start

CLUVFY:
Note: Was not able to identify the problem with the following cluvfy command:
[grid@hract21 CLUVFY]$ ssh hract22 cluvfy stage -pre crsinst -n hract21 -networks eth1:192.168.5.0:PUBLIC/eth2:192.168.2.0:cluster_interconnect -verbose

TRACEFILE review :
ocssd.trc:
2015-02-17 12:23:48.485840 :GIPCXCPT:884037376: gipcmodNetworkProcessBind: failed to bind endp 0x7f4f28237090 [0000000000002bcf] { gipcEndpoint : localAddr 'mcast://224.0.0.251:42424/192.168.2.121', remoteAddr '', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj (nil), sendp 0x7f4f282391b0 status 13flags 0x20000000, flags-2 0x0, usrFlags 0xc000 }, addr 0x7f4f28237b50 [0000000000002bd1] { gipcAddress : name 'mcast://224.0.0.251:42424/192.168.2.121', objFlags 0x0, addrFlags 0x5 }
2015-02-17 12:23:48.485840 :GIPCXCPT:884037376: gipcmodNetworkProcessBind: slos op  :  sgipcnMctBind
2015-02-17 12:23:48.485840 :GIPCXCPT:884037376: gipcmodNetworkProcessBind: slos dep :  Address already in use (98)
2015-02-17 12:23:48.485840 :GIPCXCPT:884037376: gipcmodNetworkProcessBind: slos loc :  bind
2015-02-17 12:23:48.485840 :GIPCXCPT:884037376: gipcmodNetworkProcessBind: slos info:  Invalid argument
GREP command :
$ grep "2015-02-17 12:" * | egrep 'gipcmodNetworkProcessBind'


DTRACE SCRIPT :
syscall::bind:entry
{
self->fd = arg0;
self->sockaddr =  arg1;
sockaddrp  =(struct sockaddr *)copyin(self->sockaddr, sizeof(struct sockaddr));
s = (char * )sockaddrp;
self->port =  ( unsigned short )(*(s+3)) + ( unsigned short ) ((*(s+2)*256));
self->ip1=*(s+4);
self->ip2=*(s+5);
self->ip3=*(s+6);
self->ip4=*(s+7);
}

/*
Generic DTRACE script tracking failed bind() system calls:
*/
syscall::bind:return
/arg0<0 && execname != "crsctl.bin"/
{
printf("- Exec: %s - PID: %d  bind() failed with error : %d - fd : %d - IP: %d.%d.%d.%d - Port: %d " , execname, pid, arg0, self->fd,
self->ip1, self->ip2, self->ip3, self->ip4,    self->port  );
}

DTRACE OUTPUT :
[root@hract21 DTRACE]# dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 19 probes
dtrace: buffer size lowered to 1m
dtrace: dynamic variable size lowered to 1m
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0    103                      bind:return - Exec: ocssd.bin - PID: 13064  bind() failed with error : -98 - fd : 211 - IP: 224.0.0.251 - Port: 42424
..
0     89                   connect:return - Exec: gipcd.bin - PID: 12957  connect() - fd : 27 - IP: 192.168.5.50 - Port: 53
0     89                   connect:return - Exec: gipcd.bin - PID: 12957  connect() - fd : 87 - IP: 192.168.5.50 - Port: 53
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 76 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 114 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 124 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 125 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: ocssd.bin - PID: 13064  bind() failed with error : -98 - fd : 133 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 126 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: ocssd.bin - PID: 13064  bind() failed with error : -98 - fd : 136 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 150 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: ocssd.bin - PID: 13064  bind() failed with error : -98 - fd : 137 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: gipcd.bin - PID: 12957  bind() failed with error : -98 - fd : 151 - IP: 224.0.0.251 - Port: 42424
0    103                      bind:return - Exec: ocssd.bin - PID: 13064  bind() failed with error : -98 - fd : 138 - IP: 224.0.0.251 - Port: 42424

Investigate & Fix :
[root@hract21 Desktop]#  netstat -taupen | egrep ":53 |:5353 |:42424"
udp        0      0 0.0.0.0:5353      501        35360021   12828/mdnsd.bin
udp        0      0 0.0.0.0:5353      501        35360016   12828/mdnsd.bin
udp        0      0 0.0.0.0:5353      501        35360015   12828/mdnsd.bin
udp        0      0 0.0.0.0:5353      501        35360014   12828/mdnsd.bin
udp   264064      0 0.0.0.0:42424       0          35356639   12631/java
--> The clusterware port 42424 is used by a java program with PID 12631

FIX -->  Kill that process:
# kill -9 12631
After that the Clusterware stack should start immediately.
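
If lsof or fuser is installed you can double-check which process really owns the port before killing it:

[root@hract21 Desktop]# lsof -i UDP:42424
[root@hract21 Desktop]# fuser -v -n udp 42424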

 

Case VII :  MDNSD doesn’t start as Port 5353 is already in use

Status of the lower Clusterware stack resources after startup:

*****  Local Resources: *****
Resource NAME               INST   TARGET       STATE        SERVER          STATE_DETAILS
--------------------------- ----   ------------ ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE       OFFLINE      -               STABLE
ora.cluster_interconnect.haip  1   ONLINE       OFFLINE      -               STABLE
ora.crf                        1   ONLINE       ONLINE       hract21         STABLE
ora.crsd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssdmonitor                1   ONLINE       ONLINE       hract21         STABLE
ora.ctssd                      1   ONLINE       OFFLINE      -               STABLE
ora.diskmon                    1   ONLINE       OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE       ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE       INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE       OFFLINE      -               STABLE
ora.gpnpd                      1   ONLINE       ONLINE       hract21         STABLE
ora.mdnsd                      1   ONLINE       INTERMEDIATE hract21         STABLE
ora.storage                    1   ONLINE       OFFLINE      -               STABLE
--> MDNSD doesn't start 
GREP command:
[grid@hract21 trace]$    grep "2015-02-17 12:4" * | egrep 'Address already in use'
mdnsd.trc:
2015-02-17 12:43:26.211079 :  CLSDMT:2281699072: PID for the Process [19764], connkey 9
2015-02-17 12:43:27.193282 :    MDNS:2353129024:  mdnsd interface eth0 (0x2 AF=2 f=0x1043 mcast=-1) 
                                192.168.1.7 mask 255.255.255.0 FAILED. Error 98 (Address already in use)
2015-02-17 12:43:27.194932 :    MDNS:2353129024:  mdnsd interface eth1 (0x3 AF=2 f=0x1043 mcast=-1) 192.168.5.121 mask 255.255.255.0 FAILED. Error 98 (Address already in use)
2015-02-17 12:43:27.194986 :    MDNS:2353129024:  mdnsd interface eth2 (0x4 AF=2 f=0x1043 mcast=-1) 
                                192.168.2.121 mask 255.255.255.0 FAILED. Error 98 (Address already in use)
2015-02-17 12:43:27.198670 :    MDNS:2353129024:  mdnsd interface eth3 (0x5 AF=2 f=0x1043 mcast=-1) 
                                192.168.3.121 mask 255.255.255.0 FAILED. Error 98 (Address already in use)
2015-02-17 12:43:27.198723 :    MDNS:2353129024:  mdnsd interface lo (0x1 AF=2 f=0x49 mcast=-1) 127.0.0.1 mask 255.0.0.0 FAILED. Error 98 (Address already in use)
2015-02-17 12:43:27.198726 :    MDNS:2353129024:  Error! No valid netowrk interfaces found to setup mDNS.
2015-02-17 12:43:27.198729 :    MDNS:2353129024:  Oracle mDNSResponder ver. mDNSResponder-1076 (Jun 30 2014 19:39:45) , init_rv=-65537
2015-02-17 12:43:27.198818 :    MDNS:2353129024:  stopping

CLUVFY :
The following cluvfy command doesn't detect the problem:
[grid@hract21 CLUVFY]$  ssh hract22 cluvfy stage -pre crsinst -n hract21,hract22 -networks eth1:192.168.5.0:PUBLIC/eth2:192.168.2.0:cluster_interconnect

DTRACE SCRIPT :
syscall::bind:entry
{
self->fd = arg0;
self->sockaddr =  arg1;
sockaddrp  =(struct sockaddr *)copyin(self->sockaddr, sizeof(struct sockaddr));
s = (char * )sockaddrp;
self->port =  ( unsigned short )(*(s+3)) + ( unsigned short ) ((*(s+2)*256));
self->ip1=*(s+4);
self->ip2=*(s+5);
self->ip3=*(s+6);
self->ip4=*(s+7);
}

/*
Generic DTRACE script tracking failed bind() system calls:
*/
syscall::bind:return
/arg0<0 && execname != "crsctl.bin"/
{
printf("- Exec: %s - PID: %d  bind() failed with error : %d - fd : %d - IP: %d.%d.%d.%d - Port: %d " , execname, pid, arg0, self->fd,
self->ip1, self->ip2, self->ip3, self->ip4,    self->port  );
}

DTRACE OUTPUT :
[root@hract21 DTRACE]# dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 19 probes
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0     93                    sendto:return - Exec: ohasd.bin - PID: 17321  sendto() failed with error : -32 - fd : 173
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0    103                      bind:return - Exec: mdnsd.bin - PID: 18943  bind() failed with error : -98 - fd : 33 - IP: 0.0.0.0 - Port: 5353
0    103                      bind:return - Exec: mdnsd.bin - PID: 18943  bind() failed with error : -98 - fd : 34 - IP: 0.0.0.0 - Port: 5353
0    103                      bind:return - Exec: mdnsd.bin - PID: 18943  bind() failed with error : -98 - fd : 34 - IP: 0.0.0.0 - Port: 5353
0    103                      bind:return - Exec: mdnsd.bin - PID: 18943  bind() failed with error : -98 - fd : 34 - IP: 0.0.0.0 - Port: 5353
0    103                      bind:return - Exec: mdnsd.bin - PID: 18943  bind() failed with error : -98 - fd : 34 - IP: 0.0.0.0 - Port: 5353
0    103                      bind:return - Exec: mdnsd.bin - PID: 18943  bind() failed with error : -98 - fd : 34 - IP: 0.0.0.0 - Port: 5353

Investigate & Fix :
[root@hract21 network-scripts]#  netstat -taupen | egrep ":53 |:5353 |:42424"
udp        0      0 0.0.0.0:5353          0  36230279   18804/ohasd.bin
udp        0      0 230.0.1.0:42424       0  36230278   18804/ohasd.bin
udp        0      0 224.0.0.251:42424     0  35356639   12631/java
--> The clusterware port 5353 is used by a java program with PID 17263

FIX -->  Kill that process and restart the Clusterware stack:
# kill -9 17263
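
To catch such a port conflict before starting the stack you may run a small pre-start check. The helper below is just a sketch ( the name chk_cw_ports.sh and the port list are assumptions - adjust them to your setup ):

#!/bin/bash
# chk_cw_ports.sh : report any process already bound to the well-known Clusterware ports
# Run this before : crsctl start crs
for p in 53 5353 42424 ; do
    echo "== Port $p =="
    netstat -taupen | grep ":$p "
done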

Case VIII: GIPCD, GPNPD, CSSD not starting as the Nameserver is not reachable

  • OS Error [ ECONNREFUSED    111 -> Connection refused ]
Force that error and monitor the Clusterware resource status after startup:
On the nameserver run:
[root@ns1 ~]# service named stop
Stopping named: .                                          [  OK  ]
*****  Local Resources: *****
Resource NAME               INST   TARGET    STATE        SERVER          STATE_DETAILS
--------------------------- ----   ------------ ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE    OFFLINE      -               STABLE
ora.cluster_interconnect.haip  1   ONLINE    OFFLINE      -               STABLE
ora.crf                        1   ONLINE    ONLINE       hract21         STABLE
ora.crsd                       1   ONLINE    OFFLINE      -               STABLE
ora.cssd                       1   ONLINE    OFFLINE      hract21         STARTING
ora.cssdmonitor                1   ONLINE    ONLINE       hract21         STABLE
ora.ctssd                      1   ONLINE    OFFLINE      -               STABLE
ora.diskmon                    1   ONLINE    OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE    ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE    INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE    OFFLINE      -               STABLE
ora.gpnpd                      1   ONLINE    INTERMEDIATE hract21         STABLE
ora.mdnsd                      1   ONLINE    ONLINE       hract21         STABLE
ora.storage                    1   ONLINE    OFFLINE      -               STABLE
--> GIPCD, GPNPD, CSSD are not starting

CLUVFY :
[grid@hract21 trace]$ ~/CLUVFY/bin/cluvfy -version
PRVF-0002 : Could not retrieve local nodename

TRACEFILE review :
Grep the trace files for resolver errors [ OS function:  getaddrinfo() ]
[grid@hract21 trace]$ grep "2015-02-17 14:1" * | grep gipcmodNetworkResolve
gipcd.trc:
2015-02-17 14:13:36.137197 :GIPCXCPT:2309576448:  gipcInternalEndpoint: failed to bind address to endpoint name 'tcp://hract21.example.com', ret gipcretFail (1)
2015-02-17 14:13:41.141266 :GIPCXCPT:2309576448:  gipcmodNetworkResolve: failed to create new address for osName 'hract21.example.com', name 'tcp://hract21.example.com'
2015-02-17 14:13:41.141285 :GIPCXCPT:2309576448:  gipcmodNetworkResolve: slos op  :  sgipcnPopulateAddrInfo
2015-02-17 14:13:41.141289 :GIPCXCPT:2309576448:  gipcmodNetworkResolve: slos dep :  Connection refused (111)
2015-02-17 14:13:41.141293 :GIPCXCPT:2309576448:  gipcmodNetworkResolve: slos loc :  getaddrinfo(
2015-02-17 14:13:41.141297 :GIPCXCPT:2309576448:  gipcmodNetworkResolve: slos info:  server not available,try again
2015-02-17 14:13:41.141342 :GIPCXCPT:2309576448:  gipcResolveF [gipcInternalBind : gipcInternal.c : 537]: EXCEPTION[ ret gipcretFail (1) ]  
          failed to resolve address 0x7fd764033be0 [0000000000000310] 
          { gipcAddress : name 'tcp://hract21.example.com', objFlags 0x0, addrFlags 0x8 }, flags 0x4000
2015-02-17 14:13:41.141365 :GIPCXCPT:2309576448:  gipcBindF [gipcInternalEndpoint : gipcInternal.c : 468]: EXCEPTION[ ret gipcretFail (1) ]  failed to bind endp 0x7fd764033070 [000000000000030e] { gipcEndpoint : localAddr 'tcp://hract21.example.com', remoteAddr '', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj (nil), sendp (nil) status 13flags 0x40008000, flags-2 0x0, usrFlags 0x240a0 }, addr 0x7fd764034890 [0000000000000315] { gipcAddress : name 'tcp://hract21.example.com', objFlags 0x0, addrFlags 0x8 }, flags 0x200a0

DTRACE SCRIPT helper:
Use strace to get an idea of how to write a working DTRACE script.
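For example you may attach strace to the already running gipcd.bin process. This is only a sketch - the output file name and the PID placeholder are assumptions, the exact invocation used for the trace below is not shown:

[root@hract21 DTRACE]# strace -f -o /tmp/gipcd.strace -p <pid of gipcd.bin>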
22752 connect(27, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.5.50")}, 16 <unfinished ...>
22750 <... ioctl resumed> , {ifr_name="eth2", ifr_broadaddr={AF_INET, inet_addr("192.168.2.255")}}) = 0
22752 <... connect resumed> )           = 0
22750 ioctl(28, SIOCGIFFLAGS <unfinished ...>
22752 poll([{fd=27, events=POLLOUT}], 1, 0 <unfinished ...>
22750 <... ioctl resumed> , {ifr_name="eth3", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 0
22752 <... poll resumed> )              = 1 ([{fd=27, revents=POLLOUT}])
22750 ioctl(28, SIOCGIFADDR <unfinished ...>
22752 sendto(27, "\320X\1\0\0\1\0\0\0\0\0\0\7hract21\7example\3com"..., 37, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
22750 <... ioctl resumed> , {ifr_name="eth3", ifr_addr={AF_INET, inet_addr("192.168.3.121")}}) = 0
22752 <... sendto resumed> )            = 37
22750 ioctl(28, SIOCGIFNETMASK <unfinished ...>
22752 poll([{fd=27, events=POLLIN|POLLOUT}], 1, 5000 <unfinished ...>
22750 <... ioctl resumed> , {ifr_name="eth3", ifr_netmask={AF_INET, inet_addr("255.255.255.0")}}) = 0
22752 <... poll resumed> )              = 1 ([{fd=27, revents=POLLOUT}])
22750 ioctl(28, SIOCGIFBRDADDR <unfinished ...>
22752 sendto(27, "\16\227\1\0\0\1\0\0\0\0\0\0\7hract21\7example\3com"..., 37, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
22750 <... ioctl resumed> , {ifr_name="eth3", ifr_broadaddr={AF_INET, inet_addr("192.168.3.255")}}) = 0
22752 <... sendto resumed> )            = -1 ECONNREFUSED (Connection refused)

--> The connect() call on fd=27 works - parameter 2 of our connect() call holds the IP address.
    The following sendto() call ( sendto(27,.. ) fails with error ECONNREFUSED.
    To select the right sendto() call you need to use the PID ( 22752 ) and the file descriptor fd=27 ( sendto(27, .. )

Requirements for the DTRACE script:
- Collect the IP address from the preceding connect() call ( we need to trace all connect() calls )
- Trace the sendto() call for errors like ECONNREFUSED
- Use the file descriptor ( fd=27 ) to tie the connect() call to the sendto() call
- Always attach strace to the gipcd process first to verify whether your Oracle version
  executes the same system calls in the same order

DTRACE SCRIPT :
/*
connect:entry records the fd and decodes the peer sockaddr_in ( same byte layout
as in the bind:entry probe above ). Note: ns_ip_port ( the nameserver port, 53 )
must be defined elsewhere in check_rac.d - it is not shown in this excerpt.
*/
syscall::connect:entry
{
self->fd = arg0;
self->sockaddr = arg1;
sockaddrp = (struct sockaddr *)copyin(self->sockaddr, sizeof(struct sockaddr));
s = (char *)sockaddrp;
self->port = ( unsigned short )(*(s+3)) + ( unsigned short )((*(s+2)*256));
self->ip1 = *(s+4);
self->ip2 = *(s+5);
self->ip3 = *(s+6);
self->ip4 = *(s+7);
}

/* Report connect() calls to the nameserver port and any failed sendto() */
syscall::connect:return
/self->port == ns_ip_port && execname != "crsctl.bin" /
{
printf("- Exec: %s - PID: %d  connect() - fd : %d - IP: %d.%d.%d.%d - Port: %d " , execname, pid, self->fd,
self->ip1, self->ip2, self->ip3, self->ip4, self->port  );
}

syscall::sendto:entry
/execname != "crsctl.bin" /
{
self->fds = arg0;
}

syscall::sendto:return
/arg0<0 &&  execname != "crsctl.bin"  /
{
printf("- Exec: %s - PID: %d  sendto() failed with error : %d - fd : %d " , execname, pid, arg0, self->fds );
}

DTRACE OUTPUT :
[root@hract21 DTRACE]# !dt
dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 21 probes
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0     93                    sendto:return - Exec: orarootagent.bi - PID: 29204  sendto() failed with error : -111 - fd : 15
0     93                    sendto:return - Exec: oraagent.bin - PID: 29308  sendto() failed with error : -111 - fd : 15
0     93                    sendto:return - Exec: oraagent.bin - PID: 29308  sendto() failed with error : -111 - fd : 15
0     89                   connect:return - Exec: gipcd.bin - PID: 29363  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0     93                    sendto:return - Exec: gipcd.bin - PID: 29363  sendto() failed with error : -111 - fd : 27
0     89                   connect:return - Exec: gipcd.bin - PID: 29363  connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0     93                    sendto:return - Exec: mdnsd.bin - PID: 29320  sendto() failed with error : -111 - fd : 7
0     93                    sendto:return - Exec: gpnpd.bin - PID: 29343  sendto() failed with error : -111 - fd : 15
--> In this sample gipcd.bin is failing to communicate with the nameserver.
    The failed system call is sendto() - error ECONNREFUSED (111) - Connection refused - following a
    successful connect() system call.
    Note: The file descriptor fd=27 shows that the connect() and sendto() system calls operate on the same socket.

Investigate & Fix
[root@hract21 network-scripts]# ping ns1.example.com
ping: unknown host ns1.example.com

--> Fix : Restart your nameserver and check the nameserver IP address/port
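
Once named is running again you can quickly verify name resolution directly against the nameserver. The IP 192.168.5.50 is taken from the DTrace connect() output above - adjust it to your setup:

[root@hract21 network-scripts]# dig @192.168.5.50 hract21.example.com +short
[root@hract21 network-scripts]# nslookup hract21.example.com 192.168.5.50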

 
