Debugging network problems in a 3-node cluster using a bash script

Overview

  • The bash script works for 11.2 RAC systems with a valid SSH configuration
  • The script may have bugs and needs modification to run on a 2-node or 4-node cluster
  • It’s a good idea to have this script configured and tested before any network problem comes up
  • The local listener ora.LISTENER.lsnr on grac43 depends on the VIP resource ora.grac43.vip (see the VIP test scenario below)
  • Script  ./rac_net_testing.sh:   download location
  • Run this script as user root or grid – ssh must work and tools like srvctl and olsnodes must be in your PATH and work via ssh ( short test: ssh grac42 $GRID_HOME/bin/olsnodes )
  • The root or grid user must use the bash shell ( if using csh, ssh commands may return an "Ambiguous output redirect" error )
  • If ping or nslookup commands fail intermittently, check whether the Linux firewall is disabled. For details check the following article.
  • Configure this script by running
      • Stage I:  ./rac_net_testing.sh   -precheck_rac
      • Stage II:  ./rac_net_testing.sh  -mtu
      • Stage III:  ./rac_net_testing.sh -ipaddr
      • Stage IV:  ./rac_net_testing.sh -gns
  • Collect static network data:  ./rac_net_testing.sh  -precheck_perf
  • Run specific Networking tests
      • Ping public nodenames: ./rac_net_testing.sh -pingpubip
      • Ping private IP addresses: ./rac_net_testing.sh -pingprivip
      • Traceroute PRIVATE network:   ./rac_net_testing.sh -traceroute
      • Testing Name Resolution:  ./rac_net_testing.sh -nslookup
      • Testing VIP status:  ./rac_net_testing.sh  -vip
      • Test SCAN VIP status: ./rac_net_testing.sh  -scan
  • Finally, configure and run the script with the -netall option:   ./rac_net_testing.sh  -netall
  • -netall runs all specific networking tests and can be configured to run multiple times

Linux Command Reference

  • /bin/netstat -in
  • /sbin/ifconfig
  • /bin/ping -s <MTU> -c 2 -I source_IP nodename
  • /bin/traceroute -s source_IP -r -F  nodename-priv <MTU-28>
  • /usr/bin/nslookup
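
These are the building blocks the script combines over ssh with the parameters explored below; a minimal sketch, assuming the MTU and addresses of this demo cluster:

    MTU=1500                    # explored in Stage II
    MTU28=$(( MTU - 28 ))       # traceroute payload: MTU minus 28 bytes of IP/ICMP headers
    ssh grac41 "/bin/netstat -in"
    ssh grac41 "/sbin/ifconfig eth1"
    ssh grac41 "/bin/ping -s $MTU -c 2 -I 192.168.1.101 grac42"
    ssh grac41 "/bin/traceroute -s 192.168.2.101 -r -F 192.168.2.102 $MTU28"
    ssh grac41 "/usr/bin/nslookup grac4-scan"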

Preparing and configuring script ./rac_net_testing.sh

  • Note: you must collect the needed information by following Stage I – Stage IV
  • We will collect all data needed to configure the script parameters in a step-by-step approach
  • You only need to know your RAC hostnames
  
Stage I:  Run  ./rac_net_testing.sh -precheck_rac  and gather the parameters needed to run this script
Explore script parameters: PUB_IF PRIV_IF  host1 host2 host3  priv1 priv2 priv3  scan scan1 scan2 scan3  fullscan

# ./rac_net_testing.sh   -precheck_rac
*************************************************
*** Generic RAC check   ***
*************************************************
*** CLuster-Name:
grac4
*** Nodeapps Info: GNS/ONS/VIP/Network device
Network exists: 1/192.168.1.0/255.255.255.0/eth1, type dhcp
VIP exists: /192.168.1.167/192.168.1.167/192.168.1.0/255.255.255.0/eth1, hosting node grac41
VIP exists: /192.168.1.178/192.168.1.178/192.168.1.0/255.255.255.0/eth1, hosting node grac42
VIP exists: /192.168.1.177/192.168.1.177/192.168.1.0/255.255.255.0/eth1, hosting node grac43
GSD exists
ONS exists: Local port 6100, remote port 6200, EM port 2016
*** SCAN Info: 
SCAN name: grac4-scan.grid4.example.com, Network: 1/192.168.1.0/255.255.255.0/eth1
SCAN VIP name: scan1, IP: /grac4-scan.grid4.example.com/192.168.1.171
SCAN VIP name: scan2, IP: /grac4-scan.grid4.example.com/192.168.1.251
SCAN VIP name: scan3, IP: /grac4-scan.grid4.example.com/192.168.1.173
*** CLuster INFO :
Host Cluster-No Private-Interc. VIP
grac41    1    192.168.2.101    192.168.1.167
grac42    2    192.168.2.102    192.168.1.178
grac43    3    192.168.2.103    192.168.1.177

*** GPnP Info - Verify profile.xml  on all nodes
Is GPNPD daemon  running ? - If not CLSGPNP_NO_DAEMON error  should be reported
grac41.example.com
----
Is GPNPD daemon  running ? - If not CLSGPNP_NO_DAEMON error  should be reported
grac42.example.com
Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running). 
Error CLSGPNP_NO_DAEMON getting profile.
--> GPnPD not running - only local profile is available  - cleck whether CW is up on grac42  
----
Is GPNPD daemon  running ? - If not CLSGPNP_NO_DAEMON error  should be reported
grac43.example.com
----
--> Check ProfileSequence: 
grac41.example.com
ProfileSequence="11" ClusterName="grac4"
----
--> Check ProfileSequence: 
grac42.example.com
ProfileSequence="11" ClusterName="grac4"
----
--> Check ProfileSequence: 
grac43.example.com
ProfileSequence="11" ClusterName="grac4"
----
--> Profile.xml extract 
grac41.example.com
    <gpnp:HostNetwork id="gen" HostName="*">
      <gpnp:Network id="net1" IP="192.168.1.0" Adapter="eth1" Use="public"/>
      <gpnp:Network id="net2" IP="192.168.2.0" Adapter="eth2" Use="cluster_interconnect"/>
  <orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/>
  <orcl:ASM-Profile id="asm" DiscoveryString="/dev/asm*,/dev/oracleasm/disks/*" SPFile="+OCR/grac4/asmparameterfile/spfileCopyASM.ora"/>
----
--> Profile.xml extract 
grac42.example.com
    <gpnp:HostNetwork id="gen" HostName="*">
      <gpnp:Network id="net1" IP="192.168.1.0" Adapter="eth1" Use="public"/>
      <gpnp:Network id="net2" IP="192.168.2.0" Adapter="eth2" Use="cluster_interconnect"/>
  <orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/>
  <orcl:ASM-Profile id="asm" DiscoveryString="/dev/asm*,/dev/oracleasm/disks/*" SPFile="+OCR/grac4/asmparameterfile/spfileCopyASM.ora"/>
----
--> Profile.xml extract 
grac43.example.com
    <gpnp:HostNetwork id="gen" HostName="*">
      <gpnp:Network id="net1" IP="192.168.1.0" Adapter="eth1" Use="public"/>
      <gpnp:Network id="net2" IP="192.168.2.0" Adapter="eth2" Use="cluster_interconnect"/>
  <orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/>
  <orcl:ASM-Profile id="asm" DiscoveryString="/dev/asm*,/dev/oracleasm/disks/*" SPFile="+OCR/grac4/asmparameterfile/spfileCopyASM.ora"/>
----
--> Note GPnP data from all nodes should be identical  
-->  Parameters explored in Stage I:                                            Variable settings 
     Adapter="eth1" Use="public"                                                PUB_IF=eth1
     Adapter="eth2" Use="cluster_interconnect"                                  PRIV_IF=eth2
     Host    Cluster-No Private-Interc.   VIP
     grac41     1       192.168.2.101   192.168.1.167                           host1=grac41 priv1=192.168.2.101 vip1=192.168.1.167 
     grac42     2       192.168.2.102   192.168.1.178                           host2=grac42 priv2=192.168.2.102 vip2=192.168.1.178
     grac43     3       192.168.2.103   192.168.1.177                           host3=grac43 priv3=192.168.2.103 vip3=192.168.1.177
     SCAN VIP name: scan1, IP: /grac4-scan.grid4.example.com/192.168.1.171      scan1=192.168.1.171
     SCAN VIP name: scan2, IP: /grac4-scan.grid4.example.com/192.168.1.251      scan2=192.168.1.251
     SCAN VIP name: scan3, IP: /grac4-scan.grid4.example.com/192.168.1.173      scan3=192.168.1.173
     SCAN name: grac4-scan.grid4.example.com                                    scan=grac4-scan  ( short name )
                                                                                fullscan=grac4-scan.grid4.example.com ( FQDN)
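
The same data can be cross-checked manually with the clusterware tools the script relies on; a short sketch (run as grid or root with $GRID_HOME set):

    $GRID_HOME/bin/olsnodes -n -i -p      # node name, node number, private interconnect, VIP
    $GRID_HOME/bin/oifcfg getif           # adapter -> subnet -> public / cluster_interconnect
    $GRID_HOME/bin/srvctl config scan     # SCAN name and SCAN VIP addresses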

Stage II:  Explore MTU size
Explore script parameters: MTU MTU28
#   ./rac_net_testing.sh  -mtu
TESTING MTU Size  
grac41.example.com
Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth1       1500   0  2296946      1      0      0  2177438      0      0      0 BMRU
eth1:1     1500   0      - no statistics available -                            BMRU
eth1:3     1500   0      - no statistics available -                            BMRU
eth1:4     1500   0      - no statistics available -                            BMRU
eth2       1500   0 19155395   2055      0      0 13978212      0      0      0 BMRU
eth2:1     1500   0      - no statistics available -                            BMRU
grac42.example.com
Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth1       1500   0    93245      0      0      0    78976      0      0      0 BMRU
eth1:1     1500   0      - no statistics available -                            BMRU
eth1:2     1500   0      - no statistics available -                            BMRU
eth2       1500   0  4622591      0      0      0  4648030      0      0      0 BMRU
eth2:1     1500   0      - no statistics available -                            BMRU
grac43.example.com
Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth1       1500   0     4566      0      0      0     3023      0      0      0 BMRU
eth1:1     1500   0      - no statistics available -                            BMRU
eth1:2     1500   0      - no statistics available -                            BMRU
eth2       1500   0   206817      0      0      0   150402      0      0      0 BMRU
eth2:1     1500   0      - no statistics available -                            BMRU
          ---  TESTING MTU Size  done --- 
--> MTU size is 1500  ( MTU28 = MTU - 28 = 1500 - 28 = 1472 )
    Now change the variables MTU and MTU28 in  ./rac_net_testing.sh 
    MTU=1500
    MTU28=1472                  
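
A quick manual cross-check outside the script, assuming a 1500-byte MTU on the interconnect: read the MTU from sysfs and send an unfragmented full-size packet ( ping -M do prohibits fragmentation, so a wrong MTU shows up immediately ):

    for h in grac41 grac42 grac43; do
        echo "$h eth2 MTU: $(ssh $h cat /sys/class/net/eth2/mtu)"
    done
    ping -M do -s 1472 -c 2 192.168.2.102    # 1472 = 1500 - 28 bytes IP/ICMP headers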

Stage III: Explore IP addresses, broadcast address, netmask and device status
Verify script parameters: priv1 priv2 priv3
Explore script parameters: pub1 pub2 pub3
[root@grac41 NET]#   ./rac_net_testing.sh  -ipaddr
TESTING - Info Public Interfaces   
grac41.example.com
eth1      Link encap:Ethernet  HWaddr 08:00:27:89:E9:A2  
          inet addr:192.168.1.101  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
grac42.example.com
eth1      Link encap:Ethernet  HWaddr 08:00:27:63:08:07  
          inet addr:192.168.1.102  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
grac43.example.com
eth1      Link encap:Ethernet  HWaddr 08:00:27:F6:18:43  
          inet addr:192.168.1.103  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          ---  TESTING Public Interaces  done --- 
TESTING - Info Private Interfaces   
grac41.example.com
eth2      Link encap:Ethernet  HWaddr 08:00:27:6B:E2:BD  
          inet addr:192.168.2.101  Bcast:192.168.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
grac42.example.com
eth2      Link encap:Ethernet  HWaddr 08:00:27:DF:79:B9  
          inet addr:192.168.2.102  Bcast:192.168.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
grac43.example.com
eth2      Link encap:Ethernet  HWaddr 08:00:27:1C:30:DD  
          inet addr:192.168.2.103  Bcast:192.168.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          ---  TESTING Private Interaces  done --- 
--> The PRIVATE interface parameters should already be configured, but we can confirm the settings in rac_net_testing.sh again
      priv1=192.168.2.101
      priv2=192.168.2.102
      priv3=192.168.2.103
    For the PUBLIC interfaces the above output translates to 
      pub1=192.168.1.101
      pub2=192.168.1.102
      pub3=192.168.1.103
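
The addresses can also be pulled out of the ifconfig output in one go; a small sketch assuming the output format shown above:

    # Collect the IPv4 address of the public (eth1) and private (eth2) interface per node
    for h in grac41 grac42 grac43; do
        for dev in eth1 eth2; do
            ip=$(ssh $h "/sbin/ifconfig $dev" | awk '/inet addr:/ { sub("addr:","",$2); print $2 }')
            echo "$h $dev $ip"
        done
    done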

Stage IV: Explore GNS and retrieve VIP hostnames and SCAN VIP addresses 
Verify  script parameters: scan1 scan2 scan3 
Explore script parameters: vip1  vip2  vip3
[root@grac41 NET]# ./rac_net_testing.sh  -gns
TESTING GNS  
GNS is enabled.
GNS is listening for DNS server requests on port 53
GNS is using port 5353 to connect to mDNS
GNS status: OK
Domain served by GNS: grid4.example.com
GNS version: 11.2.0.4.0
GNS VIP network: ora.net1.network
Name            Type Value           Parameters set in  ./rac_net_testing.sh
grac4-scan      A    192.168.1.171
grac4-scan      A    192.168.1.173
grac4-scan      A    192.168.1.251
grac4-scan1-vip A    192.168.1.171  --> scan1=192.168.1.171
grac4-scan2-vip A    192.168.1.251  --> scan2=192.168.1.251
grac4-scan3-vip A    192.168.1.173  --> scan3=192.168.1.173
grac41-vip      A    192.168.1.167  --> vip1=grac41-vip
grac42-vip      A    192.168.1.178  --> vip2=grac42-vip
grac43-vip      A    192.168.1.177  --> vip3=grac43-vip
          ---  TESTING GNS  done --
--> Now the script rac_net_testing.sh is configured and we can start network testing
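
Putting Stage I – Stage IV together, the configured parameter block in rac_net_testing.sh looks roughly like this ( a sketch using the values explored above; the exact syntax in the downloaded script may differ ):

    PUB_IF=eth1  ; PRIV_IF=eth2
    MTU=1500     ; MTU28=1472
    host1=grac41 ; pub1=192.168.1.101 ; priv1=192.168.2.101 ; vip1=grac41-vip
    host2=grac42 ; pub2=192.168.1.102 ; priv2=192.168.2.102 ; vip2=grac42-vip
    host3=grac43 ; pub3=192.168.1.103 ; priv3=192.168.2.103 ; vip3=grac43-vip
    scan1=192.168.1.171 ; scan2=192.168.1.251 ; scan3=192.168.1.173
    scan=grac4-scan ; fullscan=grac4-scan.grid4.example.com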

Collect static network data

Usage   
  # ./rac_net_testing.sh  -precheck_perf
  # ./rac_net_testing.sh  -precheck_perf 2>&1  |  tee  rac_pre_perf.TRC

[root@grac41 NET]#  ./rac_net_testing.sh  -precheck_perf 2>&1  |  tee  rac_pre_perf.TRC
*************************************************
*** Firewall should be disabled on all nodes  ***
*************************************************
grac41.example.com
iptables: Firewall is not running.
grac42.example.com
iptables: Firewall is not running.
grac43.example.com
iptables: Firewall is not running.
--> Status ok 
*******************************************************************************
*** netstat should report the following                                     ***
***  - MTU size sould be equal in all nodes                                 ***
***  - Network Devices should be up and running ( Flg: RU )                 ***
***  - Check statistics for RX/TX packets ( RX-ERR RX-DRP RX-OVR,...     )  ***
***  - Compare Broadcast and Netmask on all nodes                           ***
*******************************************************************************
grac41.example.com
Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth1       1500   0  2378871      1      0      0  2254772      0      0      0 BMRU
eth1:1     1500   0      - no statistics available -                            BMRU
eth1:3     1500   0      - no statistics available -                            BMRU
eth1:4     1500   0      - no statistics available -                            BMRU
eth2       1500   0 22782549   2431      0      0 17100522      0      0      0 BMRU
eth2:1     1500   0      - no statistics available -                            BMRU
--> Not too many errors - looks good - Flg RU means Running/Up


grac42.example.com
Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth1       1500   0   155655      0      0      0   142486      0      0      0 BMRU
eth1:1     1500   0      - no statistics available -                            BMRU
eth1:2     1500   0      - no statistics available -                            BMRU
eth2       1500   0  8488235      0      0      0  8962759      0      0      0 BMRU
eth2:1     1500   0      - no statistics available -                            BMRU
grac43.example.com
Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth1       1500   0    72237      0      0      0    53788      0      0      0 BMRU
eth1:1     1500   0      - no statistics available -                            BMRU
eth1:2     1500   0      - no statistics available -                            BMRU
eth2       1500   0  2839288      0      0      0  2781127      0      0      0 BMRU
eth2:1     1500   0      - no statistics available -                            BMRU
*************************************************************************
*** 11.2 RAC manual suggest following search order: hosts:  dns files ***
*************************************************************************
grac41.example.com
hosts:      files dns
grac42.example.com
hosts:      files dns
grac43.example.com
hosts:      files dns
--> Order should be changed to: dns files 
*********************************************************
*** /etc/hosts should be consistent an all nodes files ***
*********************************************************
grac41.example.com
127.0.0.1   localhost localhost.localdomain 
192.168.1.101 grac41.example.com grac41

grac42.example.com
127.0.0.1   localhost localhost.localdomain 
192.168.1.102 grac42.example.com grac42

grac43.example.com
127.0.0.1   localhost localhost.localdomain 
192.168.1.103 grac43.example.com grac43
--> Even if you are using a DNS, Oracle recommends that you add lines to the /etc/hosts file on each node, 
    specifying the public IP addresses.
    127.0.0.1 should not map to the SCAN name or to the public, private, or VIP hostnames
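
A /etc/hosts file following that recommendation could look like this on every node ( a sketch using the public addresses of this cluster ):

    127.0.0.1       localhost localhost.localdomain
    192.168.1.101   grac41.example.com   grac41
    192.168.1.102   grac42.example.com   grac42
    192.168.1.103   grac43.example.com   grac43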

****************************************************************
*** /etc/resolv.conf should be consistent an all nodes files ***
****************************************************************
# Generated by NetworkManager
search example.com grid4.example.com de.oracle.com
nameserver 192.168.1.50
# Generated by NetworkManager
search example.com grid4.example.com de.oracle.com
nameserver 192.168.1.50
nameserver 192.135.82.44
nameserver 192.168.1.1
# Generated by NetworkManager
search example.com grid4.example.com de.oracle.com
nameserver 192.168.1.50
nameserver 192.135.82.44
nameserver 192.168.1.1
--> /etc/resolv.conf is not consistent - needs to be fixed 
**********************************************************************************
*** SCAN listner , SCAN VIPS and nslookup SCAN Info should be consistent       ***
***  - all SCAN VIPs should be ONLINE                                          ***
***  - for each IP address returned from nslookup for SCAN NAME  - there       ***
***    should be SCAN VIP in status ONLINE                                     ***
***  - as a first test ping all IP addresss  returned from nslookup SCAN NAME  ***
**********************************************************************************
SCAN name: grac4-scan.grid4.example.com, Network: 1/192.168.1.0/255.255.255.0/eth1
SCAN VIP name: scan1, IP: /grac4-scan.grid4.example.com/192.168.1.171
SCAN VIP name: scan2, IP: /grac4-scan.grid4.example.com/192.168.1.251
SCAN VIP name: scan3, IP: /grac4-scan.grid4.example.com/192.168.1.173
SCAN Listener LISTENER_SCAN1 exists. Port: TCP:1521
SCAN Listener LISTENER_SCAN2 exists. Port: TCP:1521
SCAN Listener LISTENER_SCAN3 exists. Port: TCP:1521
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is running on node grac43
SCAN Listener LISTENER_SCAN2 is enabled
SCAN listener LISTENER_SCAN2 is running on node grac41
SCAN Listener LISTENER_SCAN3 is enabled
SCAN listener LISTENER_SCAN3 is running on node grac42

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.23.rc1.el6_5.1 <<>> grac4-scan.grid4.example.com +noall +answer
;; global options: +cmd
grac4-scan.grid4.example.com. 11 IN    A    192.168.1.251
grac4-scan.grid4.example.com. 11 IN    A    192.168.1.171
grac4-scan.grid4.example.com. 11 IN    A    192.168.1.173
--> DNS zone delegation is working 

$ nslookup grac4-scan
Server:        192.168.1.50
Address:    192.168.1.50#53

Non-authoritative answer:
Name:    grac4-scan.grid4.example.com
Address: 192.168.1.171
Name:    grac4-scan.grid4.example.com
Address: 192.168.1.173
Name:    grac4-scan.grid4.example.com
Address: 192.168.1.251
--> SCAN address resolved by DNS and GNS using zone delegation 

**********************************************************************************************************
*** For further Info please read:                                                                      ***
***   How to Validate Network and Name Resolution Setup for the Clusterware and RAC (Doc ID 1054902.1) ***
**********************************************************************************************************

Ping all public nodenames from the local public IP with packet size of MTU

#   ./rac_net_testing.sh -pingpubip | egrep 'TESTING|EXECUTE|SUCCESS|ERROR'
TESTING : Ping all public nodenames from the local public IP with packet size of 1500 bytes on node: grac41 
EXECUTE Command - ssh grac41 "/bin/ping -s 1500 -c 2 -I 192.168.1.101 grac41"
SUCCESS Command - ssh grac41 "/bin/ping -s 1500 -c 2 -I 192.168.1.101 grac41" - : Status 0
EXECUTE Command - ssh grac41 "/bin/ping -s 1500 -c 2 -I 192.168.1.101 grac41"
SUCCESS Command - ssh grac41 "/bin/ping -s 1500 -c 2 -I 192.168.1.101 grac41" - : Status 0
....
SUCCESS Command - ssh grac43 "/bin/ping -s 1500 -c 2 -I 192.168.1.103 grac43" - : Status 0
EXECUTE Command - ssh grac43 "/bin/ping -s 1500 -c 2 -I 192.168.1.103 grac43"
SUCCESS Command - ssh grac43 "/bin/ping -s 1500 -c 2 -I 192.168.1.103 grac43" - : Status 0
          ---  TESTING public nodenames from the local public IP on node grac43 done ---
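
Internally -pingpubip is essentially a double loop over the configured nodes; a minimal sketch of the idea ( not the exact implementation ):

    for s in 1 2 3; do
        node=$(eval echo \$host$s) ; src=$(eval echo \$pub$s)
        for t in 1 2 3; do
            target=$(eval echo \$host$t)
            ssh $node "/bin/ping -s $MTU -c 2 -I $src $target"
        done
    done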

Ping all private IPs from the local private IP with packet size of MTU

[root@grac41 NET]# ./rac_net_testing.sh -pingprivip  | egrep 'TESTING|EXECUTE|SUCCESS|ERROR'
TESTING Ping all private IP(s) from all local private IP(s) with packet size of 1500 bytes: 192.168.2.101  
EXECUTE Command - ssh grac41 "/bin/ping -s 1500 -c 2 -I 192.168.2.101 192.168.2.101"
 SUCCESS Command - ssh grac41 "/bin/ping -s 1500 -c 2 -I 192.168.2.101 192.168.2.101" - : Status 0
EXECUTE Command - ssh grac41 "/bin/ping -s 1500 -c 2 -I 192.168.2.101 192.168.2.101"
SUCCESS Command - ....
SUCCESS Command - ssh grac42 "/bin/ping -s 1500 -c 2 -I 192.168.2.102 192.168.2.103" - : Status 0
EXECUTE Command - ssh grac43 "/bin/ping -s 1500 -c 2 -I 192.168.2.103 192.168.2.103"
SUCCESS Command - ssh grac43 "/bin/ping -s 1500 -c 2 -I 192.168.2.103 192.168.2.103" - : Status 0
EXECUTE Command - ssh grac43 "/bin/ping -s 1500 -c 2 -I 192.168.2.103 192.168.2.103"
SUCCESS Command - ssh grac43 "/bin/ping -s 1500 -c 2 -I 192.168.2.103 192.168.2.103" - : Status 0
          ---  TESTING  Private IP  done ---

Traceroute PRIVATE network: Size MTU-28 = 1472 bytes

# ./rac_net_testing.sh -traceroute
***********************************************************************************************
*** TESTING Traceroute PRIVATE network                                                      ***
***  - MTU size packet traceroute complete in 1 hop without going through the routing table ***
***  - For MTU size 1500 traceroute packets should be MTU-28=1472 bytes                     ***
***********************************************************************************************

EXECUTE Command - - ssh grac41 "/bin/traceroute -s 192.168.2.101 -r -F 192.168.2.101 1472"
traceroute to 192.168.2.101 (192.168.2.101), 30 hops max, 1472 byte packets
1  grac41int.example.com (192.168.2.101)  0.016 ms  0.005 ms  0.009 ms
SUCCESS Command - ssh grac41 "/bin/traceroute -s 192.168.2.101 -r -F 192.168.2.101 1472" - : Status 0

EXECUTE Command - - ssh grac42 "/bin/traceroute -s 192.168.2.102 -r -F 192.168.2.101 1472"
traceroute to 192.168.2.101 (192.168.2.101), 30 hops max, 1472 byte packets
1  grac41int.example.com (192.168.2.101)  0.523 ms  0.271 ms  0.192 ms
SUCCESS Command - ssh grac42 "/bin/traceroute -s 192.168.2.102 -r -F 192.168.2.101 1472" - : Status 0

EXECUTE Command - - ssh grac43 "/bin/traceroute -s 192.168.2.103 -r -F 192.168.2.101 1472"
traceroute to 192.168.2.101 (192.168.2.101), 30 hops max, 1472 byte packets
1  grac41int.example.com (192.168.2.101)  3.616 ms  3.529 ms  3.477 ms
SUCCESS Command - ssh grac43 "/bin/traceroute -s 192.168.2.103 -r -F 192.168.2.101 1472" - : Status 0

...
EXECUTE Command - - ssh grac43 "/bin/traceroute -s 192.168.2.103 -r -F 192.168.2.103 1472"
traceroute to 192.168.2.103 (192.168.2.103), 30 hops max, 1472 byte packets
1  grac43int.example.com (192.168.2.103)  0.017 ms  0.004 ms  0.004 ms
SUCCESS Command - ssh grac43 "/bin/traceroute -s 192.168.2.103 -r -F 192.168.2.103 1472" - : Status 0
---  TESTING  Traceroute PRIVATE network done ---
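
Since the hop count has to be checked manually ( see the -netall notes below ), a small helper that counts the hops per traceroute can flag problems; a sketch using the same parameters:

    for s in 1 2 3; do
        node=$(eval echo \$host$s) ; src=$(eval echo \$priv$s)
        for t in 1 2 3; do
            dest=$(eval echo \$priv$t)
            hops=$(ssh $node "/bin/traceroute -s $src -r -F $dest $MTU28" | grep -c '^ *[0-9]')
            [ "$hops" -gt 1 ] && echo "WARNING: $node -> $dest needed $hops hops"
        done
    done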

Testing Name Resolution

[root@grac41 NET]#  ./rac_net_testing.sh -nslookup
TESTING Name Resolution 
EXECUTE Command - ssh grac41 "/usr/bin/nslookup grac4-scan "
Server:        192.168.1.50
Address:    192.168.1.50#53
Non-authoritative answer:
Name:    grac4-scan.grid4.example.com
Address: 192.168.1.173
Name:    grac4-scan.grid4.example.com
Address: 192.168.1.251
Name:    grac4-scan.grid4.example.com
Address: 192.168.1.171
SUCCESS Command - ssh grac41 "/usr/bin/nslookup grac4-scan" - : Status 0

EXECUTE Command - ssh grac41 "/usr/bin/nslookup grac41-vip "
Server:        192.168.1.50
Address:    192.168.1.50#53
Non-authoritative answer:
Name:    grac41-vip.grid4.example.com
Address: 192.168.1.167
SUCCESS Command - ssh grac41 "/usr/bin/nslookup grac41-vip" - : Status 0

...
EXECUTE Command - ssh grac43 "/usr/bin/nslookup grac43-vip "
Server:        192.168.1.50
Address:    192.168.1.50#53
Non-authoritative answer:
Name:    grac43-vip.grid4.example.com
Address: 192.168.1.177
SUCCESS Command - ssh grac43 "/usr/bin/nslookup grac43-vip" - : Status 0
          ---  TESTING  Name Resolution done ---

Testing VIP connectivity – create and solve a VIP-related problem

Create Test Scenario - Stop a VIP on node grac43
[root@grac41 Desktop]#  srvctl stop vip -n grac43 -f

Test VIP status using  ./rac_net_testing.sh
[root@grac41 NET]#  ./rac_net_testing.sh  -vip  | egrep 'TESTING|EXECUTE|SUCCESS|ERROR' 
TESTING VIP   
EXECUTE Command - ssh grac41 "/bin/ping -c 2 grac41-vip "
SUCCESS Command - ssh grac41 "/bin/ping -c 2 grac41-vip" - : Status 0
EXECUTE Command - ssh grac41 "/bin/ping -c 2 grac42-vip "
SUCCESS Command - ssh grac41 "/bin/ping -c 2 grac42-vip" - : Status 0
EXECUTE Command - ssh grac41 "/bin/ping -c 2 grac43-vip "
ERROR:: Command - ssh grac41 "/bin/ping -c 2 grac43-vip " - failed: Status 1
EXECUTE Command - ssh grac42 "/bin/ping -c 2 grac41-vip "
SUCCESS Command - ssh grac42 "/bin/ping -c 2 grac41-vip" - : Status 0
EXECUTE Command - ssh grac42 "/bin/ping -c 2 grac42-vip "
SUCCESS Command - ssh grac42 "/bin/ping -c 2 grac42-vip" - : Status 0
EXECUTE Command - ssh grac42 "/bin/ping -c 2 grac43-vip "
ERROR:: Command - ssh grac42 "/bin/ping -c 2 grac43-vip " - failed: Status 1
EXECUTE Command - ssh grac43 "/bin/ping -c 2 grac41-vip "
SUCCESS Command - ssh grac43 "/bin/ping -c 2 grac41-vip" - : Status 0
EXECUTE Command - ssh grac43 "/bin/ping -c 2 grac42-vip "
SUCCESS Command - ssh grac43 "/bin/ping -c 2 grac42-vip" - : Status 0
EXECUTE Command - ssh grac43 "/bin/ping -c 2 grac43-vip "
ERROR:: Command - ssh grac43 "/bin/ping -c 2 grac43-vip " - failed: Status 1
--> From all nodes  grac43-vip is not reachable

Verify Clusterware (CW) status
[root@grac41 NET]# crs
NAME                           TARGET     STATE           SERVER       STATE_DETAILS   
-------------------------      ---------- ----------      ------------ ------------------
ora.grac41.vip                 ONLINE     ONLINE          grac41        
ora.grac42.vip                 ONLINE     ONLINE          grac42        
ora.grac43.vip                 OFFLINE    OFFLINE                 
..
ora.LISTENER.lsnr              ONLINE     ONLINE          grac41        
ora.LISTENER.lsnr              ONLINE     ONLINE          grac42        
ora.LISTENER.lsnr              OFFLINE    OFFLINE         grac43 

--> ora.grac43.vip OFFLINE and ora.LISTENER.lsnr on grac43 OFFLINE

FIX: Start ora.grac43.vip  resource and local listener : ora.LISTENER.lsnr
   # srvctl start vip -n grac43
   # srvctl start listener -n grac43

Testing SCAN VIP connectivity – create and solve a SCAN VIP-related problem

Create Test Scenario - Stop SCAN VIP on node grac43
[root@grac41 Desktop]# srvctl stop scan -i 3 -f

Test SCAN VIP status running  ./rac_net_testing.sh
[root@grac41 NET]#   ./rac_net_testing.sh  -scan  | egrep 'TESTING|EXECUTE|SUCCESS|ERROR' 
TESTING SCAN   
EXECUTE Command - ssh grac41 "/bin/ping -s 1500 -c 2 -I 192.168.1.101 192.168.1.171"
SUCCESS Command - ssh grac41 "/bin/ping -s 1500 -c 2 -I 192.168.1.101 192.168.1.171" - : Status 0
EXECUTE Command - ssh grac41 "/bin/ping -s 1500 -c 2 -I 192.168.1.101 192.168.1.171"
SUCCESS Command - ssh grac41 "/bin/ping -s 1500 -c 2 -I 192.168.1.101 192.168.1.171" - : Status 0
..
ERROR:: Command failed - ssh grac41 "/bin/ping -s 1500 -c 2 -I 192.168.1.101 192.168.1.173" - failed: Status 1
EXECUTE Command - ssh grac41 "/bin/ping -s 1500 -c 2 -I 192.168.1.101 192.168.1.173"
ERROR:: Command failed - ssh grac41 "/bin/ping -s 1500 -c 2 -I 192.168.1.101 192.168.1.173" - failed: Status 1
EXECUTE Command - ssh grac42 "/bin/ping -s 1500 -c 2 -I 192.168.1.102 192.168.1.171"
SUCCESS Command - ssh grac42 "/bin/ping -s 1500 -c 2 -I 192.168.1.102 192.168.1.171" - : Status 0
EXECUTE Command - ssh grac42 "/bin/ping -s 1500 -c 2 -I 192.168.1.102 192.168.1.171"
SUCCESS Command - ssh grac42 "/bin/ping -s 1500 -c 2 -I 192.168.1.102 192.168.1.171" - : Status 0
EXECUTE..
ERROR:: Command failed - ssh grac43 "/bin/ping -s 1500 -c 2 -I 192.168.1.103 192.168.1.173" - failed: Status 1
EXECUTE Command - ssh grac43 "/bin/ping -s 1500 -c 2 -I 192.168.1.103 192.168.1.173"
ERROR:: Command failed - ssh grac43 "/bin/ping -s 1500 -c 2 -I 192.168.1.103 192.168.1.173" - failed: Status 1
          ---  TESTING  SCAN  done --- 
--> SCAN VIP 192.168.1.173 has problems - not reachable from any node!

Verify CW status
[root@grac41 NET]# crs
NAME                           TARGET     STATE           SERVER       STATE_DETAILS   
-------------------------      ---------- ----------      ------------ ------------------
..
ora.scan1.vip                  ONLINE     ONLINE          grac41        
ora.scan2.vip                  ONLINE     ONLINE          grac43        
ora.scan3.vip                  OFFLINE    OFFLINE 
..
ora.LISTENER_SCAN1.lsnr        ONLINE     ONLINE          grac43        
ora.LISTENER_SCAN2.lsnr        ONLINE     ONLINE          grac41        
ora.LISTENER_SCAN3.lsnr        ONLINE     OFFLINE 
--> ora.scan3.vip  and   ora.LISTENER_SCAN3.lsnr  are OFFLINE


[root@grac41 NET]#  srvctl status  scan
SCAN VIP scan1 is enabled
SCAN VIP scan1 is running on node grac41
SCAN VIP scan2 is enabled
SCAN VIP scan2 is running on node grac43
SCAN VIP scan3 is enabled
SCAN VIP scan3 is not running
[root@grac41 NET]#  srvctl status scan_listener
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is running on node grac41
SCAN Listener LISTENER_SCAN2 is enabled
SCAN listener LISTENER_SCAN2 is running on node grac43
SCAN Listener LISTENER_SCAN3 is enabled
SCAN listener LISTENER_SCAN3 is not running

Identify the failing IP address in OCR
[root@grac43 ~]# crsctl status resource ora.scan3.vip -f
NAME=ora.scan3.vip
TYPE=ora.scan_vip.type
STATE=OFFLINE
TARGET=OFFLINE
..
SCAN_NAME=grac4-scan.grid4.example.com
USR_ORA_VIP=192.168.1.173

FIX: Start SCAN VIP and SCAN Listener  - starting SCAN VIP also starts SCAN Listener 
     # srvctl start scan -i 3
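
Afterwards verify the fix; a short sketch ( expected messages paraphrased ):

   # srvctl status scan -i 3                  ( expect: SCAN VIP scan3 is running on node ... )
   # srvctl status scan_listener -i 3         ( expect: SCAN listener LISTENER_SCAN3 is running on node ... )
   # ./rac_net_testing.sh -scan | grep ERROR  ( expect: no ERROR:: lines anymore )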

Run script ./rac_net_testing.sh with the -netall option for repeated network connectivity testing

  • The -netall section is the heart of the script
After you have configured the script ( Stage I - Stage IV ) you can configure the -netall section.
This section runs ping, traceroute and nslookup commands and can be used to rerun these tests:
 - you can add/remove options from the -netall section 
 - you can add/remove grep options to limit/expand the output 
 - the $runcount and $sleeptime parameters control how often the -netall section repeats and the pause between runs
      runcount=3
      sleeptime=1
Usage   
  # ./rac_net_testing.sh  -netall   
  # ./rac_net_testing.sh  -netall 2>&1  |  tee  rac_net_testing.TRC
  For a quick review you can check  rac_net_testing.TRC - but problems like too many hops need a manual review
  # grep ERROR rac_net_testing.TRC
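
For a longer monitoring window you can raise runcount and sleeptime and let the script run in the background; a sketch ( not part of the script itself ):

  # Example: runcount=60, sleeptime=60 in rac_net_testing.sh -> roughly one test cycle per minute for an hour
  # nohup ./rac_net_testing.sh -netall > rac_net_testing.TRC 2>&1 &
  # tail -f rac_net_testing.TRC | grep --line-buffered ERROR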

Script Details :
    ...
    elif [ "$arg" == "-netall" ]; then
      for (( i=1; i<=$runcount; i++ ))
           do
                echo "***** RUN : $i ( Run count: $runcount ) ***** "
                run_test_ipaddr
                run_test_pingpubip   |  egrep 'TESTING|EXECUTE|SUCCESS|ERROR'  # Use egrep to limit ping Output
                run_test_pingprivip  |  egrep 'TESTING|EXECUTE|SUCCESS|ERROR'  # Use egrep to limit ping Output
                run_test_traceroute                                            # Don't use egrep here as we need to check the hops
                run_test_vip         |  egrep 'TESTING|EXECUTE|SUCCESS|ERROR'  # Use egrep to limit ping Output
                run_test_gns
                run_test_scan        |  egrep 'TESTING|EXECUTE|SUCCESS|ERROR'  # Use egrep to limit ping Output
                run_test_nslookup    |  egrep 'TESTING|EXECUTE|SUCCESS|ERROR'  # Use egrep to limit nslookup  Output
                # echo "***** DONE RUN : $i  ***** "
                sleep $sleeptime
           done
    else
    ..
Output from a successful run of ./rac_net_testing.sh -netall on a 3-node cluster.

Error Handling

The script should return ERROR:: for most failed commands and also print the failing command:
    ERROR:: Command - ssh grac43 "/usr/bin/nslookup grac43-vip " - failed: Status 1

After getting an error, run the printed command standalone to get more error details:
   #  ssh grac43 "/usr/bin/nslookup grac43-vip " 
       ;; Got SERVFAIL reply from 192.168.1.50, trying next server
       ;; connection timed out; trying next origin
       Server:                192.168.1.50
       Address:       192.168.1.50#53
       ** server can't find grac43-vip: NXDOMAIN

To get a quick overview of all potential errors you may run:
  # ./rac_net_testing.sh  -netall 2>&1  |  tee  rac_net_testing.TRC 
  # grep ERROR rac_net_testing.TRC
       ERROR:: Command - ssh grac41 "/usr/bin/nslookup grac43-vip " - failed: Status 1
       ERROR:: Command - ssh grac42 "/usr/bin/nslookup grac43-vip " - failed: Status 1        
       ERROR:: Command - ssh grac43 "/usr/bin/nslookup grac43-vip " - failed: Status 1
  --> As said, for further debugging run the commands printed after the ERROR:: label
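
The EXECUTE/SUCCESS/ERROR lines above suggest a small command wrapper inside the script; a hypothetical sketch of how such a wrapper could look ( the function name run_cmd is made up, not necessarily the author's implementation ):

    run_cmd ()
    {
        echo "EXECUTE Command - $1"
        eval "$1"
        local rc=$?
        if [ $rc -eq 0 ]; then
            echo "SUCCESS Command - $1 - : Status $rc"
        else
            echo "ERROR:: Command - $1 - failed: Status $rc"
        fi
    }
    # Usage: run_cmd 'ssh grac43 "/usr/bin/nslookup grac43-vip"'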

Multicast requirements

Reference
