What OS process is stopping my resource ( like services ) ?

Overview:

  • This article provides some background info how to get some info what process stops a RAC service
  • If you are lucky you can use OSWatcher output or any OS monitoring tool to track the status change
  • CLIENT_PID:    TextMessage[9176] not always reports a still running OS process

Tracing a srvctl stop service command


Verify our service status 
[root@grac41 Desktop]# crs | egrep "NAME|---|grac41.svc"
NAME                           TARGET     STATE           SERVER       STATE_DETAILS   
-------------------------      ---------- ----------      ------------ ------------------
ora.grac4.grac41.svc           ONLINE     ONLINE          grac41   

Find process leader running the srvctl stop service operation
[oracle@grac41 ~]$ ps
  PID TTY          TIME CMD
12495 pts/5    00:00:00 bash

Stop the service 
[oracle@grac41 ~]$ strace -f -o s.trc srvctl stop service -d grac4 -s grac41
--> crsd.log:
2014-11-08 08:21:03.373: [UiServer][1505679104]{3:51776:993} Container [ Name: UI_STOP
    API_HDR_VER: 
    TextMessage[2]
    CLIENT: 
    TextMessage[]
    CLIENT_NAME: 
    TextMessage[/usr/bin/java]
    CLIENT_PID: 
    TextMessage[15151]
    CLIENT_PRIMARY_GROUP: 
    TextMessage[oinstall]
..
What is going on here ? 
Why is /usr/bin/java reported to stop the service ?
We expect the shell/srvctl command as CLIENT_PID

Lets monitor with pstree the operation
# pstree -p 12495
bash(12495)───strace(15119)───srvctl(15124)───java(15151)─┬─{java}(15157)
                                                          ├─{java}(15158)
                                                          ├─{java}(15159)
                                                          ├─{java}(15160)
                                                          ├─{java}(15161)
                                                          ├─{java}(15162)
                                                          ├─{java}(15163)
                                                          ├─{java}(15164)
                                                          ├─{java}(15165)
                                                          └─{java}(15300)

During the service start we have the following process stack: bash --> svrct --> java program
The first java process/thread is reported as CLIENT_PID :   java(15151)

Lets have a look at our strace output
15124 execve("/u01/app/oracle/product/11204/racdb/bin/srvctl", ["srvctl", "stop", "service", "-d", "grac4", "-s", "grac41"], [/* 26 vars */]) = 0
..
15124 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fd8713d69d0) = 15151
..
15151 execve("/u01/app/oracle/product/11204/racdb/jdk/bin/java", ["/u01/app/oracle/product/11204/ra"..., "-DORACLE_HOME=/u01/app/oracle/pr"..., "-classpath", "/u01/app/oracle/product/11204/ra"..., "-Djava.net.preferIPv4Stack=true", "-Djava.util.logging.config.file="..., "oracle.ops.opsctl.OPSCTLDriver", "stop", "service", "-d", "grac4", "-s", "grac41"], [/* 28 vars */]) = 0
--> Here we can see that this java program is stopping the service 

Note it will be not easy to track the bash shell/srvctl/java process chain with OSWatcher. 
crsd.log will show us the java PID but getting a link vi PPID to the bash shell will not be 
that easy as the srvctl command very like has terminated and you will not be able to track down
to ths root process. Only if you're lucky and the OSWachter ps program tracks the java PID 
as reported by crsd-log (  CLIENT_PID: TextMessage[15151] ) you will be able to track down
to the related OS process by using PPID .

Tacking an resource stop command via Oracle process ( tnslsnr )

If an Oracle process stops a resource things are reported much easier listener resource
2014-11-07 19:52:53.277: [UiServer][1505679104]{3:51776:81} Container [ Name: UI_START
        API_HDR_VER:
        TextMessage[2]
        ASYNC_TAG:
        TextMessage[1]
        CLIENT:
        TextMessage[]
        CLIENT_NAME:
        TextMessage[tnslsnr]
        CLIENT_PID:
        TextMessage[9176]
        CLIENT_PRIMARY_GROUP:
        TextMessage[oinstall]
        EVENT_TAG:
        TextMessage[1]
        HOST:
        TextMessage[grac41]
        HOST_TAG:
        TextMessage[1]
        LOCALE:
        TextMessage[AMERICAN_AMERICA.AL32UTF8]
        NO_WAIT_TAG:
        TextMessage[1]
        RESOURCE:
        TextMessage[ora.LISTENER.lsnr^AUSR_ORA_OPI=true]
]

--> tnslsnr is start the ora.LISTENER.lsnr the group is oinstall - the PID is  9176 
[root@grac41 grac41]# pstree -p  9176
tnslsnr(9176)─┬─{tnslsnr}(9177)
              └─{tnslsnr}(9178)
[root@grac41 grac41]# ps -elf  |grep  9176
0 S grid      9176     1  0  80   0 - 54090 poll_s Nov07 ?        00:00:00 /u01/app/11204/grid/bin/tnslsnr LISTENER -inherit

Tracking a stop service request by using TINTs

Hosts:
grac41 : running srvctl command, resource home 
grac42 : PE Master ,  

Generic flow
grac41: 
  UiServer reports UI_STOP on grac41 ( this is the first message for our TINT )
  UiServer Sending message to PE - Client PID: 15151 ( Here we know the client PID )
  [    AGFW][1518286592]{3:51776:993} Agfw Proxy Server sending the reply to PE for message:RESOURCE_STOP[ora.grac4.grac41.svc 1 1] ID 4099:47658
   --> Sending message to the PE master 
grac42:
  CRSPE][4141836032]{3:51776:993} Processing PE command id=1049. Description: [Stop Resource : ..
  [   CRSPE][4141836032]{3:51776:993} RI [ora.grac4.grac41.svc 1 1] new target state: [OFFLINE] old value: [ONLINE]
  [   CRSPE][4141836032]{3:51776:993} RI [ora.grac4.grac41.svc 1 1] new internal state: [STOPPING] old value: [STABLE]
   --> Changing internal state   
  [   CRSPE][4141836032]{3:51776:993} Sending message to agfw: id = 47658
  [   CRSPE][4141836032]{3:51776:993} CRS-2673: Attempting to stop 'ora.grac4.grac41.svc' on 'grac41'

grac41:
  [    AGFW][1518286592]{3:51776:993} Received the reply to the message: RESOURCE_STOP[ora.grac4.grac41.svc 1 1] ID 4099:10786 from the agent /u01/app/11204/grid/bin/oraagent_oracle
  [    AGFW][1518286592]{3:51776:993} Agfw Proxy Server sending the reply to PE for message:RESOURCE_STOP[ora.grac4.grac41.svc 1 1] ID 4099:47658
  [    AGFW][1518286592]{3:51776:993} Received the reply to the message: RESOURCE_STOP[ora.grac4.grac41.svc 1 1] ID 4099:10786 from the agent /u01/app/11204/grid/bin/oraagent_oracle
  [    AGFW][1518286592]{3:51776:993} Agfw Proxy Server sending the last reply to PE for message:RESOURCE_STOP[ora.grac4.grac41.svc 1 1] ID 4099:47658
   --> Agent on grac41 stops the resource 
grac42:
  [   CRSPE][4141836032]{3:51776:993} Received reply to action [Stop] message ID: 47658
  [   CRSPE][4141836032]{3:51776:993} RI [ora.grac4.grac41.svc 1 1] new internal state: [STABLE] old value: [STOPPING]
  [   CRSPE][4141836032]{3:51776:993} RI [ora.grac4.grac41.svc 1 1] new external state [OFFLINE] old value: [ONLINE] label = []
   --> PE changes its internal state 
  [  CRSRPT][4139734784]{3:51776:993} Published to EVM CRS_RESOURCE_STATE_CHANGE for ora.grac4.grac41.svc
   --> PE publish the new service state 
  [   CRSPE][4141836032]{3:51776:993} CRS-2677: Stop of 'ora.grac4.grac41.svc' on 'grac41' succeeded
  [   CRSPE][4141836032]{3:51776:993} PE Command [ Stop Resource : 0x7fa008168b70 ] has completed
  [   CRSPE][4141836032]{3:51776:993} UI Command [Stop Resource : 0x7fa008168b70] is replying to sender. 
    --> Informs UIServer about successfull status change 
grac41:
 [    AGFW][1518286592]{3:51776:993} Agfw Proxy Server received the message: CMD_COMPLETED[Proxy] ID 20482:47662
 [    AGFW][1518286592]{3:51776:993} Agfw Proxy Server replying to the message: CMD_COMPLETED[Proxy] ID 20482:47662
 [UiServer][1505679104]{3:51776:993} Container [ Name: UI_DATA
 [UiServer][1505679104]{3:51776:993} Done for ctx=0x7f346c0276f0
  --> UIServer get confirmation that its STOP service command is completed

Leave a Reply

Your email address will not be published. Required fields are marked *