最近RAC 的case实在太多,越来越觉得公司逐渐去rac的正确。一套sunOS的系统 在安装完crs 10g之后 服务器不停的自动reboot,具体环境如下:

root@bmsa#showrev 
Hostname: bmsa
Hostid: 84f94303
Release: 5.10
Kernel architecture: sun4v
Application architecture: sparc
Hardware provider: Oracle Corporation
Domain: 
Kernel version: SunOS 5.10 Generic_147440-25

root@bmsb#psrinfo -v 
Status of virtual processor 0 as of: 10/18/2012 00:05:32
  on-line since 10/17/2012 23:46:06.
  The sparcv9 processor operates at 2998 MHz,
        and has a sparcv9 floating point processor.
.....

Status of virtual processor 31 as of: 10/18/2012 00:05:32
  on-line since 10/17/2012 23:46:07.
  The sparcv9 processor operates at 2998 MHz,
        and has a sparcv9 floating point processor. 

root@bmsb# /usr/sbin/prtconf | grep "Memory size"
Memory size: 32768 Megabytes 


-bash-3.2$ crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.bmsa.gsd   application    ONLINE    ONLINE    bmsa        
ora.bmsa.ons   application    ONLINE    ONLINE    bmsa        
ora.bmsa.vip   application    ONLINE    ONLINE    bmsa        
ora.bmsb.gsd   application    ONLINE    ONLINE    bmsb        
ora.bmsb.ons   application    ONLINE    ONLINE    bmsb        
ora.bmsb.vip   application    ONLINE    ONLINE    bmsb        




-bash-3.2$ crsctl query crs activeversion
CRS active version on the cluster is [10.2.0.1.0]
-bash-3.2$ crsctl query crs softwareversion
CRS software version on node [bmsa] is [10.2.0.1.0]

具体现象为5-10分钟 两个节点会自动reboot,除了cssd均无报错日志,oprocd 无相关日志,syslog无相关日志, ocssd.log如下:

CSSD]2012-10-17 23:40:23.955 >USER:    Oracle Database 10g CSS Release 10.2.0.1.0 Production Copyright 1996, 2004 Oracle.  All rights reserved.
[    CSSD]2012-10-17 23:40:23.955 >USER:    CSS daemon log for node bmsa, number 1, in cluster crs
[  clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=bmsaDBG_CSSD))
[    CSSD]2012-10-17 23:40:23.959 [1] >TRACE:   clssscmain: local-only set to false
[    CSSD]2012-10-17 23:40:23.966 [1] >TRACE:   clssnmReadNodeInfo: added node 1 (bmsa) to cluster
[    CSSD]2012-10-17 23:40:23.970 [1] >TRACE:   clssnmReadNodeInfo: added node 2 (bmsb) to cluster
[    CSSD]2012-10-17 23:40:23.973 [1] >TRACE:   clssgmInitCMInfo: Wait for remote node teermination set to 13 seconds
[    CSSD]2012-10-17 23:40:23.975 [5] >TRACE:   clssnm_skgxnmon: skgxn init failed, rc 1
[    CSSD]2012-10-17 23:40:23.975 [1] >TRACE:   clssnm_skgxnonline: Using vacuous skgxn monitor
[    CSSD]2012-10-17 23:40:23.975 [1] >TRACE:   clssnmInitNMInfo: misscount set to 30
[    CSSD]2012-10-17 23:40:23.981 [1] >TRACE:   clssnmDiskStateChange: state from 1 to 2 disk (0//dev/rdsk/c3t60A98000572D5A70614A6E3370535749d0s6)
[    CSSD]2012-10-17 23:40:23.983 [1] >TRACE:   clssnmDiskStateChange: state from 1 to 2 disk (1//dev/rdsk/c3t60A98000572D5A70614A6E337053574Fd0s6)
[    CSSD]2012-10-17 23:40:23.984 [1] >TRACE:   clssnmDiskStateChange: state from 1 to 2 disk (2//dev/rdsk/c3t60A98000572D5A70614A6E3370535751d0s6)
[    CSSD]2012-10-17 23:40:25.984 [6] >TRACE:   clssnmDiskStateChange: state from 2 to 4 disk (0//dev/rdsk/c3t60A98000572D5A70614A6E3370535749d0s6)
[    CSSD]2012-10-17 23:40:25.985 [7] >TRACE:   clssnmDiskStateChange: state from 2 to 4 disk (1//dev/rdsk/c3t60A98000572D5A70614A6E337053574Fd0s6)
[    CSSD]2012-10-17 23:40:25.987 [8] >TRACE:   clssnmDiskStateChange: state from 2 to 4 disk (2//dev/rdsk/c3t60A98000572D5A70614A6E3370535751d0s6)
[    CSSD]2012-10-17 23:40:25.991 [6] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(13) LATS(0) Disk lastSeqNo(13)
[    CSSD]2012-10-17 23:40:25.992 [7] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(13) LATS(0) Disk lastSeqNo(13)
[    CSSD]2012-10-17 23:40:25.992 [8] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(13) LATS(0) Disk lastSeqNo(13)
[    CSSD]2012-10-17 23:40:26.055 [1] >TRACE:   clssnmFatalInit: fatal mode enabled
[    CSSD]2012-10-17 23:40:26.066 [10] >TRACE:   clssnmconnect: connecting to node 1, flags 0x0001, connector 1
[    CSSD]2012-10-17 23:40:26.072 [10] >TRACE:   clssnmconnect: connecting to node 0, flags 0x0000, connector 1
[    CSSD]2012-10-17 23:40:26.072 [10] >TRACE:   clssnmClusterListener: Probing node(2)
[    CSSD]2012-10-17 23:40:26.076 [11] >TRACE:   clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=Oracle_CSS_LclLstnr_crs_1))
[    CSSD]2012-10-17 23:40:26.076 [11] >TRACE:   clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_bmsa_crs))
[    CSSD]2012-10-17 23:40:26.078 [15] >TRACE:   clssnmPollingThread: Connection complete
[    CSSD]2012-10-17 23:40:26.078 [16] >TRACE:   clssnmSendingThread: Connection complete
[    CSSD]2012-10-17 23:40:26.078 [17] >TRACE:   clssnmRcfgMgrThread: Connection complete
[    CSSD]2012-10-17 23:40:26.078 [17] >TRACE:   clssnmRcfgMgrThread: Local Join
[    CSSD]2012-10-17 23:40:26.078 [17] >TRACE:   clssnmDoSyncUpdate: Initiating sync 1
[    CSSD]2012-10-17 23:40:26.078 [17] >TRACE:   clssnmSetupAckWait: Ack message type (11) 
[    CSSD]2012-10-17 23:40:26.078 [17] >TRACE:   clssnmSetupAckWait: node(1) is ALIVE
[    CSSD]2012-10-17 23:40:26.078 [17] >TRACE:   clssnmSendSync: syncSeqNo(1)
[    CSSD]2012-10-17 23:40:26.078 [17] >TRACE:   clssnmWaitForAcks: Ack message type(11), ackCount(1)
[    CSSD]2012-10-17 23:40:26.078 [10] >TRACE:   clssnmHandleSync: Acknowledging sync: src[1] srcName[bmsa] seq[1] sync[1]
[    CSSD]2012-10-17 23:40:26.178 [1] >USER:    NMEVENT_SUSPEND [00][00][00][00]
[    CSSD]2012-10-17 23:40:27.078 [17] >TRACE:   clssnmWaitForAcks: done, msg type(11)
[    CSSD]2012-10-17 23:40:27.078 [17] >TRACE:   clssnmSetupAckWait: Ack message type (13) 
[    CSSD]2012-10-17 23:40:27.078 [17] >TRACE:   clssnmSetupAckWait: node(1) is ACTIVE
[    CSSD]2012-10-17 23:40:27.078 [17] >TRACE:   clssnmSendVote: syncSeqNo(1)
[    CSSD]2012-10-17 23:40:27.078 [17] >TRACE:   clssnmWaitForAcks: Ack message type(13), ackCount(1)
[    CSSD]2012-10-17 23:40:27.079 [10] >TRACE:   clssnmSendVoteInfo: node(1) syncSeqNo(1)
[    CSSD]2012-10-17 23:40:28.079 [17] >TRACE:   clssnmWaitForAcks: done, msg type(13)
[    CSSD]2012-10-17 23:40:28.079 [17] >TRACE:   clssnmCheckDskInfo: Checking disk info...
[    CSSD]2012-10-17 23:40:29.079 [17] >TRACE:   clssnmEvict: Start
[    CSSD]2012-10-17 23:40:29.079 [17] >TRACE:   clssnmWaitOnEvictions: Start
[    CSSD]2012-10-17 23:40:29.080 [17] >TRACE:   clssnmWaitOnEvictions: Node(0) down, LATS(0),timeout(-109570438)
[    CSSD]2012-10-17 23:40:29.080 [17] >TRACE:   clssnmWaitOnEvictions: Node(2) down, LATS(0),timeout(-109570438)
[    CSSD]2012-10-17 23:40:29.080 [17] >TRACE:   clssnmSetupAckWait: Ack message type (15) 
[    CSSD]2012-10-17 23:40:29.080 [17] >TRACE:   clssnmSetupAckWait: node(1) is ACTIVE
[    CSSD]2012-10-17 23:40:29.080 [17] >TRACE:   clssnmSendUpdate: syncSeqNo(1)
[    CSSD]2012-10-17 23:40:29.080 [17] >TRACE:   clssnmWaitForAcks: Ack message type(15), ackCount(1)
[    CSSD]2012-10-17 23:40:29.080 [10] >TRACE:   clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[    CSSD]2012-10-17 23:40:29.080 [10] >TRACE:   clssnmDeactivateNode: node 0 () left cluster

[    CSSD]2012-10-17 23:40:29.080 [10] >TRACE:   clssnmUpdateNodeState: node 1, state (2/2) unique (1350488423/1350488423) prevConuni(0) birth (1/1) (old/new)
[    CSSD]2012-10-17 23:40:29.080 [10] >TRACE:   clssnmUpdateNodeState: node 2, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[    CSSD]2012-10-17 23:40:29.080 [10] >TRACE:   clssnmDeactivateNode: node 2 (bmsb) left cluster

[    CSSD]2012-10-17 23:40:29.080 [10] >USER:    clssnmHandleUpdate: SYNC(1) from node(1) completed
[    CSSD]2012-10-17 23:40:29.080 [10] >USER:    clssnmHandleUpdate: NODE 1 (bmsa) IS ACTIVE MEMBER OF CLUSTER
[    CSSD]2012-10-17 23:40:29.180 [18] >TRACE:   clssgmReconfigThread:  started for reconfig (1)
[    CSSD]2012-10-17 23:40:29.180 [18] >USER:    NMEVENT_RECONFIG [00][00][00][02]
[    CSSD]2012-10-17 23:40:29.180 [18] >TRACE:   clssgmEstablishConnections: 1 nodes in cluster incarn 1
[    CSSD]2012-10-17 23:40:29.180 [14] >TRACE:   clssgmPeerListener: connects done (1/1)
[    CSSD]2012-10-17 23:40:29.180 [18] >TRACE:   clssgmEstablishMasterNode: MASTER for 1 is node(1) birth(1)
[    CSSD]2012-10-17 23:40:29.180 [18] >TRACE:   clssgmChangeMasterNode: requeued 0 RPCs
[    CSSD]2012-10-17 23:40:29.180 [18] >TRACE:   clssgmMasterCMSync: Synchronizing group/lock status
[    CSSD]2012-10-17 23:40:29.180 [18] >TRACE:   clssgmMasterSendDBDone: group/lock status synchronization complete
[    CSSD]CLSS-3000: reconfiguration successful, incarnation 1 with 1 nodes

[    CSSD]CLSS-3001: local node number 1, master node number 1

[    CSSD]2012-10-17 23:40:29.180 [18] >TRACE:   clssgmReconfigThread:  completed for reconfig(1), with status(1)
[    CSSD]2012-10-17 23:40:29.281 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(1007738b0) proc(100b1e050) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:40:29.437 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(100b1bd90) proc(100b1ea20) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:40:30.080 [17] >TRACE:   clssnmWaitForAcks: done, msg type(15)
[    CSSD]2012-10-17 23:40:30.080 [17] >TRACE:   clssnmDoSyncUpdate: Sync Complete!
[    CSSD]2012-10-17 23:40:30.893 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(100b23370) proc(100b267b0) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:40:30.916 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(100b27090) proc(100b29350) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:40:34.623 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(100b29370) proc(100b27c60) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:40:34.642 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(100b23b40) proc(100b29600) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:41:34.799 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(100b22210) proc(100b27a90) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:41:34.815 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(100b23bf0) proc(100b29690) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:41:35.198 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(100b22210) proc(100b27a90) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:41:35.216 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(100b26510) proc(100b24ce0) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:41:35.525 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(100b221c0) proc(100b26720) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:41:35.529 [11] >WARNING: clssgmShutDown: Received explicit shutdown request from client.
[    CSSD]2012-10-17 23:41:35.529 [11] >TRACE:   clssgmClientShutdown: Aborting client (100b1f820) proc (100b1e050)
[    CSSD]2012-10-17 23:41:35.529 [11] >TRACE:   clssgmClientShutdown: Aborting client (100b20050) proc (100b1e050)
[    CSSD]2012-10-17 23:41:35.529 [11] >TRACE:   clssgmClientShutdown: Aborting client (100b20ab0) proc (100b1e050)
[    CSSD]2012-10-17 23:41:35.529 [11] >TRACE:   clssgmClientShutdown: Aborting client (100b1ed80) proc (100b1ea20)
[    CSSD]2012-10-17 23:41:35.529 [11] >TRACE:   clssgmClientShutdown: waited 0 seconds on 4 IO capable clients
[    CSSD]2012-10-17 23:41:35.539 [11] >WARNING: clssgmClientShutdown: graceful shutdown completed.

23:41:35.529 收到shutdown request 23:41:35.539 系统reboot,日志中”clssnm_skgxnmon: skgxn init failed, rc 1” 值得关注,对于这个错误,可以参考一些文章

DBA Note
RAC Reboot due to system time change
这两篇文章均提到了chang system time 导致的css reboot行为.并且提到了使用non fatal mode启动的方案:

In some extreme cases, it may be necessary to disable fatal mode for OPROCD to find a root cause. DO NOT do this in a production environment. This is completely unsupported and could cause data corruption due to lack of I/O fencing.

1. Back up the init.cssd file. Example:

Sun and Linux:
cp /etc/init.d/init.cssd /etc/init.d/init.cssd.bak

HP-UX and HP Tru64:
cp /sbin/init.d/init.cssd /sbin/init.d/init.cssd.bak

IBM AIX:
cp /etc/init.cssd /etc/init.cssd.bak

2. Stop the CRS stack or boot the node in single user mode.

To stop the CRS Stack:

Sun and Linux:
/etc/init.d/init.crs stop

HP-UX and HP Tru64:
/sbin/init.d/init.crs stop

IBM AIX:
/etc/init.crs/init.crs stop

3. Confirm that the CRS stack is down.

ps -ef | grep d.bin

4. Edit the init.cssd file from the location in step 1, change the OPROCD
startup line to a non-fatal startup:

Sun Example:

# in fatal mode we always will start OPROCD FATAL
if [ $OPROCD_EXISTS ]; then
$OPROCD start -t $OPROCD_DEFAULT_TIMEOUT -m $OPROCD_DEFAULT_MARGIN
$OPROCD check -t $OPROCD_CHECK_TIMEOUT 2>$NULL
fi

Change this to:

# in fatal mode we always will start OPROCD FATAL
if [ $OPROCD_EXISTS ]; then
$OPROCD startInstall -t $OPROCD_DEFAULT_TIMEOUT -m $OPROCD_DEFAULT_MARGIN
$OPROCD check -t $OPROCD_CHECK_TIMEOUT 2>$NULL
fi

You could also combine this method with the ‘tracing system calls’ method
for more debugging.

5. Reboot the node(s)

对于这种方法没有敢尝试,采用了修改cssd “DISABLE_OPROCD”的方法具体如下:

#!/bin/sh
#
# Copyright (c) 2001, 2010, Oracle and/or its affiliates. All rights reserved. 
#
# init.cssd - Control script for the Oracle CSS daemon.
#
#   In a full RAC install, this should not be in an rcX.d
#   directory. It should only be invoked from init.crs.
#   Never by hand.
#
#     No manual invocation of init.cssd is supported on a cluster.
#
#   In a local, non-cluster, installation without RAC, it should
#   be placed in an rcX.d directory. It may be invoked by hand
#   if necessary, however there are a series of risks and complications
#   that should be kept in mind when doing so.
#
#     Actions usable in local-only configuration are:
#          start, stop, disable and enable.
#   
# ==========================================================
# Porting Definitions and requirements:
#
# FAST_REBOOT - take out the machine now. We are concerned about
#               data integrity since the other node has evicted us.
# SLOW_REBOOT - We can rely on vendor clusterware to delay sending
#               the reconfig to the other node due to its IO fencing
#               guarantees. So trigger its diagnostic reboot.
# VC_UP - Determine whether Vendor clusterware processes are active.
#         If they are, then the remote node will hear that CSS/CLSVMON
#         have died, and we will be forced to use FAST_REBOOT.
#         This is also used at startup time for dependency checking.
#         Returns 0 if they are up, 1 otherwise. This should be
#         an extremely fast check.
# CLINFO - Determine whether we are booted in non-cluster mode.
#          Returns 0 for cluster mode, 1 for non-cluster mode
#          This call is allowed to take a long time to decide.
#
# GETBOOTID - Returns a string that uniquely identifies this boot.
#             This string must never change while the machine is up,
#             and must change the next time the machine boots.

ORA_CRS_HOME=/u01/app/oracle/product/10.2.0/crs_1
ORACLE_USER=oracle

ORACLE_HOME=$ORA_CRS_HOME

export ORACLE_HOME
export ORA_CRS_HOME
export ORACLE_USER

# Set DISABLE_OPROCD to false. Platforms that do not ship an oprocd
# binary should override this below.
DISABLE_OPROCD=false   -------------modify to true
# Default OPROCD timeout values defined here, so that it can be
# over-ridden as needed by a platform.
# default Timout of 1000 ms and a margin of 500ms
OPROCD_DEFAULT_TIMEOUT=1000
OPROCD_DEFAULT_MARGIN=500
# default Timeout for other actions
OPROCD_CHECK_TIMEOUT=2000

.................

手动reboot两个节点之后 root@bmsa#ps -ef |grep op
oracle 1706 1 0 09:23:56 ? 0:00 /u01/app/oracle/product/10.2.0/crs_1/opmn/bin/ons -d
oracle 1707 1706 0 09:23:56 ? 0:00 /u01/app/oracle/product/10.2.0/crs_1/opmn/bin/ons -d
root 14424 14345 0 11:36:50 pts/4 0:00 grep op

oprocd 进程已经消失 此时仍以fatal mode启动

[    CSSD]2012-10-17 23:46:46.883 [1] >TRACE:   clssnmFatalInit: fatal mode enabled
[    CSSD]2012-10-17 23:46:58.025 [19] >TRACE:   clssgmMasterSendDBDone: group/lock status synchronization complete
[    CSSD]CLSS-3000: reconfiguration successful, incarnation 2 with 2 nodes

[    CSSD]CLSS-3001: local node number 1, master node number 1

[    CSSD]2012-10-17 23:46:58.025 [19] >TRACE:   clssgmReconfigThread:  completed for reconfig(2), with status(1)
[    CSSD]2012-10-17 23:46:58.959 [17] >TRACE:   clssnmWaitForAcks: done, msg type(15)
[    CSSD]2012-10-17 23:46:58.959 [17] >TRACE:   clssnmDoSyncUpdate: Sync Complete!

此时已经不再出现reboot行为,迅速升级到10.2.0.5的crs,修改回 cssd :

-bash-3.2$ ps -ef |grep op
  oracle 27803 15868   0 13:48:17 pts/4       0:00 grep op
    root  1172   743   0 09:23:41 ?           0:00 /bin/sh /etc/init.d/init.cssd oprocd
  oracle 24312     1   0 13:41:52 ?           0:00 /u01/app/oracle/product/10.2.0/crs_1/opmn/bin/ons -d
    root  1339  1172   0 09:23:41 ?           0:01 /u01/app/oracle/product/10.2.0/crs_1/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5
  oracle 24313 24312   0 13:41:52 ?           0:00 /u01/app/oracle/product/10.2.0/crs_1/opmn/bin/ons -d
  
-bash-3.2$ crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora....SM1.asm application    ONLINE    ONLINE    bmsa        
ora....SA.lsnr application    ONLINE    ONLINE    bmsa        
ora.bmsa.gsd   application    ONLINE    ONLINE    bmsa        
ora.bmsa.ons   application    ONLINE    ONLINE    bmsa        
ora.bmsa.vip   application    ONLINE    ONLINE    bmsa        
ora....SM2.asm application    ONLINE    ONLINE    bmsb        
ora....SB.lsnr application    ONLINE    ONLINE    bmsb        
ora.bmsb.gsd   application    ONLINE    ONLINE    bmsb        
ora.bmsb.ons   application    ONLINE    ONLINE    bmsb        
ora.bmsb.vip   application    ONLINE    ONLINE    bmsb        
ora....SM3.asm application    ONLINE    ONLINE    bmsc        
ora....SC.lsnr application    ONLINE    ONLINE    bmsc        
ora.bmsc.gsd   application    ONLINE    ONLINE    bmsc        
ora.bmsc.ons   application    ONLINE    ONLINE    bmsc        
ora.bmsc.vip   application    ONLINE    ONLINE    bmsc