一套新上线的RAC系统(redhat 5.8 RAC 11.2.0.3) 安装过程中跑GI root.sh时抛出error 481错误:

CRS-2672: Attempting to start 'ora.asm' on 'ptdb1'
CRS-5017: The resource action "ora.asm start" encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0
. For details refer to "(:CLSN00107:)" in "/g01/app/oracle/product/11.2.0/grid/log/agent/ohasd/oraagent_grid/oraagent_grid.log".
CRS-2674: Start of 'ora.asm' on 'racnode1' failed
..
Failed to start ASM at /software/app/11.2.0.3/crs/install/crsconfig_lib.pm line 1272

这一问题的解决可以参考如下文档:


ASM on Non First Node (Second or Other Node) Fails to Come up With: PMON (ospid: nnnn): terminating the instance due to error 481

In this Document
Purpose
Details
Case1: link local IP (169.254.x.x) is being used by other adapter/network
Case2: firewall exists between nodes on private network (iptables etc)
Case3: HAIP is up on some nodes but not on all
Case4: HAIP is up on all nodes but some do not have route info
References
Applies to:

Oracle Server – Enterprise Edition – Version 11.2.0.1 and later
Information in this document applies to any platform.
Purpose

This note lists common causes of ASM start up failure with the following error on non-first node (second or others):

alert_.log from non-first node
lmon registered with NM – instance number 2 (internal mem no 1)
Tue Dec 06 06:16:15 2011
System state dump requested by (instance=2, osid=19095 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /g01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_diag_19138.trc
Tue Dec 06 06:16:15 2011
PMON (ospid: 19095): terminating the instance due to error 481
Dumping diagnostic data in directory=[cdmp_20111206061615], requested by (instance=2, osid=19095 (PMON)), summary=[abnormal instance termination].
Tue Dec 06 06:16:15 2011
ORA-1092 : opitsk aborting process

Note: ASM instance terminates shortly after “lmon registered with NM”
If ASM on non-first node was running previously, likely the following will be in alert.log when it failed originally:
..
IPC Send timeout detected. Sender: ospid 32231 [oracle@ftdcslsedw01b (PING)]
..
ORA-29740: evicted by instance number 1, group incarnation 10
..

diag trace from non-first ASM (+ASMn_diag_.trc)
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE])

alert_.log from first node
LMON (ospid: 15986) detects hung instances during IMR reconfiguration
LMON (ospid: 15986) tries to kill the instance 2 in 37 seconds.
Please check instance 2’s alert log and LMON trace file for more details.
..
Remote instance kill is issued with system inc 64
Remote instance kill map (size 1) : 2
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2
Reconfiguration started (old inc 64, new inc 66)

If the issue happens while running root script (root.sh or rootupgrade.sh) as part of Grid Infrastructure installation/upgrade process, the following symptoms will present:

root script screen output
Start of resource “ora.asm” failed

CRS-2672: Attempting to start ‘ora.asm’ on ‘racnode1’
CRS-5017: The resource action “ora.asm start” encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0
. For details refer to “(:CLSN00107:)” in “/ocw/grid/log/racnode1/agent/ohasd/oraagent_grid/oraagent_grid.log”.
CRS-2674: Start of ‘ora.asm’ on ‘racnode1’ failed
..
Failed to start ASM at /ispiris-qa/app/11.2.0.3/crs/install/crsconfig_lib.pm line 1272
$GRID_HOME/cfgtoollogs/crsconfig/rootcrs_.log
2011-11-29 15:56:48: Executing cmd: /ispiris-qa/app/11.2.0.3/bin/crsctl start resource ora.asm -init
..
> CRS-2672: Attempting to start ‘ora.asm’ on ‘racnode1’
> CRS-5017: The resource action “ora.asm start” encountered the following error:
> ORA-03113: end-of-file on communication channel
> Process ID: 0
> Session ID: 0 Serial number: 0
> . For details refer to “(:CLSN00107:)” in “/ispiris-qa/app/11.2.0.3/log/racnode1/agent/ohasd/oraagent_grid/oraagent_grid.log”.
> CRS-2674: Start of ‘ora.asm’ on ‘racnode1’ failed
> CRS-2679: Attempting to clean ‘ora.asm’ on ‘racnode1’
> CRS-2681: Clean of ‘ora.asm’ on ‘racnode1’ succeeded
..
> CRS-4000: Command Start failed, or completed with errors.
>End Command output
2011-11-29 15:59:00: Executing cmd: /ispiris-qa/app/11.2.0.3/bin/crsctl check resource ora.asm -init
2011-11-29 15:59:00: Executing cmd: /ispiris-qa/app/11.2.0.3/bin/crsctl status resource ora.asm -init
2011-11-29 15:59:01: Checking the status of ora.asm
..
2011-11-29 15:59:53: Start of resource “ora.asm” failed
Details

Case1: link local IP (169.254.x.x) is being used by other adapter/network

Symptoms:

$GRID_HOME/log//alert.log
[/ocw/grid/bin/orarootagent.bin(4813)]CRS-5018:(:CLSN00037:) Removed unused HAIP route: 169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0
OS messages (optional)
Dec 6 06:11:14 racnode1 dhclient: DHCPREQUEST on usb0 to 255.255.255.255 port 67
Dec 6 06:11:14 racnode1 dhclient: DHCPACK from 169.254.95.118
ifconfig -a
..
usb0 Link encap:Ethernet HWaddr E6:1F:13:AD:EE:D3
inet addr:169.254.95.120 Bcast:169.254.95.255 Mask:255.255.255.0
..

Note: it’s usb0 in this case, but it can be any other adapter which uses link local

Solution:

Link local IP must not be used by any other network on cluster nodes. In this case, an USB network device gets IP 169.254.95.118 from DHCP server which disrupted HAIP routing, and solution is to black list the device in udev from being activated automatically.
Case2: firewall exists between nodes on private network (iptables etc)

No firewall is allowed on private network (cluster_interconnect) between nodes including software firewall like iptables, ipmon etc
Case3: HAIP is up on some nodes but not on all

Symptoms:

alert_<+ASMn>.log for some instances
Cluster communication is configured to use the following interface(s) for this instance
10.1.0.1
alert_<+ASMn>.log for other instances
Cluster communication is configured to use the following interface(s) for this instance
169.254.201.65

Note: some instances is using HAIP while others are not, so they can not talk to each other
Solution:

The solution is to bring up HAIP on all nodes.

To find out HAIP status, execute the following on all nodes:

$GRID_HOME/bin/crsctl stat res ora.cluster_interconnect.haip -init

If it’s offline, try to bring it up as root:

$GRID_HOME/bin/crsctl start res ora.cluster_interconnect.haip -init

If HAIP fails to start, refer to note 1210883.1 for known issues.

If the “up node” is not using HAIP, and no outage is allowed, the workaround is to set init.ora/spfile parameter cluster_interconnect to the private IP of each node to allow ASM/DB to come up on “down node”. Once a maintenance window is planned, the parameter must be removed to allow HAIP to work.

Case4: HAIP is up on all nodes but some do not have route info

Symptoms:

alert_<+ASMn>.log for all instances
Cluster communication is configured to use the following interface(s) for this instance
169.254.xxx.xxx
“netstat -rn” for some nodes (surviving nodes) missing HAIP route
netstat -rn
Destination Gateway Genmask Flags MSS Window irtt Iface
161.130.90.0 0.0.0.0 255.255.248.0 U 0 0 0 bond0
160.131.11.0 0.0.0.0 255.255.255.0 U 0 0 0 bond2
0.0.0.0 160.11.80.1 0.0.0.0 UG 0 0 0 bond0

The line for HAIP is missing, i.e:

169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 bond2

Note: As HAIP route info is missing on some nodes, HAIP is not pingable; usually newly restarted node will have HAIP route info
Solution:

The solution is to manually add HAIP route info on the nodes that’s missing:

4.1. Execute “netstat -rn” on any node that has HAIP route info and locate the following:

169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 bond2

Note: the first field is HAIP subnet ID and will start with 169.254.xxx.xxx, the third field is HAIP subnet netmask and the last field is private network adapter name

4.2. Execute the following as root on the node that’s missing HAIP route:

# route add -net netmask dev

i.e.

# route add -net 169.254.0.0 netmask 255.255.0.0 dev bond2

4.3. Start ora.crsd as root on the node that’s partial up:.

# $GRID_HOME/bin/crsctl start res ora.crsd -init

The other workaround is to restart GI on the node that’s missing HAIP route with “crsctl stop crs -f” and “crsctl start crs” command as root.

此时抛出的错误确实是usb0导致,kvm的控制台使用dbcp动态分配ip使得两台app服务器分配了169.254网段的ip.禁止掉dhcp之后手动分配ip继续抛出如下错误:

[cssd(22096)]CRS-1605:CSSD voting file is online: /dev/mapper/mpath3; details in /g01/app/oracle/product/11.2.0/grid/log/ptdb2/cssd/ocssd.log.
[cssd(22096)]CRS-1636:The CSS daemon was started in exclusive mode but found an active CSS daemon on node ptdb1 and is terminating; details at (:CSSNM00006:) in /g01/app/oracle/product/11.2.0/grid/log/ptdb2/cssd/ocssd.log
2012-10-15 10:07:33.421
[ohasd(20942)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'ptdb2'.

2012-10-15 10:11:27.669
[cssd(22425)]CRS-1662:Member kill requested by node ptdb1 for member number 1, group DB+ASM
2012-10-15 10:11:29.931
[/g01/app/oracle/product/11.2.0/grid/bin/oraagent.bin(22334)]CRS-5019:All OCR locations are on ASM disk groups [OCRVOT], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/g01/app/oracle/product/11.2.0/grid/log/ptdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2012-10-15 10:11:31.079
[/g01/app/oracle/product/11.2.0/grid/bin/oraagent.bin(22334)]CRS-5019:All OCR locations are on ASM disk groups [OCRVOT], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/g01/app/oracle/product/11.2.0/grid/log/ptdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2012-10-15 10:11:31.119
[/g01/app/oracle/product/11.2.0/grid/bin/oraagent.bin(22334)]CRS-5019:All OCR locations are on ASM disk groups [OCRVOT], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/g01/app/oracle/product/11.2.0/grid/log/ptdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2012-10-15 10:11:31.197
[/g01/app/oracle/product/11.2.0/grid/bin/oraagent.bin(22334)]CRS-5019:All OCR locations are on ASM disk groups [OCRVOT], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/g01/app/oracle/product/11.2.0/grid/log/ptdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2012-10-15 10:11:32.837
[/g01/app/oracle/product/11.2.0/grid/bin/orarootagent.bin(22496)]CRS-5016:Process "/g01/app/oracle/product/11.2.0/grid/bin/acfsload" spawned by agent "/g01/app/oracle/product/11.2.0/grid/bin/orarootagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/g01/app/oracle/product/11.2.0/grid/log/ptdb2/agent/ohasd/orarootagent_root/orarootagent_root.log"
2012-10-15 10:11:33.068
[/g01/app/oracle/product/11.2.0/grid/bin/oraagent.bin(22334)]CRS-5019:All OCR locations are on ASM disk groups [OCRVOT], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/g01/app/oracle/product/11.2.0/grid/log/ptdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".

查看 orarootagent_root.log

2012-10-15 10:08:40.295: [ USRTHRD][1115584832] {0:0:209} HAIP:  Updating member info HAIP1;172.168.0.0#0
2012-10-15 10:08:40.298: [ USRTHRD][1115584832] {0:0:209} InitializeHaIps[ 0]  infList 'inf eth1, ip 172.168.0.102, sub 172.168.0.0'
2012-10-15 10:08:40.299: [ USRTHRD][1115584832] {0:0:209} Error in getting Key SYSTEM.network.haip.group.cluster_interconnect.interface.valid in OCR
2012-10-15 10:08:40.303: [ CLSINET][1115584832] failed to open OLR HAIP subtype SYSTEM.network.haip.group.cluster_interconnect.interface.valid key, rc=4

问题就很好定位了,虽然所有node的res haip的状态为online 但是手动去ping对方 无法ping通 节点2无法通过haip获得必要资源

两个原因导致这个问题:

1. 网卡的driver有问题,导致在private 网卡上虚拟出来的haip 无法通信

2. 交换机的配置有问题有可能vlan的划分 或者交换机的bug导致这一现象

所以对上述文档的case ,我们可以增加一种:

CASE 5

如果HAIP 均online 并且 netstat -rn 均存在相关entry,需要保证node之间的haip能够ping通。在此例中 如果手工”restart res haip”则问题可以得到暂时解决.但是如果reboot机器 问题依旧,最后采取交叉线直连两个节点解决问题,如果节点数大于2 需要从交换机着手此问题。

对于11.2 CRS troubleshooting ,主要的日志有下面这些:
<GRID_HOME>/log/<host>/alert<node>.log :这是CRS 的 alert log,CRS 的启动/停止,核心进程(crsd, ocssd etc.)的启动/停止,对于一些资源的检查异常都会记录下来。

<GRID_HOME>/log/<host>/crsd : 这是CRSD的日志,记录了CRSD 监控的资源的启动、停止以及异常,比如数据库实例、ASM、监听、磁盘组、VIP等。
<GRID_HOME>/log/<host>/cssd: 这是OCSSD进程的日志,记录了节点间的通讯情况,如果有网络心跳或者Voting disk的异常,或者OCSSD进程异常终止,都会记录下来。
<GRID_HOME>/log/<host>/evmd:这是事件监控(event monitor)进程的日志。
<GRID_HOME>/log/<host>/ohasd:这是OHASD进程的日志,有的时候CRS无法启动,可能需要查看这个日志。
<GRID_HOME>/log/<host>/mdnsd:DNS相关的日志
<GRID_HOME>/log/<host>/gipcd:进程间或者节点间通讯的日志
<GRID_HOME>/log/<host>/gpnpd: 关于GPNP(Grid Plug & Play Profile )的日志
<GRID_HOME>/log/<host>/gnsd(可选):Grid Naming Service 的日志
<GRID_HOME>/log/<host>/ctssd: 时间同步服务的日志。

<GRID_HOME>/log/<host>/agent/ohasd:
<GRID_HOME>/log/<host>/agent/ohasd/oraagent_grid: 启动/停止/检查/清除ora.asm, ora.evmd, ora.gipcd, ora.gpnpd, ora.mdnsd等资源。
<GRID_HOME>/log/<host>/agent/ohasd/orarootagent_root:启动/停止 /检查/清除 ora.crsd, ora.ctssd, ora.diskmon, ora.drivers.acfs, ora.crf (11.2.0.2)等资源。
<GRID_HOME>/log/<host>/agent/ohasd/oracssdagent_root: 启动/停止/检查 ocssd进程。
<GRID_HOME>/log/<host>/agent/ohasd/oracssdmonitor_root:监控cssdagent进程。

<GRID_HOME>/log/<host>/agent/crsd:
<GRID_HOME>/log/<host>/agent/crsd/oraagent_grid: 启动/停止/检查/清除 asm, ora.eons, ora.LISTENER.lsnr, SCAN listeners, ora.ons, diskgroup  等资源
<GRID_HOME>/log/<host>/agent/crsd/oraagent_oracle: 启动/停止/检查/清除 service, database 等资源
<GRID_HOME>/log/<host>/agent/crsd/orarootagent_root : 启动/停止/检查/清除 GNS, VIP, SCAN VIP and network 等资源.
<GRID_HOME>/log/<host>/agent/crsd/scriptagent_grid:  定制的应用服务的日志。