Environment: Red Hat 5.5 + Oracle 10g RAC + NetApp

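The first problem involved DMM (device-mapper multipath): Oracle could not recognize the devices that the NetApp storage presented through the DMM software, and reported the following error: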

the location /dev/mapper/ocr1 ,entered for the oracle cluster registry(OCR) is not shared across all the nodes in the cluster…

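At first I suspected that the device was not actually shared. I ran dd (if=/of=) on each of the two nodes and found that data written on one node could be read on the other, which confirmed that the disk is shared. The multipath configuration also showed no problems.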
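The dd check described above can be sketched roughly as follows. This is a minimal sketch, not the exact commands used at the time: the helper names write_marker and read_marker and the marker string are hypothetical, and writing to the device overwrites its first bytes, so it only makes sense on a disk that does not yet hold OCR or voting data.

```shell
#!/bin/sh
# Hypothetical helpers for a shared-disk sanity check with dd.

# write_marker DEVICE TEXT: overwrite the start of DEVICE with TEXT.
write_marker() {
    printf '%s' "$2" | dd of="$1" bs=512 count=1 conv=notrunc 2>/dev/null
}

# read_marker DEVICE LENGTH: print the first LENGTH bytes of DEVICE.
read_marker() {
    dd if="$1" bs=512 count=1 2>/dev/null | head -c "$2"
}

# Usage (as root; /dev/mapper/ocr1 is the device from the error above):
#   node1# write_marker /dev/mapper/ocr1 "shared-test-node1"
#   node2# read_marker  /dev/mapper/ocr1 17
```

If node2 prints the string written on node1, both nodes are looking at the same LUN.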

A search on Metalink turned up the cause: Configuring raw devices (multipath) for Oracle Clusterware 10g Release 2 (10.2.0) on RHEL5/OEL5 [ID 564580.1]

During the installation of Oracle Clusterware 10g Release 2 (10.2.0), the Universal Installer (OUI) is unable to verify the sharedness of block devices, therefore requires the use of raw devices (whether to singlepath or multipath devices) to be specified for OCR and voting disks. As mentioned earlier, this is no longer the case from Oracle11g R1 (11.1.0) that can use multipathed block devices directly

In other words, in 10g the OCR and voting disks must be placed on raw devices; support for devices mapped by DMM only begins with 11g.

Bind raw devices to the DMM devices:

# raw /dev/raw/raw1 /dev/mapper/ocr1
/dev/raw/raw1: bound to major 253, minor 11
# raw /dev/raw/raw2 /dev/mapper/ocr2
/dev/raw/raw2: bound to major 253, minor 8
# raw /dev/raw/raw3 /dev/mapper/ocr3
/dev/raw/raw3: bound to major 253, minor 10
# raw /dev/raw/raw4 /dev/mapper/voting1
/dev/raw/raw4: bound to major 253, minor 5
# raw /dev/raw/raw5 /dev/mapper/voting2
/dev/raw/raw5: bound to major 253, minor 4
# raw /dev/raw/raw6 /dev/mapper/voting3
/dev/raw/raw6: bound to major 253, minor 7

Red Hat 5.5 has re-enabled the rawdevices service, so the commands above can be used directly.
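Bindings created with the raw command do not survive a reboot on their own. One way to make them persistent, assuming the rawdevices service reads /etc/sysconfig/rawdevices as it does on RHEL 5, is to list the same pairs in that file. The function name print_raw_bindings is hypothetical, and this step is not something the original notes record doing; it is only a sketch.

```shell
#!/bin/sh
# Print "<raw device> <block device>" lines, one per binding, in the
# same order as the raw commands above (raw1..raw3 for OCR,
# raw4..raw6 for the voting disks).
print_raw_bindings() {
    n=1
    for dev in ocr1 ocr2 ocr3 voting1 voting2 voting3; do
        echo "/dev/raw/raw$n /dev/mapper/$dev"
        n=$((n + 1))
    done
}

# Usage (as root), assuming the rawdevices service reads this file:
#   print_raw_bindings >> /etc/sysconfig/rawdevices
#   service rawdevices restart
```

Ownership and permissions of the /dev/raw/raw* nodes still have to be set separately for the Oracle installer.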

The installation continued. At the final step, running the CRS root.sh script failed with:

PROT-1: Failed to initialize ocrconfig
Failed to upgrade Oracle Cluster Registry configuration

./clsfmt ocr /dev/raw/raw1
clsfmt: Received unexpected error 4 from skgfifi
skgfifi: Additional information: -2
Additional information: 1000718336

Changes

It has been found that the following changes can cause this problem to occur:

1. Use Multiple Path (MP) disk configuration, may hit this issue.
2. Use EMC device (powerpath**) may hit this issue.

But it was not confirmed that these are the only things that can cause this problem to occur, as it has been found that on other hardware and configuration the problem might occur, the key change in this issue is that if the disk size presented from the storage to the cluster nodes are not dividable by 4K the problem should occur.
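Since the note singles out disk sizes that are not a multiple of 4K, it can be useful to check the sizes of the presented devices before installing. The sketch below assumes that blockdev --getsize64 is available (it is part of util-linux); the function name is_4k_multiple is hypothetical.

```shell
#!/bin/sh
# is_4k_multiple BYTES: succeed if BYTES is an exact multiple of 4096.
is_4k_multiple() {
    [ $(( $1 % 4096 )) -eq 0 ]
}

# Usage (as root); blockdev --getsize64 prints the device size in bytes:
#   for d in /dev/mapper/ocr1 /dev/mapper/voting1; do
#       size=$(blockdev --getsize64 "$d")
#       if is_4k_multiple "$size"; then
#           echo "$d: $size bytes, multiple of 4K"
#       else
#           echo "$d: $size bytes, NOT a multiple of 4K"
#       fi
#   done
```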

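We were using the DMM software, which is exactly what triggered this bug. After applying Patch 4679769 the installation continued, until running vipca on node2 failed with: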

[root@db-35 bin]# ./vipca
/home/oracle/product/10.2/crs/jdk/jre//bin/java: error while loading shared libraries: libpthread.so.0: cannot open shared object file: No such file or directory

Solution

Do the following on both nodes.

Edit the vipca script:

vi vipca

Linux) LD_LIBRARY_PATH=$ORACLE_HOME/lib:$ORACLE_HOME/srvm/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

#Remove this workaround when the bug 3937317 is fixed
arch=`uname -m`
if [ "$arch" = "i686" -o "$arch" = "ia64" -o "$arch" = "x86_64" ]
then
LD_ASSUME_KERNEL=2.4.19
export LD_ASSUME_KERNEL
fi
#End workaround
Add this line here -> unset LD_ASSUME_KERNEL

Make the same change in srvctl:

LD_ASSUME_KERNEL=2.4.19
export LD_ASSUME_KERNEL

Add this line here -> unset LD_ASSUME_KERNEL
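The same edit can be scripted instead of made by hand in vi. The following is only a sketch: the function name add_unset is hypothetical, and it places the unset directly after each "export LD_ASSUME_KERNEL" line (inside the if block in vipca) rather than after the "#End workaround" comment, which has the same effect at run time. It keeps a .bak copy and does nothing if the file already contains the unset line.

```shell
#!/bin/sh
# add_unset FILE: add "unset LD_ASSUME_KERNEL" after each
# "export LD_ASSUME_KERNEL" line in FILE, keeping FILE.bak as a backup.
add_unset() {
    # Skip files that were already patched.
    grep -q '^unset LD_ASSUME_KERNEL' "$1" && return 0
    cp -p "$1" "$1.bak" || return 1
    # Writing back through a redirect keeps the file's mode and owner.
    awk '{ print }
         /^[ \t]*export LD_ASSUME_KERNEL/ { print "unset LD_ASSUME_KERNEL" }' \
        "$1.bak" > "$1"
}

# Usage on both nodes, from the CRS home's bin directory:
#   add_unset vipca
#   add_unset srvctl
```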

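At this point the CRS installation was complete. I then installed the database software, downloaded p8202632_10205_Linux-x86-64.zip, upgraded both CRS and the database software to 10.2.0.5, and created the database with DBCA.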

Setting diagwait:

Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware Node evictions [ID 559365.1]


Symptoms

Oracle Clusterware evicts the node from the cluster when

Node is not pinging via the network heartbeat
Node is not pinging the Voting disk
Node is hung/busy and is unable to perform either of the earlier tasks

In most cases when the node is evicted, there is information written to the logs to analyze the cause of the node eviction. However, in certain cases this may be missing. The steps documented in this note are to be used for those cases where there is not enough information or no information to diagnose the cause of the eviction for Clusterware versions less than 11gR2 (11.2.0.1).

Starting with 11.2.0.1, Customers do not need to set diagwait as the architecture has been changed.

Changes

None

Cause

When the node is evicted and the node is extremely busy in terms of CPU (or lack of it), it is possible that the OS did not get time to flush the logs/traces to the file system. It may be useful to set the diagwait attribute to delay the node reboot to give additional time to the OS to write the traces. This setting will provide more time for diagnostic data to be collected safely and will NOT increase the probability of corruption. After setting diagwait, the Clusterware will wait an additional 10 seconds (diagwait - reboottime). Customers can unset diagwait by following the steps documented below after fixing their OS scheduling issues.

* Diagwait can be set on Windows, but it does not change the behaviour as it does on Unix-Linux platforms

@ For internal Support Staff
Diagwait attribute was introduced in 10.2.0.3 and is included in 10.2.0.4 & 11.1.0.6 and higher releases. It has also been subsequently backported to 10.1.0.5 on most platforms. This means it is possible to set diagwait on 10.1.0.5 (or higher), 10.2.0.3 (or higher) and in 11.1.0.6 (or higher). If the command crsctl set/get css diagwait reports "unrecognized parameter diagwait specified" then it can be safely assumed that the Clusterware version does not have the necessary fixes to implement diagwait. If that is the case then the customer is advised to apply the latest patchset available before attempting to set diagwait.

Solution

It is important that the clusterware stack is down on all the nodes when changing diagwait. The following steps provide step-by-step instructions on setting diagwait.

1. Execute as root:
#crsctl stop crs
#<CRS_HOME>/bin/oprocd stop
2. Ensure that the Clusterware stack is down on all nodes by executing:
#ps -ef |egrep "crsd.bin|ocssd.bin|evmd.bin|oprocd"
This should return no processes. If there are clusterware processes running and you proceed to the next step, you will corrupt your OCR. Do not continue until the clusterware processes are down on all the nodes of the cluster.
3. From one node of the cluster, change the value of the "diagwait" parameter to 13 seconds by issuing the command as root:
#crsctl set css diagwait 13 -force
4. Check that diagwait is set successfully by executing the following command. The command should return 13. If diagwait is not set, the following message will be returned: "Configuration parameter diagwait is not defined"
#crsctl get css diagwait
5. Restart the Oracle Clusterware on all the nodes by executing:
#crsctl start crs
6. Validate that the node is running by executing:
#crsctl check crs
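
Because changing diagwait while any clusterware process is still running can corrupt the OCR, the process check in the steps above is worth wrapping in a small guard. A minimal sketch follows; the function name clusterware_procs is hypothetical, and it simply filters ps -ef output with the same pattern the note uses while dropping the grep processes themselves.

```shell
#!/bin/sh
# clusterware_procs: read `ps -ef` output on stdin and print only the
# lines that belong to Clusterware daemons. Lines for grep/egrep
# commands (which also contain the pattern) are dropped.
clusterware_procs() {
    grep -E 'crsd\.bin|ocssd\.bin|evmd\.bin|oprocd' | grep -v 'grep'
}

# Usage on every node before running "crsctl set css diagwait":
#   if [ -n "$(ps -ef | clusterware_procs)" ]; then
#       echo "Clusterware is still running; do not change diagwait" >&2
#   fi
```
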
Unsetting/Removing diagwait

Customers should not unset diagwait without fixing the OS scheduling issues, as that can lead to node evictions via reboot. Diagwait delays the node eviction (and reconfiguration) by diagwait (13) seconds, and as such setting diagwait does not affect most customers. If there is a need to remove diagwait, the steps above need to be followed, except that step 3 is replaced by the following command:
#crsctl unset css diagwait -f

(Note: the -f option must be used when unsetting diagwait since CRS will be down when doing so)

With that, this RAC installation was completed successfully.