发布对autoswitch over dg脚本DGHA测试,此脚本针对oracle dataguard设计使用共享存储存放redo以及controlfile从而达到了切换数据0丢失。
同时切换以后,original primary database可以无缝加入整个dg集群,从而形成了dataguard高可用方案。
采用perl脚本定制开发,可任意部署于任何一台可连接至集群的机器(最好不要用集群中的机器做监控机),多重判断机制,从而增加了切换时的可靠性以及准确性。感谢作者 noodba. noodba作为部门数据架构师,定制开发了myawr(mysql性能分析工具)

测试环境:

10.0.0.63 (primary database)
10.0.0.24 (standby database)
10.0.0.13 (standby database)
切换site指定为63<->24 13作为恒定standby存在

Step 1. 使用swingbench作为压测工具 (导入10G 数据)

QQ图片20130815220940

[oracle@userdb-pre admin]$ ora active
 17:23:15 up 1 day,  1:15,  5 users,  load average: 2.99, 0.97, 0.34

   SID USERNAME   MACHINE		 EVENT				PARAM			W   WT SQL			    ST	   LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
  1089 TEST	  JDBC Thin Client	 SQL*Net more data from client	1413697536/4/0	       -1    0 /5gkc10wnagprk		    A	    0	     105
   961 TEST	  JDBC Thin Client	 SQL*Net more data from client	1413697536/1/0	       -1    0 5gkc10wnagprk/5gkc10wnagprk  A	    0	     105
  1249 TEST	  JDBC Thin Client	 SQL*Net more data from client	1413697536/5/0	       -1    0 dy4rgmf46x1q4/dy4rgmf46x1q4  A	    0	     105

[oracle@userdb-pre admin]$ ora active
 17:23:15 up 1 day,  1:15,  5 users,  load average: 3.07, 1.02, 0.36

   SID USERNAME   MACHINE		 EVENT				PARAM			W   WT SQL			    ST	   LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
   961 TEST	  JDBC Thin Client	 SQL*Net message from client	1413697536/1/0		3    0 5gkc10wnagprk/5gkc10wnagprk  A	    0	     105
     1 TEST	  JDBC Thin Client	 SQL*Net message from client	1413697536/1/0		2    0 dy4rgmf46x1q4/dy4rgmf46x1q4  A	    0	     104

[oracle@userdb-pre admin]$ ora active
 17:23:49 up 1 day,  1:15,  4 users,  load average: 3.68, 1.35, 0.49

   SID USERNAME   MACHINE		 EVENT				PARAM			W   WT SQL			    ST	   LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
  1025 TEST	  JDBC Thin Client	 SQL*Net message from client	1413697536/1/0		1    0 d1yrz7run4p1s/d1yrz7run4p1s  A	    0	     139

Step 2. Halt database machine (将63 database直接halt关闭,模拟机器crash)

BP9OJX}OU[ZIT2MQ[98[5LA

观察10.0.0.24 (standby 将被dgha自动切换至primary,redo,controlfile将被copy至规定目录)

alert_log from 10.0.0.24

Thu Aug 15 17:37:50 2013
ALTER DATABASE   MOUNT
Changing di2dbun from dgha to dgha1
Successful mount of redo thread 1, with mount id 3944140846
Database mounted in Exclusive Mode
Lost write protection disabled
Completed: ALTER DATABASE   MOUNT
Thu Aug 15 17:37:55 2013
ALTER DATABASE RECOVER  database
Media Recovery Start
 started logmerger process
Parallel Media Recovery started with 48 slaves
Thu Aug 15 17:37:56 2013
Recovery of Online Redo Log: Thread 1 Group 3 Seq 60 Reading mem 0
  Mem# 0: /data1/dgha/redo03.log
Media Recovery Complete (dgha)
Completed: ALTER DATABASE RECOVER  database
alter database open
Beginning crash recovery of 1 threads
 parallel recovery started with 32 processes
Started redo scan
Completed redo scan
 read 779943 KB redo, 0 data blocks need recovery
Started redo application at
 Thread 1: logseq 58, block 35644
Recovery of Online Redo Log: Thread 1 Group 1 Seq 58 Reading mem 0
  Mem# 0: /data1/dgha/redo01.log
Recovery of Online Redo Log: Thread 1 Group 2 Seq 59 Reading mem 0
  Mem# 0: /data1/dgha/redo02.log
Recovery of Online Redo Log: Thread 1 Group 3 Seq 60 Reading mem 0
  Mem# 0: /data1/dgha/redo03.log
Completed redo application of 0.00MB
Completed crash recovery at
 Thread 1: logseq 60, block 31, scn 2500652
 0 data blocks read, 0 data blocks written, 779943 redo k-bytes read
LGWR: STARTING ARCH PROCESSES
...

Completed: ALTER DATABASE   MOUNT
Thu Aug 15 17:37:55 2013
ALTER DATABASE RECOVER  database
Media Recovery Start
 started logmerger process
Parallel Media Recovery started with 48 slaves
Thu Aug 15 17:37:56 2013
Recovery of Online Redo Log: Thread 1 Group 3 Seq 60 Reading mem 0
  Mem# 0: /data1/dgha/redo03.log
Media Recovery Complete (dgha)
Completed: ALTER DATABASE RECOVER  database
alter database open
Beginning crash recovery of 1 threads
 parallel recovery started with 32 processes
Started redo scan
Completed redo scan
 read 779943 KB redo, 0 data blocks need recovery
Started redo application at
 Thread 1: logseq 58, block 35644
Recovery of Online Redo Log: Thread 1 Group 1 Seq 58 Reading mem 0
  Mem# 0: /data1/dgha/redo01.log
Recovery of Online Redo Log: Thread 1 Group 2 Seq 59 Reading mem 0
  Mem# 0: /data1/dgha/redo02.log
Recovery of Online Redo Log: Thread 1 Group 3 Seq 60 Reading mem 0
  Mem# 0: /data1/dgha/redo03.log
Completed redo application of 0.00MB
Completed crash recovery at
 Thread 1: logseq 60, block 31, scn 2500652
 0 data blocks read, 0 data blocks written, 779943 redo k-bytes read
LGWR: STARTING ARCH PROCESSES
...

Archived Log entry 191 added for thread 1 sequence 61 ID 0xeb14a3a4 dest 1:
ARC3: Standby redo logfile selected for thread 1 sequence 60 for destination LOG_ARCHIVE_DEST_3
******************************************************************
LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2
******************************************************************
******************************************************************
LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_3
******************************************************************
LNS: Standby redo logfile selected for thread 1 sequence 62 for destination LOG_ARCHIVE_DEST_3
..


Completed: alter database open

数据库被recover到了最新的时间点(datafile scn领先于63 database,63此时可以看成24过去某个点的状态)


Step 3. third standby database (10.0.0.13)

alert_log from 10.0.0.13

Error 12543 received logging on to the standby
FAL[client, USER]: Error 12543 connecting to dgha for fetching gap sequence
Thu Aug 15 17:38:14 2013
Media Recovery Log /u01/oracle/11.2.0/oracle/product/11203/db1/dbs/arch1_61_823427815.dbf
Media Recovery Waiting for thread 1 sequence 62 (in transit)
Recovery of Online Redo Log: Thread 1 Group 4 Seq 62 Reading mem 0
  Mem# 0: /u01/oracle/oradata/dgha/standby04.log
Thu Aug 15 17:39:56 2013
Archived Log entry 62 added for thread 1 sequence 62 ID 0xeb14a3a4 dest 1:
Thu Aug 15 17:39:56 2013
RFS[14]: Selected log 4 for thread 1 sequence 63 dbid -350937948 branch 823427815
Thu Aug 15 17:39:56 2013
Media Recovery Waiting for thread 1 sequence 63 (in transit)
Recovery of Online Redo Log: Thread 1 Group 4 Seq 63 Reading mem 0
  Mem# 0: /u01/oracle/oradata/dgha/standby04.log
Thu Aug 15 17:40:08 2013
Archived Log entry 63 added for thread 1 sequence 63 ID 0xeb14a3a4 dest 1:
Thu Aug 15 17:40:08 2013
RFS[14]: Selected log 4 for thread 1 sequence 64 dbid -350937948 branch 823427815

可以看到 primary 自动切换为了10.0.0.24,13继续recover database.

Step 4. restart 63 database and join the dg cluster (24 is new primary database and 63 is standby database)

(1).copy standby controlfile from 24 database
(2).replace 63’s controlfile using new standby controlfile
(3).recreate standby logfiles
(4).alter database recover managed standby database;


alert_log:

..

Media Recovery Log /u01/oracle/arch/1_66_823427815.dbf
Media Recovery Waiting for thread 1 sequence 67 (in transit)

..
Physical standby database opened for read only access.
Completed: alter database open read only

Step 5. check data between 24 and 63 databases

from 24 database

SQL> select count(*) from test.CUSTOMERS              ;
select count(*) from test.WAREHOUSES             ;
select count(*) from test.ORDER_ITEMS            ;
select count(*) from test.ORDERS                 ;
select count(*) from test.INVENTORIES            ;
select count(*) from test.PRODUCT_INFORMATION    ;
select count(*) from test.LOGON                  ;
select count(*) from test.PRODUCT_DESCRIPTIONS   ;
select count(*) from test.ORDERENTRY_METADATA    ;

  COUNT(*)
----------
  39999996

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
 134986840

SQL> 
  COUNT(*)
----------
  45000000

SQL> 
  COUNT(*)
----------
    901565

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
   9999996

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
	 0

from 63 database

SQL> select count(*) from test.CUSTOMERS              ;
select count(*) from test.WAREHOUSES             ;
select count(*) from test.ORDER_ITEMS            ;
select count(*) from test.ORDERS                 ;
select count(*) from test.INVENTORIES            ;
select count(*) from test.PRODUCT_INFORMATION    ;
select count(*) from test.LOGON                  ;
select count(*) from test.PRODUCT_DESCRIPTIONS   ;
select count(*) from test.ORDERENTRY_METADATA    ;

  COUNT(*)
----------
  39999996

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
 134986840

SQL> 
  COUNT(*)
----------
  45000000

SQL> 
  COUNT(*)
----------
    901565

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
   9999996

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
	 0

perfect ! no rows lost.

—————————————————————————

下面我们来剖析一下dgha的行为

Thu Aug 15 17:25:57 2013 - [info] oraha start,welcome to use this tool...
Thu Aug 15 17:25:57 2013 - [info] try to check (10.0.0.63:1521:dgha) and (10.0.0.24:1521:dgha) are really cluster or not.
Thu Aug 15 17:25:57 2013 - [info] try ssh to standby(10.0.0.24,1521,dgha)...
Thu Aug 15 17:25:57 2013 - [info] ssh to 10.0.0.24 success.
Thu Aug 15 17:25:58 2013 - [debug] standby query result: (DGHA,PHYSICAL STANDBY,dgha1)
Thu Aug 15 17:25:58 2013 - [info] try connect to primary(10.0.0.63,1521,dgha)...
Thu Aug 15 17:25:58 2013 - [debug] primary check query sql: (select count(*) cnt from V$ARCHIVE_DEST A,v$database B where A.status='VALID' AND A.DB_UNIQUE_NAME='dgha1' AND A.TARGET='STANDBY' AND B.NAME='DGHA' AND B.DATABASE_ROLE='PRIMARY')
Thu Aug 15 17:25:58 2013 - [info] Attemp to connect to (10.0.0.63:1521:dgha)...
Thu Aug 15 17:25:59 2013 - [info] connect to (10.0.0.63:1521:dgha) success.
Thu Aug 15 17:25:59 2013 - [debug] primary query result rn_cnt: 1
Thu Aug 15 17:25:59 2013 - [info] (10.0.0.63:1521:dgha) and (10.0.0.24:1521:dgha) are really cluster.
Thu Aug 15 17:25:59 2013 - [info] the primary database is (10.0.0.63:1521:dgha)
Thu Aug 15 17:25:59 2013 - [info] the standby database is (10.0.0.24:1521:dgha)
Thu Aug 15 17:25:59 2013 - [info] ssh to 10.0.0.63 success.
Thu Aug 15 17:25:59 2013 - [info] ssh to 10.0.0.24 success.
Thu Aug 15 17:26:03 2013 - [info] ssh to 192.168.254.63 success.
Thu Aug 15 17:26:03 2013 - [info] ssh check passed.
Thu Aug 15 17:26:03 2013 - [info] (10.0.0.63:1521:dgha) /home/oracle/oraha/db_stop.sh check passed.
Thu Aug 15 17:26:03 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step01.sh check passed.
Thu Aug 15 17:26:03 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step02.pl check passed.
Thu Aug 15 17:26:03 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step03.sh check passed.
Thu Aug 15 17:26:04 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_check.sh check passed.
Thu Aug 15 17:26:04 2013 - [info] scripts under /home/oracle/oraha check passed.
Thu Aug 15 17:26:04 2013 - [info] I will wait until (10.0.0.63:1521:dgha) unreachable.health check interval:8
Thu Aug 15 17:26:04 2013 - [info] ======================================================
Thu Aug 15 17:26:04 2013 - [info] ----------             welcome              ----------
Thu Aug 15 17:26:04 2013 - [info] ======================================================
Thu Aug 15 17:26:04 2013 - [debug] connect to (10.0.0.63:1521:dgha) error_count:0
Thu Aug 15 17:31:50 2013 - [debug] UPDATE HACHECK success.
Thu Aug 15 17:31:58 2013 - [debug] connect to (10.0.0.63:1521:dgha) error_count:0

DGHA 会定期去update一张check表判断instance的存活

Thu Aug 15 17:32:03 2013 - [warning] connect to (10.0.0.63:1521:dgha) was abort! connect again.
Thu Aug 15 17:32:03 2013 - [debug] update check == 1.undef_count is 0
Thu Aug 15 17:32:03 2013 - [info] I will destory dbh.
Issuing rollback() due to DESTROY without explicit disconnect() of DBD::Oracle::db handle (DESCRIPTION=(ADDRESS=(HOST=10.0.0.63)(PROTOCOL=tcp)(PORT=1521))(CONNECT_DATA=(SID=dgha))) at oraha.pl line 509.
DBI db handle 0x1efbf710 cleared whilst still active at oraha.pl line 509.
DBI db handle 0x1efbf710 has uncleared implementors data at oraha.pl line 509.
    dbih_clearcom (dbh 0x1efbf710, com 0x1ee4d4f0, imp DBD::Oracle::db):
       FLAGS 0x182097: COMSET IMPSET Active Warn RaiseError PrintWarn ShowErrorStatement 
       PARENT DBI::dr=HASH(0x1ef1df90)
       KIDS 0 (0 Active)
Thu Aug 15 17:32:08 2013 - [warning] destory dbh  was abort!
Thu Aug 15 17:32:08 2013 - [info] Attemp to connect to (10.0.0.63:1521:dgha)...
Thu Aug 15 17:32:13 2013 - [error][oraha.pl, ln426] connect to  (10.0.0.63:1521:dgha) timeout
Thu Aug 15 17:32:13 2013 - [warning] connect to (10.0.0.63:1521:dgha) error_count:1
Thu Aug 15 17:32:21 2013 - [debug] connect to (10.0.0.63:1521:dgha) error_count:1
Thu Aug 15 17:32:21 2013 - [debug] the dbh is null.
Thu Aug 15 17:32:21 2013 - [info] Attemp to connect to (10.0.0.63:1521:dgha)...
Thu Aug 15 17:32:26 2013 - [error][oraha.pl, ln426] connect to  (10.0.0.63:1521:dgha) timeout
Thu Aug 15 17:32:26 2013 - [warning] connect to (10.0.0.63:1521:dgha) was abort! connect again.
Thu Aug 15 17:32:26 2013 - [debug] update check == 1.undef_count is 1
Thu Aug 15 17:32:26 2013 - [info] I will destory dbh.
Thu Aug 15 17:32:26 2013 - [warning] connect to (10.0.0.63:1521:dgha) error_count:2
Thu Aug 15 17:32:34 2013 - [debug] connect to (10.0.0.63:1521:dgha) error_count:2
Thu Aug 15 17:32:34 2013 - [debug] the dbh is null.
Thu Aug 15 17:32:34 2013 - [info] Attemp to connect to (10.0.0.63:1521:dgha)...
Thu Aug 15 17:32:39 2013 - [error][oraha.pl, ln426] connect to  (10.0.0.63:1521:dgha) timeout
Thu Aug 15 17:32:39 2013 - [warning] connect to (10.0.0.63:1521:dgha) was abort! connect again.
Thu Aug 15 17:32:39 2013 - [debug] update check == 1.undef_count is 1
Thu Aug 15 17:32:39 2013 - [info] I will destory dbh.
Thu Aug 15 17:32:39 2013 - [warning] connect to (10.0.0.63:1521:dgha) error_count:3
Thu Aug 15 17:32:39 2013 - [info] try connect to the edm...
Thu Aug 15 17:32:39 2013 - [info] send message:insert into edm_user.tb_queue(ID,PHONE,MSG,STATUS,SENDLEVEL,svrtype,INSERTTIME) 
                                values(edm_user.SEQ_QUE.NEXTVAL,'13817963180','the primary(10.0.0.63:1521:dgha) is unreachable.','',1,'11',sysdate)
Thu Aug 15 17:32:39 2013 - [info] send message:insert into edm_user.tb_queue(ID,PHONE,MSG,STATUS,SENDLEVEL,svrtype,INSERTTIME) 
                                values(edm_user.SEQ_QUE.NEXTVAL,'18666668061','the primary(10.0.0.63:1521:dgha) is unreachable.','',1,'11',sysdate)
Thu Aug 15 17:32:47 2013 - [warning] ssh to 10.0.0.63 err. error_count:1
ssh: connect to host 10.0.0.63 port 22: No route to host
Thu Aug 15 17:32:50 2013 - [warning] ssh to 10.0.0.63 err. error_count:2
ssh: connect to host 10.0.0.63 port 22: No route to host
Thu Aug 15 17:32:53 2013 - [warning] ssh to 10.0.0.63 err. error_count:3
Thu Aug 15 17:32:54 2013 - [info] ssh to 10.0.0.24 success.
Thu Aug 15 17:32:57 2013 - [info] ssh to 192.168.254.63 success.

DGHA 在update多次之后发现oracle死亡,开始尝试ssh到63同样不通(将会连接至63的远程管理卡poweroff掉该机器)同理将会查看待选standby状态,如果可以切换将会启动切换程序,同时发送短信告知63已经死亡。

QQ图片20130815221021

Thu Aug 15 17:32:57 2013 - [info] try to check (10.0.0.24:1521:dgha) is ready to failover to primary or not.
Thu Aug 15 17:32:57 2013 - [info] the standby open mode is:MOUNTED. database role is:PHYSICAL STANDBY.
Thu Aug 15 17:32:57 2013 - [info] standby lag minutes is: 0
Thu Aug 15 17:32:57 2013 - [info] standby (10.0.0.24:1521:dgha) healthcheck is passed. Go on.
Thu Aug 15 17:32:57 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step01.sh check passed.
Thu Aug 15 17:32:58 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step02.pl check passed.
Thu Aug 15 17:32:58 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step03.sh check passed.
Thu Aug 15 17:32:58 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_check.sh check passed.
Thu Aug 15 17:32:58 2013 - [info] standby (10.0.0.24:1521:dgha) scripts under /home/oracle/oraha check passed.
Thu Aug 15 17:32:58 2013 - [info] Power of 10.0.0.63  was successfully turned off.
Thu Aug 15 17:32:58 2013 - [info] I will execute /home/oracle/oraha/db_start_step01.sh on 10.0.0.24
Thu Aug 15 17:35:58 2013 - [debug] db_start output: 
SQL*Plus: Release 11.2.0.3.0 Production on Thu Aug 15 17:32:58 2013

Copyright (c) 1982, 2011, Oracle.  All rights reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> 
Thu Aug 15 17:35:58 2013 - [error][oraha.pl, ln180] Oracle on 10.0.0.24 shutdown failed.Kill the pmon next.
Thu Aug 15 17:35:58 2013 - [debug] Oracle pmon pid is: 10179
Thu Aug 15 17:35:58 2013 - [debug] kill -9  output: 
Thu Aug 15 17:36:08 2013 - [info] I will execute /home/oracle/oraha/db_start_step02.pl on 10.0.0.24
Thu Aug 15 17:37:47 2013 - [debug] db_start output: 
Thu Aug 15 17:37:47 2013 - [info] I will execute /home/oracle/oraha/db_start_step03.sh on 10.0.0.24
Thu Aug 15 17:38:06 2013 - [debug] db_start output: 
SQL*Plus: Release 11.2.0.3.0 Production on Thu Aug 15 17:37:47 2013

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> ORACLE instance started.

Total System Global Area 4.0486E+10 bytes
Fixed Size                  2237088 bytes
Variable Size            4026535264 bytes
Database Buffers         3.6373E+10 bytes
Redo Buffers               84631552 bytes
Database mounted.
SQL> Media recovery complete.
SQL> 
Database altered.

SQL> Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Thu Aug 15 17:38:08 2013 - [info] try to check the new primary(10.0.0.24,1521,dgha) after failover .
Thu Aug 15 17:38:08 2013 - [info] try connect to the new primary(10.0.0.24,1521,dgha)...
Thu Aug 15 17:38:08 2013 - [info] Attemp to connect to (10.0.0.24:1521:dgha)...
Thu Aug 15 17:38:08 2013 - [info] connect to (10.0.0.24:1521:dgha) success.
Thu Aug 15 17:38:08 2013 - [info] the new primary open mode is:READ WRITE. database role is:PRIMARY.
Thu Aug 15 17:38:08 2013 - [info] (10.0.0.63:1521:dgha) failover to (10.0.0.24:1521:dgha) successful.
Thu Aug 15 17:38:08 2013 - [info] try connect to the edm...
Thu Aug 15 17:38:08 2013 - [info] send message:insert into edm_user.tb_queue(ID,PHONE,MSG,STATUS,SENDLEVEL,svrtype,INSERTTIME) 
                                values(edm_user.SEQ_QUE.NEXTVAL,'13817963180','(10.0.0.63:1521:dgha) failover to (10.0.0.24:1521:dgha) successful.','',1,'11',sysdate)
Thu Aug 15 17:38:08 2013 - [info] send message:insert into edm_user.tb_queue(ID,PHONE,MSG,STATUS,SENDLEVEL,svrtype,INSERTTIME) 
                                values(edm_user.SEQ_QUE.NEXTVAL,'18666668061','(10.0.0.63:1521:dgha) failover to (10.0.0.24:1521:dgha) successful.','',1,'11',sysdate)
Killed by signal 1.
Killed by signal 1.

使用存储redo recover完毕后同样会发出短信告知并退出。

QQ图片20130815221008

以下附各种场景的切换测试:

1. 模拟instance hang (oradebug frozen instance) –>当前连接全部hang死,新建process去update check表,两种情况:如果update成功,不切换,如果hang导致update不成功,切换。但是之前的process会处于”僵尸”态。

2. 模拟listener hang (kill -STOP pid of listener and kill active session) 新建process无法update timeout超时–>ssh主机执行abort—>连接至远程卡powerff掉主机—>切换至备选standby.

3. 模拟机器down (直接halt机器,update,ssh均失败,判断死亡,直接切换)

4. 网卡down(处理方式类1)
1,4的处理方式会出现之前dgha建立的连接无法回收,采用在本地部署脚本针对dgha的机器名,idle时间做定期清理。

5. 新的primary切换完成后,就的primary如果需要重新加入集群(copy standby controlfile from new primary,standby logfile需要重建)