This post publishes the test of DGHA, an automatic switchover script for Oracle Data Guard. The script keeps the redo logs and controlfile on shared storage, so a failover loses zero data.
After the switchover, the original primary database can seamlessly rejoin the whole DG cluster, which turns Data Guard into a high-availability solution.
The tool is custom-developed in Perl and can be deployed on any machine that can reach the cluster (preferably not a machine inside the cluster itself as the monitor). Multiple layers of checks make the switchover more reliable and accurate. Thanks to the author, noodba. As the department's data architect, noodba also custom-developed myawr (a MySQL performance analysis tool).
Test environment:
10.0.0.63 (primary database)
10.0.0.24 (standby database)
10.0.0.13 (standby database)
The switchover sites are designated as 63 <-> 24; 13 remains a permanent standby.
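Before the test, the role of each of the three databases can be confirmed with a query like the following on every node (purely illustrative; not part of DGHA):

SELECT db_unique_name, database_role, open_mode FROM v$database;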
Step 1. Use swingbench as the load-testing tool (10 GB of data loaded)
[oracle@userdb-pre admin]$ ora active
 17:23:15 up 1 day, 1:15, 5 users, load average: 2.99, 0.97, 0.34
   SID USERNAME   MACHINE                EVENT                          PARAM                   W   WT SQL                          ST     LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
  1089 TEST       JDBC Thin Client       SQL*Net more data from client  1413697536/4/0         -1    0 /5gkc10wnagprk               A       0        105
   961 TEST       JDBC Thin Client       SQL*Net more data from client  1413697536/1/0         -1    0 5gkc10wnagprk/5gkc10wnagprk  A       0        105
  1249 TEST       JDBC Thin Client       SQL*Net more data from client  1413697536/5/0         -1    0 dy4rgmf46x1q4/dy4rgmf46x1q4  A       0        105

[oracle@userdb-pre admin]$ ora active
 17:23:15 up 1 day, 1:15, 5 users, load average: 3.07, 1.02, 0.36
   SID USERNAME   MACHINE                EVENT                          PARAM                   W   WT SQL                          ST     LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
   961 TEST       JDBC Thin Client       SQL*Net message from client    1413697536/1/0          3    0 5gkc10wnagprk/5gkc10wnagprk  A       0        105
     1 TEST       JDBC Thin Client       SQL*Net message from client    1413697536/1/0          2    0 dy4rgmf46x1q4/dy4rgmf46x1q4  A       0        104

[oracle@userdb-pre admin]$ ora active
 17:23:49 up 1 day, 1:15, 4 users, load average: 3.68, 1.35, 0.49
   SID USERNAME   MACHINE                EVENT                          PARAM                   W   WT SQL                          ST     LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
  1025 TEST       JDBC Thin Client       SQL*Net message from client    1413697536/1/0          1    0 d1yrz7run4p1s/d1yrz7run4p1s  A       0        139
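ora active here is a local alias/script on the test machine; roughly the same picture can be pulled from v$session with a query along these lines (an approximation, not the actual alias):

-- active foreground sessions, their current wait event and SQL
SELECT sid, username, machine, event, sql_id, prev_sql_id, last_call_et
  FROM v$session
 WHERE status = 'ACTIVE'
   AND type = 'USER'
   AND username IS NOT NULL;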
Step 2. Halt database machine (halt the 63 database server directly to simulate a machine crash)
Watch 10.0.0.24 (DGHA will automatically fail the standby over to primary; the redo logs and controlfile will be copied to the designated directory).
alert_log from 10.0.0.24
Thu Aug 15 17:37:50 2013
ALTER DATABASE MOUNT
Changing di2dbun from dgha to dgha1
Successful mount of redo thread 1, with mount id 3944140846
Database mounted in Exclusive Mode
Lost write protection disabled
Completed: ALTER DATABASE MOUNT
Thu Aug 15 17:37:55 2013
ALTER DATABASE RECOVER database
Media Recovery Start
 started logmerger process
Parallel Media Recovery started with 48 slaves
Thu Aug 15 17:37:56 2013
Recovery of Online Redo Log: Thread 1 Group 3 Seq 60 Reading mem 0
  Mem# 0: /data1/dgha/redo03.log
Media Recovery Complete (dgha)
Completed: ALTER DATABASE RECOVER database
alter database open
Beginning crash recovery of 1 threads
 parallel recovery started with 32 processes
Started redo scan
Completed redo scan
 read 779943 KB redo, 0 data blocks need recovery
Started redo application at
 Thread 1: logseq 58, block 35644
Recovery of Online Redo Log: Thread 1 Group 1 Seq 58 Reading mem 0
  Mem# 0: /data1/dgha/redo01.log
Recovery of Online Redo Log: Thread 1 Group 2 Seq 59 Reading mem 0
  Mem# 0: /data1/dgha/redo02.log
Recovery of Online Redo Log: Thread 1 Group 3 Seq 60 Reading mem 0
  Mem# 0: /data1/dgha/redo03.log
Completed redo application of 0.00MB
Completed crash recovery at
 Thread 1: logseq 60, block 31, scn 2500652
 0 data blocks read, 0 data blocks written, 779943 redo k-bytes read
LGWR: STARTING ARCH PROCESSES
...
Archived Log entry 191 added for thread 1 sequence 61 ID 0xeb14a3a4 dest 1:
ARC3: Standby redo logfile selected for thread 1 sequence 60 for destination LOG_ARCHIVE_DEST_3
******************************************************************
LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2
******************************************************************
******************************************************************
LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_3
******************************************************************
LNS: Standby redo logfile selected for thread 1 sequence 62 for destination LOG_ARCHIVE_DEST_3
..
Completed: alter database open
The database was recovered to the latest point in time (its datafile SCN is now ahead of the 63 database; 63 can now be regarded as the state of 24 at some point in the past).
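As an illustration (not part of DGHA), the SCN relationship can later be verified by comparing v$database on both sides once 63 is reachable again:

SELECT current_scn, database_role FROM v$database;   -- run on 24 (new primary) and on 63 once it is mounted again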
Step 3. third standby database (10.0.0.13)
alert_log from 10.0.0.13
Error 12543 received logging on to the standby
FAL[client, USER]: Error 12543 connecting to dgha for fetching gap sequence
Thu Aug 15 17:38:14 2013
Media Recovery Log /u01/oracle/11.2.0/oracle/product/11203/db1/dbs/arch1_61_823427815.dbf
Media Recovery Waiting for thread 1 sequence 62 (in transit)
Recovery of Online Redo Log: Thread 1 Group 4 Seq 62 Reading mem 0
  Mem# 0: /u01/oracle/oradata/dgha/standby04.log
Thu Aug 15 17:39:56 2013
Archived Log entry 62 added for thread 1 sequence 62 ID 0xeb14a3a4 dest 1:
Thu Aug 15 17:39:56 2013
RFS[14]: Selected log 4 for thread 1 sequence 63 dbid -350937948 branch 823427815
Thu Aug 15 17:39:56 2013
Media Recovery Waiting for thread 1 sequence 63 (in transit)
Recovery of Online Redo Log: Thread 1 Group 4 Seq 63 Reading mem 0
  Mem# 0: /u01/oracle/oradata/dgha/standby04.log
Thu Aug 15 17:40:08 2013
Archived Log entry 63 added for thread 1 sequence 63 ID 0xeb14a3a4 dest 1:
Thu Aug 15 17:40:08 2013
RFS[14]: Selected log 4 for thread 1 sequence 64 dbid -350937948 branch 823427815
As you can see, the primary has automatically switched to 10.0.0.24, and 13 continues to recover the database.
Step 4. Restart the 63 database and rejoin the DG cluster (24 is the new primary database and 63 becomes a standby database)
(1). Copy a standby controlfile from the 24 database.
(2). Replace 63's controlfile with the new standby controlfile.
(3). Recreate the standby logfiles.
(4). alter database recover managed standby database; (a sketch of these steps follows)
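A minimal SQL*Plus sketch of these four steps; the file path, group number and size below are illustrative assumptions, not values taken from this test environment:

-- on the new primary (24): create a standby controlfile, then copy it over 63's controlfiles at the OS level
ALTER DATABASE CREATE STANDBY CONTROLFILE AS '/tmp/standby.ctl';

-- on 63: mount with the copied standby controlfile
STARTUP MOUNT

-- recreate the standby logfiles (the size should match the online redo logs)
ALTER DATABASE ADD STANDBY LOGFILE GROUP 4 ('/u01/oracle/oradata/dgha/standby04.log') SIZE 512M;

-- start managed recovery
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;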
alert_log:
..
Media Recovery Log /u01/oracle/arch/1_66_823427815.dbf
Media Recovery Waiting for thread 1 sequence 67 (in transit)
..
Physical standby database opened for read only access.
Completed: alter database open read only
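To double-check that 63 is applying redo as a standby again, queries like these can be run on 63 (illustrative only; not part of DGHA):

-- recovery processes and the sequence currently being applied
SELECT process, status, thread#, sequence# FROM v$managed_standby;

-- role and open mode after the rejoin
SELECT database_role, open_mode FROM v$database;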
Step 5. check data between 24 and 63 databases
from 24 database
SQL> select count(*) from test.CUSTOMERS ;
     select count(*) from test.WAREHOUSES ;
     select count(*) from test.ORDER_ITEMS ;
     select count(*) from test.ORDERS ;
     select count(*) from test.INVENTORIES ;
     select count(*) from test.PRODUCT_INFORMATION ;
     select count(*) from test.LOGON ;
     select count(*) from test.PRODUCT_DESCRIPTIONS ;
     select count(*) from test.ORDERENTRY_METADATA ;

  COUNT(*)
----------
  39999996

SQL>
  COUNT(*)
----------
      1000

SQL>
  COUNT(*)
----------
 134986840

SQL>
  COUNT(*)
----------
  45000000

SQL>
  COUNT(*)
----------
    901565

SQL>
  COUNT(*)
----------
      1000

SQL>
  COUNT(*)
----------
   9999996

SQL>
  COUNT(*)
----------
      1000

SQL>
  COUNT(*)
----------
         0
from 63 database
SQL> select count(*) from test.CUSTOMERS ;
     select count(*) from test.WAREHOUSES ;
     select count(*) from test.ORDER_ITEMS ;
     select count(*) from test.ORDERS ;
     select count(*) from test.INVENTORIES ;
     select count(*) from test.PRODUCT_INFORMATION ;
     select count(*) from test.LOGON ;
     select count(*) from test.PRODUCT_DESCRIPTIONS ;
     select count(*) from test.ORDERENTRY_METADATA ;

  COUNT(*)
----------
  39999996

SQL>
  COUNT(*)
----------
      1000

SQL>
  COUNT(*)
----------
 134986840

SQL>
  COUNT(*)
----------
  45000000

SQL>
  COUNT(*)
----------
    901565

SQL>
  COUNT(*)
----------
      1000

SQL>
  COUNT(*)
----------
   9999996

SQL>
  COUNT(*)
----------
      1000

SQL>
  COUNT(*)
----------
         0
Perfect! No rows lost.
—————————————————————————
Now let's dissect DGHA's behavior.
Thu Aug 15 17:25:57 2013 - [info] oraha start,welcome to use this tool...
Thu Aug 15 17:25:57 2013 - [info] try to check (10.0.0.63:1521:dgha) and (10.0.0.24:1521:dgha) are really cluster or not.
Thu Aug 15 17:25:57 2013 - [info] try ssh to standby(10.0.0.24,1521,dgha)...
Thu Aug 15 17:25:57 2013 - [info] ssh to 10.0.0.24 success.
Thu Aug 15 17:25:58 2013 - [debug] standby query result: (DGHA,PHYSICAL STANDBY,dgha1)
Thu Aug 15 17:25:58 2013 - [info] try connect to primary(10.0.0.63,1521,dgha)...
Thu Aug 15 17:25:58 2013 - [debug] primary check query sql: (select count(*) cnt from V$ARCHIVE_DEST A,v$database B where A.status='VALID' AND A.DB_UNIQUE_NAME='dgha1' AND A.TARGET='STANDBY' AND B.NAME='DGHA' AND B.DATABASE_ROLE='PRIMARY')
Thu Aug 15 17:25:58 2013 - [info] Attemp to connect to (10.0.0.63:1521:dgha)...
Thu Aug 15 17:25:59 2013 - [info] connect to (10.0.0.63:1521:dgha) success.
Thu Aug 15 17:25:59 2013 - [debug] primary query result rn_cnt: 1
Thu Aug 15 17:25:59 2013 - [info] (10.0.0.63:1521:dgha) and (10.0.0.24:1521:dgha) are really cluster.
Thu Aug 15 17:25:59 2013 - [info] the primary database is (10.0.0.63:1521:dgha)
Thu Aug 15 17:25:59 2013 - [info] the standby database is (10.0.0.24:1521:dgha)
Thu Aug 15 17:25:59 2013 - [info] ssh to 10.0.0.63 success.
Thu Aug 15 17:25:59 2013 - [info] ssh to 10.0.0.24 success.
Thu Aug 15 17:26:03 2013 - [info] ssh to 192.168.254.63 success.
Thu Aug 15 17:26:03 2013 - [info] ssh check passed.
Thu Aug 15 17:26:03 2013 - [info] (10.0.0.63:1521:dgha) /home/oracle/oraha/db_stop.sh check passed.
Thu Aug 15 17:26:03 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step01.sh check passed.
Thu Aug 15 17:26:03 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step02.pl check passed.
Thu Aug 15 17:26:03 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step03.sh check passed.
Thu Aug 15 17:26:04 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_check.sh check passed.
Thu Aug 15 17:26:04 2013 - [info] scripts under /home/oracle/oraha check passed.
Thu Aug 15 17:26:04 2013 - [info] I will wait until (10.0.0.63:1521:dgha) unreachable.health check interval:8
Thu Aug 15 17:26:04 2013 - [info] ======================================================
Thu Aug 15 17:26:04 2013 - [info] ---------- welcome ----------
Thu Aug 15 17:26:04 2013 - [info] ======================================================
Thu Aug 15 17:26:04 2013 - [debug] connect to (10.0.0.63:1521:dgha) error_count:0
Thu Aug 15 17:31:50 2013 - [debug] UPDATE HACHECK success.
Thu Aug 15 17:31:58 2013 - [debug] connect to (10.0.0.63:1521:dgha) error_count:0
DGHA periodically updates a check table to determine whether the instance is still alive.
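Judging from the "UPDATE HACHECK success" entries in the log, the heartbeat presumably boils down to something like the following; the column names are assumptions, only the table name HACHECK appears in the output:

-- executed from a freshly created connection on each check interval
UPDATE hacheck SET check_time = SYSDATE WHERE id = 1;
COMMIT;
-- if this cannot complete before the timeout, the error counter is incremented;
-- several consecutive failures mark the primary as unreachable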
Thu Aug 15 17:32:03 2013 - [warning] connect to (10.0.0.63:1521:dgha) was abort! connect again.
Thu Aug 15 17:32:03 2013 - [debug] update check == 1.undef_count is 0
Thu Aug 15 17:32:03 2013 - [info] I will destory dbh.
Issuing rollback() due to DESTROY without explicit disconnect() of DBD::Oracle::db handle (DESCRIPTION=(ADDRESS=(HOST=10.0.0.63)(PROTOCOL=tcp)(PORT=1521))(CONNECT_DATA=(SID=dgha))) at oraha.pl line 509.
DBI db handle 0x1efbf710 cleared whilst still active at oraha.pl line 509.
DBI db handle 0x1efbf710 has uncleared implementors data at oraha.pl line 509.
dbih_clearcom (dbh 0x1efbf710, com 0x1ee4d4f0, imp DBD::Oracle::db):
   FLAGS 0x182097: COMSET IMPSET Active Warn RaiseError PrintWarn ShowErrorStatement
   PARENT DBI::dr=HASH(0x1ef1df90)
   KIDS 0 (0 Active)
Thu Aug 15 17:32:08 2013 - [warning] destory dbh was abort!
Thu Aug 15 17:32:08 2013 - [info] Attemp to connect to (10.0.0.63:1521:dgha)...
Thu Aug 15 17:32:13 2013 - [error][oraha.pl, ln426] connect to (10.0.0.63:1521:dgha) timeout
Thu Aug 15 17:32:13 2013 - [warning] connect to (10.0.0.63:1521:dgha) error_count:1
Thu Aug 15 17:32:21 2013 - [debug] connect to (10.0.0.63:1521:dgha) error_count:1
Thu Aug 15 17:32:21 2013 - [debug] the dbh is null.
Thu Aug 15 17:32:21 2013 - [info] Attemp to connect to (10.0.0.63:1521:dgha)...
Thu Aug 15 17:32:26 2013 - [error][oraha.pl, ln426] connect to (10.0.0.63:1521:dgha) timeout
Thu Aug 15 17:32:26 2013 - [warning] connect to (10.0.0.63:1521:dgha) was abort! connect again.
Thu Aug 15 17:32:26 2013 - [debug] update check == 1.undef_count is 1
Thu Aug 15 17:32:26 2013 - [info] I will destory dbh.
Thu Aug 15 17:32:26 2013 - [warning] connect to (10.0.0.63:1521:dgha) error_count:2
Thu Aug 15 17:32:34 2013 - [debug] connect to (10.0.0.63:1521:dgha) error_count:2
Thu Aug 15 17:32:34 2013 - [debug] the dbh is null.
Thu Aug 15 17:32:34 2013 - [info] Attemp to connect to (10.0.0.63:1521:dgha)...
Thu Aug 15 17:32:39 2013 - [error][oraha.pl, ln426] connect to (10.0.0.63:1521:dgha) timeout
Thu Aug 15 17:32:39 2013 - [warning] connect to (10.0.0.63:1521:dgha) was abort! connect again.
Thu Aug 15 17:32:39 2013 - [debug] update check == 1.undef_count is 1
Thu Aug 15 17:32:39 2013 - [info] I will destory dbh.
Thu Aug 15 17:32:39 2013 - [warning] connect to (10.0.0.63:1521:dgha) error_count:3
Thu Aug 15 17:32:39 2013 - [info] try connect to the edm...
Thu Aug 15 17:32:39 2013 - [info] send message:insert into edm_user.tb_queue(ID,PHONE,MSG,STATUS,SENDLEVEL,svrtype,INSERTTIME) values(edm_user.SEQ_QUE.NEXTVAL,'13817963180','the primary(10.0.0.63:1521:dgha) is unreachable.','',1,'11',sysdate)
Thu Aug 15 17:32:39 2013 - [info] send message:insert into edm_user.tb_queue(ID,PHONE,MSG,STATUS,SENDLEVEL,svrtype,INSERTTIME) values(edm_user.SEQ_QUE.NEXTVAL,'18666668061','the primary(10.0.0.63:1521:dgha) is unreachable.','',1,'11',sysdate)
Thu Aug 15 17:32:47 2013 - [warning] ssh to 10.0.0.63 err. error_count:1
ssh: connect to host 10.0.0.63 port 22: No route to host
Thu Aug 15 17:32:50 2013 - [warning] ssh to 10.0.0.63 err. error_count:2
ssh: connect to host 10.0.0.63 port 22: No route to host
Thu Aug 15 17:32:53 2013 - [warning] ssh to 10.0.0.63 err. error_count:3
Thu Aug 15 17:32:54 2013 - [info] ssh to 10.0.0.24 success.
Thu Aug 15 17:32:57 2013 - [info] ssh to 192.168.254.63 success.
After several failed update attempts, DGHA concludes that Oracle is dead. It then tries to ssh to 63, which also fails, so it connects to 63's remote management card and powers the machine off. It likewise examines the state of the candidate standby; if a failover is possible it starts the failover procedure and at the same time sends an SMS reporting that 63 is down.
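The standby readiness check that appears next in the log ("open mode", "database role", "lag minutes") corresponds roughly to queries such as these against the standby (an approximation of the script's checks, not its exact SQL):

SELECT open_mode, database_role FROM v$database;
SELECT value FROM v$dataguard_stats WHERE name = 'apply lag';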
Thu Aug 15 17:32:57 2013 - [info] try to check (10.0.0.24:1521:dgha) is ready to failover to primary or not.
Thu Aug 15 17:32:57 2013 - [info] the standby open mode is:MOUNTED. database role is:PHYSICAL STANDBY.
Thu Aug 15 17:32:57 2013 - [info] standby lag minutes is: 0
Thu Aug 15 17:32:57 2013 - [info] standby (10.0.0.24:1521:dgha) healthcheck is passed. Go on.
Thu Aug 15 17:32:57 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step01.sh check passed.
Thu Aug 15 17:32:58 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step02.pl check passed.
Thu Aug 15 17:32:58 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step03.sh check passed.
Thu Aug 15 17:32:58 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_check.sh check passed.
Thu Aug 15 17:32:58 2013 - [info] standby (10.0.0.24:1521:dgha) scripts under /home/oracle/oraha check passed.
Thu Aug 15 17:32:58 2013 - [info] Power of 10.0.0.63 was successfully turned off.
Thu Aug 15 17:32:58 2013 - [info] I will execute /home/oracle/oraha/db_start_step01.sh on 10.0.0.24
Thu Aug 15 17:35:58 2013 - [debug] db_start output:
SQL*Plus: Release 11.2.0.3.0 Production on Thu Aug 15 17:32:58 2013
Copyright (c) 1982, 2011, Oracle.  All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL>
Thu Aug 15 17:35:58 2013 - [error][oraha.pl, ln180] Oracle on 10.0.0.24 shutdown failed.Kill the pmon next.
Thu Aug 15 17:35:58 2013 - [debug] Oracle pmon pid is: 10179
Thu Aug 15 17:35:58 2013 - [debug] kill -9 output:
Thu Aug 15 17:36:08 2013 - [info] I will execute /home/oracle/oraha/db_start_step02.pl on 10.0.0.24
Thu Aug 15 17:37:47 2013 - [debug] db_start output:
Thu Aug 15 17:37:47 2013 - [info] I will execute /home/oracle/oraha/db_start_step03.sh on 10.0.0.24
Thu Aug 15 17:38:06 2013 - [debug] db_start output:
SQL*Plus: Release 11.2.0.3.0 Production on Thu Aug 15 17:37:47 2013
Copyright (c) 1982, 2011, Oracle.  All rights reserved.
Connected to an idle instance.
SQL> ORACLE instance started.
Total System Global Area 4.0486E+10 bytes
Fixed Size                  2237088 bytes
Variable Size            4026535264 bytes
Database Buffers         3.6373E+10 bytes
Redo Buffers               84631552 bytes
Database mounted.
SQL> Media recovery complete.
SQL> Database altered.
SQL> Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Thu Aug 15 17:38:08 2013 - [info] try to check the new primary(10.0.0.24,1521,dgha) after failover .
Thu Aug 15 17:38:08 2013 - [info] try connect to the new primary(10.0.0.24,1521,dgha)...
Thu Aug 15 17:38:08 2013 - [info] Attemp to connect to (10.0.0.24:1521:dgha)...
Thu Aug 15 17:38:08 2013 - [info] connect to (10.0.0.24:1521:dgha) success.
Thu Aug 15 17:38:08 2013 - [info] the new primary open mode is:READ WRITE. database role is:PRIMARY.
Thu Aug 15 17:38:08 2013 - [info] (10.0.0.63:1521:dgha) failover to (10.0.0.24:1521:dgha) successful.
Thu Aug 15 17:38:08 2013 - [info] try connect to the edm...
Thu Aug 15 17:38:08 2013 - [info] send message:insert into edm_user.tb_queue(ID,PHONE,MSG,STATUS,SENDLEVEL,svrtype,INSERTTIME) values(edm_user.SEQ_QUE.NEXTVAL,'13817963180','(10.0.0.63:1521:dgha) failover to (10.0.0.24:1521:dgha) successful.','',1,'11',sysdate)
Thu Aug 15 17:38:08 2013 - [info] send message:insert into edm_user.tb_queue(ID,PHONE,MSG,STATUS,SENDLEVEL,svrtype,INSERTTIME) values(edm_user.SEQ_QUE.NEXTVAL,'18666668061','(10.0.0.63:1521:dgha) failover to (10.0.0.24:1521:dgha) successful.','',1,'11',sysdate)
Killed by signal 1.
Killed by signal 1.
After the recovery using the redo on shared storage completes, DGHA likewise sends an SMS notification and exits.
Below are the failover tests for various scenarios:
1. Simulated instance hang (oradebug frozen instance) -> all existing connections hang. A new process is created to update the check table, with two possible outcomes: if the update succeeds, no failover; if the hang prevents the update, failover. The earlier processes, however, remain in a "zombie" state.
2. Simulated listener hang (kill -STOP pid of listener and kill active session) -> the new process cannot update the check table and times out -> ssh to the host and abort the instance -> connect to the remote management card and power off the host -> fail over to the candidate standby.
3. Simulated machine down (halt the machine directly; both the update and ssh fail, the host is judged dead, and failover proceeds immediately).
4. NIC down (handled similarly to scenario 1).
For scenarios 1 and 4, the connections previously established by DGHA cannot be reclaimed; a script deployed locally cleans them up periodically based on DGHA's machine name and idle time (see the SQL sketch after this list).
5. After the failover to the new primary completes, the old primary can rejoin the cluster if needed (copy the standby controlfile from the new primary; the standby logfiles have to be recreated).
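For the session cleanup mentioned under scenarios 1 and 4, the locally deployed script presumably runs something along these lines; the machine name and idle threshold are assumptions:

-- generate kill statements for stale sessions left behind by the DGHA monitor host
SELECT 'alter system kill session ''' || sid || ',' || serial# || ''' immediate;'
  FROM v$session
 WHERE machine = 'dgha-monitor'      -- hypothetical monitor machine name
   AND status = 'INACTIVE'
   AND last_call_et > 600;           -- idle longer than 10 minutes (assumed threshold)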