VMCD.ORG » Physical-standby standbylogfile checksum error

总结下今天DG碰到的问题，本来一个很简单的问题，被我们复杂化衍生出很多问题。

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 – 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> show parameter check

NAME TYPE VALUE
———————————— ———– ——————————
db_block_checking string FALSE
db_block_checksum string TRUE
log_checkpoint_interval integer 0
log_checkpoint_timeout integer 1800
log_checkpoints_to_alert boolean FALSE
SQL> show parameter filesystem

通过filesystemio_options设置DIO和ASYNC模式

NAME TYPE VALUE
———————————— ———– ——————————
filesystemio_options string SETALL
SQL>

SQL>alter database recover managed standby database disconnect from session using current logfile;

以下是alert.log信息：

Thu Dec 22 18:23:57 CST 2011
Waiting for all non-current ORLs to be archived…
Media Recovery Log /data/oracle/oradata/paystd/arch/1_3_770564180.dbf
MRP0: Background Media Recovery terminated with error 368
Thu Dec 22 18:23:57 CST 2011
Errors in file /data/oracle/oradata/paystd/bdump/paystd_mrp0_2983.trc:
ORA-00368: checksum error in redo log block
ORA-00353: log corruption near block 6144 change 175261 time 12/22/2011 17:00:09
ORA-00334: archived log: ‘/data/oracle/oradata/paystd/arch/1_3_770564180.dbf’

checksum检查发现错误，同事以前是做SUN的，认为是DIO导致的,修改filesystemio_options=async，之后我也没管，今天早上来问说正常了，我检查了下依然发现问题：

Errors in file /data/oracle/oradata/paystd/udump/paystd_rfs_5428.trc:
ORA-00367: checksum error in log file header
ORA-00315: log 5 of thread 0, wrong thread # 1 in header
ORA-00312: online log 5 thread 0: ‘/data/oracle/oradata/paystd/std1.log’

改成async依然不行？

检查system 设置：

[oracle@db57 bdump]$ cat /proc/slabinfo |grep kio
kioctx 43 72 320 12 1 : tunables 54 27 8 : slabdata 6 6 0
kiocb 11 30 256 15 1 : tunables 120 60 8 : slabdata 2 2 0
[oracle@db57 bdump]$

系统没有问题 async使用正常，至此我认为filesystemio_options并不是问题的根源，查看相关metalink文档：

Oracle Linux, Filesystem & I/O Type Supportability [ID 279069.1]

此图明确的指出了Oracle在linux5下所支持的IOtype（async sync dio) 并不会因为DG的原因所改变，在测试环境做了如下测试。

session1:

SQL> alter system set filesystemio_options = directio scope=spfile;

System altered.

SQL> shutdown abort;
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

Total System Global Area 218103808 bytes
Fixed Size 1218580 bytes
Variable Size 88082412 bytes
Database Buffers 121634816 bytes
Redo Buffers 7168000 bytes
Database mounted.
Database opened.

SQL> begin
2 for i in 1..10000 loop
3 insert into t1 values (i);
4 commit;
5 end loop;
6 end;
7 /

PL/SQL procedure successfully completed.

SQL> alter system switch logfile;

System altered.

SQL> alter system checkpoint;

System altered.

SQL>
SQL>
SQL> begin
2 for i in 1..1000 loop
3 insert into t1 values (i);
4 commit;
5 end loop;
6 end;
7 /

PL/SQL procedure successfully completed.

SQL> commit;

Commit complete.

SQL>
SQL>
SQL> alter system checkpoint;

System altered.

SQL>

session2:

[root@liu u01]# strace -o /u01/dbwn.log -p 18671
Process 18671 attached – interrupt to quit

pwrite64(20, “\6\242\0\0\323\21@\0\321\350\24\0\0\0\1\6\345v\0\0\2\0\’\0@\2\0\0\320\350\24\0″…, 8192, 37380096) = 8192
times(NULL) = 431759580
times(NULL) = 431759580
pwrite64(20, “\6\242\0\0\326\21@\0\321\350\24\0\0\0\1\6\31g\0\0\2\0\0\0@\2\0\0\320\350\24\0″…, 8192, 37404672) = 8192
times(NULL) = 431759581
times(NULL) = 431759581
pwrite64(20, “\6\242\0\0\333\21@\0\321\350\24\0\0\0\1\6\3624\0\0\2\0\0\0A\2\0\0\320\350\24\0″…, 8192, 37445632) = 8192
times(NULL) = 431759581
times(NULL) = 431759581
pwrite64(20, “\6\242\0\0\336\21@\0\321\350\24\0\0\0\1\6\302\221\0\0\2\0\0\0A\2\0\0\320\350\24\0″…, 8192, 37470208) = 8192
times(NULL) = 431759581
times(NULL) = 431759581
pwrite64(20, “\6\242\0\0\”+@\0\300\343\24\0\0\0\1\6\2358\0\0\1\0\0\0\351\23\0\0\277\343\24\0″…, 8192, 90456064) = 8192
times(NULL) = 431759581
times(NULL) = 431759581
pwrite64(20, “\6\242\0\0002+@\0\300\343\24\0\0\0\1\6|Z\0\0\2\0\5\0\353\23\0\0\277\343\24\0″…, 8192, 90587136) = 8192

证明了此观点。OK 那么这种现象只能定义为block corrupt–>对应 checksum error in redo log block,这么简单的现象，因为一个filesystemio设置，引申出如此多的疑虑。

解决方法:clear or drop logfile;

我们采取 clear logfile的方法：

SQL> alter database clear logfile ‘/data/oracle/oradata/paystd/std5.log’
Clearing online log 9 of thread 0 sequence number 0
Completed: alter database clear logfile ‘/data/oracle/oradata/paystd/std5.log’

SQL> alter system set filesystemio_options=setall scope=spfile;

System altered.

SQL> shutdown abort;
ORACLE instance shut down.
SQL> startup mount;
ORACLE instance started.

Total System Global Area 1.2885E+10 bytes
Fixed Size 2118256 bytes
Variable Size 1744833936 bytes
Database Buffers 1.1123E+10 bytes
Redo Buffers 14655488 bytes
Database mounted.
SQL> alter database recover managed standby database disconnect from session using current logfile;

Database altered.

SQL>

查看alert:

Recovery of Online Redo Log: Thread 1 Group 5 Seq 44 Reading mem 0
Mem# 0: /data/oracle/oradata/paystd/std1.log
Fri Dec 23 12:52:09 CST 2011
Completed: ALTER DATABASE RECOVER managed standby database using current logfile disconnect
Redo Shipping Client Connected as PUBLIC
— Connected User is Valid
RFS[2]: Assigned to RFS process 5942
RFS[2]: Identified database type as ‘physical standby’
Primary database is in MAXIMUM PERFORMANCE mode
Primary database is in MAXIMUM PERFORMANCE mode
RFS[2]: Successfully opened standby log 6: ‘/data/oracle/oradata/paystd/std2.log’
Fri Dec 23 13:31:15 CST 2011
RFS[1]: Successfully opened standby log 5: ‘/data/oracle/oradata/paystd/std1.log’
Fri Dec 23 13:31:16 CST 2011
Media Recovery Waiting for thread 1 sequence 45 (in transit)
Fri Dec 23 13:31:16 CST 2011
Recovery of Online Redo Log: Thread 1 Group 6 Seq 45 Reading mem 0
Mem# 0: /data/oracle/oradata/paystd/std2.log

APPLY 正常问题解决

总结: 一个比较简单的问题，因为同事的一句话:设置filesystemio为ASYNC解决了，导致后面一连串的认证工作，以至于怀疑DG是否允许DIO特性，看来自己对待问题的态度还不够严谨，但是由此延伸到filesystemio问题也不失为一件好事，以此记之。