帮朋友check一次中国银联数据库的坏块问题(并不是严格意义上的坏块),大致情况为smon需要trascation recover —>特定的block 发现无法读取 —>transaction recover出现问题 —>smon terminated

Wed Aug 15 19:02:57 2012
SMON: Parallel transaction recovery slave got internal error
SMON: Downgrading transaction recovery to serial
Wed Aug 15 19:02:57 2012
Errors in file /oracle/admin/settora/bdump/settora_smon_4755904.trc:
ORA-00600: internal error code, arguments: [kdourp_inorder2], [43], [43], [7], [44], [], [], []
Doing block recovery for file 10 block 176769
Doing block recovery for file 26 block 3259716

Block recovery from logseq 108030, block 49455 to scn 41320651147
Wed Aug 15 19:02:58 2012
Recovery of Online Redo Log: Thread 1 Group 5 Seq 108030 Reading mem 0
Mem# 0: /orasvr/settora/redo05.log
Block recovery completed at rba 108030.49555.16, scn 9.2665945485
ORACLE Instance settora (pid = 8) – Error 600 encountered while recovering transaction (7, 19) on object 331910.

过一段时间数据库就会down掉,从日志信息里面可以看到:

Wed Aug 15 19:23:18 2012
SMON: Parallel transaction recovery slave got internal error
SMON: Downgrading transaction recovery to serial
Wed Aug 15 19:23:18 2012
Errors in file /oracle/admin/settora/bdump/settora_smon_4755904.trc:
ORA-00600: internal error code, arguments: [kdourp_inorder2], [43], [43], [7], [44], [], [], []
Wed Aug 15 19:23:20 2012
Errors in file /oracle/admin/settora/bdump/settora_pmon_4690184.trc:
ORA-00474: SMON process terminated with error
Wed Aug 15 19:23:20 2012
PMON: terminating instance due to error 474
Instance terminated by PMON, pid = 4690184

执行下面SQL

select /*+full(a)*/ from PRO_SETTLE3.SHOPACCOUNT a;

returns:ORA-00600: internal error code, arguments: [kdourp_inorder2], [43], [43], [7], [44], [], [], []

使用dbv工具来检测:

$ dbv file=/orasvr1/settle5/settle000.dbf blocksize=8192

DBVERIFY: Release 10.2.0.4.0 - Production on Thu Aug 16 09:40:49 2012

Copyright (c) 1982, 2007, Oracle.  All rights reserved.

DBVERIFY - Verification starting : FILE = /orasvr1/settle5/settle000.dbf
kdrchk:  row is marked as both Last and Next continue
          prow=0x1101aec98 flag=0x07
Block Checking: DBA = 42119809, Block Type = KTB-managed data block
data header at 0x1101ae07c
kdbchk: bad row tab 0, slot 51
Page 176769 failed with check code 6253


DBVERIFY - Verification complete

Total Pages Examined         : 386688
Total Pages Processed (Data) : 282817
Total Pages Failing   (Data) : 1
Total Pages Processed (Index): 98760
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 2707
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 2404
Total Pages Marked Corrupt   : 0
Total Pages Influx           : 0
Highest block SCN            : 2666197038 (9.2666197038)

Total Pages Failing (Data) : 1

Page 176769 被标识为corruption

Total Pages Examined         : 512      Number of blocks looked at
Total Pages Processed (Data) : 1        Number of TABLE blocks seen
Total Pages Failing   (Data) : 1        Number of TABLE blocks with internal inconsistencies
Total Pages Processed (Index): 0        Number of INDEX blocks seen
Total Pages Failing   (Index): 0        Number of INDEX block with internal inconsistencies
Total Pages Empty            : 507      Number of unused blocks seen
Total Pages Marked Corrupt   : 2        Number of blocks with corrupt cache wrappers
Total Pages Influx           : 0        Number of pages we re-read as the page looked like it was being modified when it was first read.
SQL> select file#,block#,blocks from v$database_block_corruption;
 
no rows selected

在数据库中并没有关于坏块的记录存在。这并不是一个普通意义上的坏块。对于这种块使用corrupt skip是没有用的skip_corrupt_blocks只能使oracle跳过Oracle能够读出的块,而如果在操作系统层read调用就失败的,则不能跳过该块。
oracle 在read page 176769时发生了error 6253 (cannot read arguments from address file,其实这并不是真正的root cause,我在和huangyong讨论过后基本认定为一个内部的inconsistent block) oracle 不能 mark 为 corrupt block.

这里的思路为使用rowid来跳过这些有问题的block 大致方法如下

SQL> SELECT dbms_rowid.rowid_create(1,331910,10,176769,0) LOW_RID from DUAL;

LOW_RID
——————
AABRCGAAKAAArKBAAA

SQL> SELECT dbms_rowid.rowid_create(1,331910,10,176770,0) HI_RID from dual;

HI_RID
——————
AABRCGAAKAAArKCAAA

SQL> SELECT dbms_rowid.rowid_create(1,331910,26,3259716,0) LOW_RID from DUAL;

LOW_RID
——————
AABRCGAAaAAMb1EAAA

SQL> SELECT dbms_rowid.rowid_create(1,331910,26,3259717,0) HI_RID from DUAL;

HI_RID
——————
AABRCGAAaAAMb1FAAA

SQL> create table PRO_SETTLE3.SHOPACCOUNT_1 as select /*+ rowid(a) */ * from PRO_SETTLE3.SHOPACCOUNT a where rowid < 'AABRCGAAKAAArKBAAA'; Table created. SQL> insert into PRO_SETTLE3.SHOPACCOUNT_1 select /*+ rowid(a) */ * from PRO_SETTLE3.SHOPACCOUNT a where rowid >= ‘AABRCGAAKAAArKCAAA’ and rowid < 'AABRCGAAaAAMb1EAAA'; 49626 rows created. SQL> insert into PRO_SETTLE3.SHOPACCOUNT_1 select /*+ rowid(a) */ * from PRO_SETTLE3.SHOPACCOUNT a where rowid >= ‘AABRCGAAaAAMb1FAAA’;

208837 rows created.

SQL> select count(*) from PRO_SETTLE3.SHOPACCOUNT_1;

COUNT(*)
———-
262256

之后drop掉这张有问题的table restart database 恢复正常.

同样我们可以采用Kerry Osborne的脚本来实现.这里稍微修改了下 增加了选择tablespace 的功能temp

eg:

[oracle@testdb ~]$ dd if=/dev/zero  of=/data/oracle/oradata/wuxuan1/liu.dbf   bs=8192 seek=1798  count=1 conv=notrunc
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 5.2e-05 seconds, 158 MB/s
[oracle@testdb ~]$ sqlplus liu/liu

SQL*Plus: Release 11.2.0.3.0 Production on Fri Aug 17 02:32:20 2012

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Error accessing PRODUCT_USER_PROFILE
Warning:  Product user profile information not loaded!
You may need to run PUPBLD.SQL as SYSTEM

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> select count(*) from temp;
select count(*) from temp
                     *
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 4, block # 1798)
ORA-01110: data file 4: '/data/oracle/oradata/wuxuan1/liu.dbf'


SQL> @temp	

WARNING: This script may issue a DROP TABLE command. Do not execute it unless you have read through it
and are comfortable you know what it does.

Ready? (hit ctl-C to quit)  
Enter value for owner_name: LIU
old   8: v_owner_name varchar2(30) := upper('&owner_name');
new   8: v_owner_name varchar2(30) := upper('LIU');
Enter value for table_name: TEMP
old   9: v_table_name varchar2(30) := upper('&table_name');
new   9: v_table_name varchar2(30) := upper('TEMP');
Enter value for tablespace: LIU
old  11: v_tablespace varchar2(30) := upper('&tablespace');
new  11: v_tablespace varchar2(30) := upper('LIU');


Saved 13247 records in TEMP_SAVED.
201 bads records in TEMP_BAD.

PL/SQL procedure successfully completed.

SQL> 
SQL> select count(*) from temp_saved;

  COUNT(*)
----------
     13247
     
SQL> select  * from  v$database_block_corruption;

     FILE#     BLOCK#	  BLOCKS CORRUPTION_CHANGE# CORRUPTIO
---------- ---------- ---------- ------------------ ---------
	 4	  258	       1		  0 ALL ZERO
	 4	 1798	       1		  0 ALL ZERO

block 1798 已经被标识为media corruption

手动构造rowid

SQL> select * from temp_bad where rownum<5; 

OLD_ROWID			 OLD_FILE OLD_OBJECT  OLD_BLOCK    OLD_ROW
------------------------------ ---------- ---------- ---------- ----------
ERROR_MESSAGE
--------------------------------------------------------------------------------
4.1798.181			    13646	   4	   1798        181
ORA-01578: ORACLE data block corrupted (file # 4, block # 1798)
ORA-01110: data file 4: '/data/oracle/oradata/wuxuan1/liu.dbf'

4.1798.182			    13646	   4	   1798        182
ORA-01578: ORACLE data block corrupted (file # 4, block # 1798)
ORA-01110: data file 4: '/data/oracle/oradata/wuxuan1/liu.dbf'

4.1798.183			    13646	   4	   1798        183

OLD_ROWID			 OLD_FILE OLD_OBJECT  OLD_BLOCK    OLD_ROW
------------------------------ ---------- ---------- ---------- ----------
ERROR_MESSAGE
--------------------------------------------------------------------------------
ORA-01578: ORACLE data block corrupted (file # 4, block # 1798)
ORA-01110: data file 4: '/data/oracle/oradata/wuxuan1/liu.dbf'

4.1798.184			    13646	   4	   1798        184
ORA-01578: ORACLE data block corrupted (file # 4, block # 1798)
ORA-01110: data file 4: '/data/oracle/oradata/wuxuan1/liu.dbf'

SQL> select dbms_rowid.rowid_create(1,13646,4,1798,181) from dual;

DBMS_ROWID.ROWID_C
------------------
AAADVOAAEAAAAcGAC1

SQL> select dbms_rowid.rowid_create(1,13646,4,1798,182) from dual;

DBMS_ROWID.ROWID_C
------------------
AAADVOAAEAAAAcGAC2




SQL> select * from temp where rowid ='AAADVOAAEAAAAcGAC1';
select * from temp where rowid ='AAADVOAAEAAAAcGAC1'
              *
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 4, block # 1798)
ORA-01110: data file 4: '/data/oracle/oradata/wuxuan1/liu.dbf'


SQL> select * from temp where rowid ='AAADVOAAEAAAAcGAC2';
select * from temp where rowid ='AAADVOAAEAAAAcGAC2'
              *
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 4, block # 1798)
ORA-01110: data file 4: '/data/oracle/oradata/wuxuan1/liu.dbf'

Reference:
Extracting Data from a Corrupt Table using ROWID Range Scans in Oracle8 and higher [ID 61685.1]
Extracting Data from a Corrupt Table using DBMS_REPAIR or Event 10231 [ID 33405.1]
Extract rows from a CORRUPT table creating ROWID from DBA_EXTENTS [ID 422547.1]