Some excellent experiment From Honcho

为了保护disks 上的数据,Oracle ASM 对disk groups 有3种冗余方式:
(1)external redundancy 表示Oracle不帮你管理镜像,功能由外部存储系统实现,比如通过RAID技术。
(2)normal redundancy(默认方式)表示Oracle提供2路镜像来保护数据。
(3)high redundancy 表示Oracle提供3路镜像来保护数据。

如果使用ASM的冗余,就是通过 ASM failure group 来实现。ASM使用的镜像算法并不是镜像整个disk,而是作extent级的镜像。所以如果为各个failure group使用不同容量的disk是不明智的,因为这样在Oracle分配下一个extent的时候可能就会出现问题。
在normal redundancy模式下,ASM环境中每分配一个extent都会有一个primary copy和一个secondary copy,ASM的算法保证了secondary copy和primary copy一定是在不同的failure group中,这就是failure group的意义。通过这个算法,ASM保证了即使一个failure group中的所有disk都损坏了,数据也不会丢失。
Oracle在分配extent的时候,所有failure group中的这个将拥有相同数据的extent称为一个extent set,当Oracle将数据写入文件的时候,primary copy可能在任何一个failure group中,而second copy则在另外的failure group中,当Oracle读取数据的时候,除非是primary copy不可用,否则将优先从primary copy中读取数据,通过这种写入无序,读取有序的算法,Oracle保证了数据读取尽量分布在多个disk中。

那么在ASM 内部是如果处理块损坏的呢?Oracle 文档给的描述如下:

A) Block Corruption in Primary Extent
Block reads from disk is affected. Corruption will be found during read from primary extent, we report an ORA-1578 in the ASM alert.log and do a re-read of the same block (primary extent). If the second read fails as well, we read the block from the secondary extent (on a different physical disk) next. This failover to reading from secondary extent happens automatically and transparent for the user/session. If the secondary extent is not corrupt the user will get the results without any error indication other than ORA-1578 about the failed primary extent read in alert.log. Finally ASM will overwrite the corrupt block in the primary extent with the good copy from the secondary. This happens even when the block is unaffected by any transaction. ASM FINDS corrupt blocks in primary extent. That triggers automatically repair at next read.

B) Block Corruption in Secondary Extent
Because block reads always goes to the primary extent first, a corruption in the secondary extent does not have any effect on a read operation. At the next write the block becomes dirty and the new block image will overwrite the old corrupt block on disk. In case of block corruption in the secondary extent there is no diagnostics information available. A corruption in the secondary extent will normally only be seen if the block in the primary extent is also corrupt. ASM fixes corrupt blocks in secondary extent automatically at next write of the block.

下面根据上述描述,做实验验证一下,本实验平台是Oracle 11.2.0.3 for Linux x64

首先创建一组normal redundancy 磁盘组DATA1
SQL> create diskgroup data1 normal redundancy failgroup f1 disk '/dev/asm-diske' failgroup f2 disk '/dev/asm-diskf';
 
Diskgroup created.

创建测试用数据:
SQL> create tablespace xhz datafile '+DATA1' size 100m;
 
Tablespace created.
 
SQL> select name from v$datafile;
 
NAME
--------------------------------------------------------------------------------
+DATA/xhz/datafile/system.260.799281453
+DATA/xhz/datafile/sysaux.261.799281497
+DATA/xhz/datafile/undotbs1.262.799281531
+DATA/xhz/datafile/users.264.799281563
+DATA/xhz/datafile/honcho.266.799334723
+DATA1/xhz/datafile/xhz.256.800304113
 
6 rows selected.
 
SQL> create table honcho (id number,name varchar2(20)) tablespace xhz;
 
Table created.
 
SQL> begin
  2    for i in 1..10000 loop
  3     insert into honcho values (i,'honcho');
  4     end loop;
  5    commit;
  6  end;
  7  /
 
PL/SQL procedure successfully completed.
 
SQL> create index hon_id_idx on honcho(id) nologging;
                                                    
Index created.

查看表HONCHO 在disk上的分布情况:

SQL> select file_id,extent_id,block_id,blocks from dba_extents where segment_name='HONCHO';
 
   FILE_ID  EXTENT_ID   BLOCK_ID     BLOCKS
---------- ---------- ---------- ----------
         6          0        128          8
         6          1        136          8
         6          2        144          8
         6          3        152          8

任意找一条记录:
SQL> select rowid,id from honcho where id=888;
 
ROWID                      ID
------------------ ----------
AAADbBAAGAAAACEAGr        888             
 
SQL> select dbms_rowid.rowid_object(rowid) object_id,     
  2         dbms_rowid.rowid_relative_fno(rowid) file_id, 
  3         dbms_rowid.rowid_block_number(rowid) block_id,
  4         dbms_rowid.rowid_row_number(rowid) num         
  5    from honcho
  6    where id=888;
 
 OBJECT_ID    FILE_ID   BLOCK_ID        NUM
---------- ---------- ---------- ----------
     14017          6        132        427

该记录位于file # 6 BLOCK_ID 在128和136之间,又因为每一个extent大小默认为1AU,本实验1AU=1M,即一个AU最多容纳128个blocks,因此该数据分布在ASM上属于该segment的第二个AU上。
SQL> select group_number,disk_number,name,path from v$asm_disk where group_number=1;
 
GROUP_NUMBER DISK_NUMBER NAME                           PATH
------------ ----------- ------------------------------ ----------------------------------------
           1           1 DATA1_0001                     /dev/asm-diskf
           1           0 DATA1_0000                     /dev/asm-diske
 
SQL> select name from v$datafile where file#=6;
 
NAME
---------------------------------------------------------
+DATA1/xhz/datafile/xhz.256.800304113

从+group/dbname/file_type/tag.file#.incarnation# 可以得出file# 6 在ASM中的file# 为256,通过x$kffxp视图可以查找到数据在ASM磁盘组中的分布情况:

SQL> select  XNUM_KFFXP,  DISK_KFFXP,  AU_KFFXP,  INDX from x$kffxp where group_kffxp=1 and  NUMBER_KFFXP=256;
 
XNUM_KFFXP DISK_KFFXP   AU_KFFXP       INDX
---------- ---------- ---------- ----------
         0          1         52          0 
         0          0         52          1 
         1          0         53          2  ---> primary extent 位于disk 0 (/dev/asm-diske)上
         1          1         53          3  ---> secondary extent 位于disk1 (/dev/asm-diskf)上
         2          1         54          4
         2          0         54          5
         3          0         55          6
         3          1         55          7
         4          1         56          8
         4          0         56          9
         5          0         57         10
       ...         ...       ...        ...
       ...         ...       ...        ...
       100          1        153        200
       100          0        153        201
2147483648          0         82        202
2147483648          1         82        203
2147483648      65534 4294967294        204

从上面的查询可以得知id=888的记录primary extent 位于AU=53 /dev/asm-diske上,secondary extent 位于AU=53 /dev/asm-diskf上。由于1AU=1024K,block_szie=8K,因此53AU=6784 blocks,通过dd命令查看第6788个blocks上的数据记录。

首先查看primary extent上的数据:

[grid@honcho ~]$ dd if=/dev/asm-diske bs=8K skip=6788 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 5.5e-05 seconds, 149 MB/s
0000000 a206 0000 0085 0180 df32 0003 0000 0601
        ack   " nul nul enq nul nul soh   2   _ etx nul nul nul soh ack
0000020 b1f9 0000 0001 0000 365d 0000 df15 0003
          y   1 nul nul soh nul nul nul   ]   6 nul nul nak   _ etx nul
0000040 0000 0000 0002 0032 0080 0180 0003 0000
        nul nul nul nul stx nul   2 nul nul nul nul soh etx nul nul nul
0000060 00d6 0000 069f 00c0 0032 000e 21c6 0000
          V nul nul nul  us ack   @ nul   2 nul  so nul   F   ! nul nul
0000100 df32 0003 0000 0000 0000 0000 0000 0000
          2   _ etx nul nul nul nul nul nul nul nul nul nul nul nul nul
0000120 0000 0000 0000 0000 0000 0000 0000 0000
        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
0000140 0000 0000 0100 01c6 ffff 039e 06c9 032b
        nul nul nul nul nul soh   F soh del del  rs etx   I ack   + etx
0000160 032b 0000 01c6 1f8a 1f7c 1f6e 1f60 1f52
          + etx nul nul   F soh  nl  us   |  us   n  us   `  us   R  us
0000200 1f44 1f36 1f28 1f1a 1f0c 1efe 1ef0 1ee2
          D  us   6  us   (  us sub  us  ff  us   ~  rs   p  rs   b  rs
          ...      ...          ...        ...         ...
          ...      ...          ...        ...         ...
0017760 0302 0ec2 0646 6f68 636e 6f68 0601 df32
        stx etx   B  so   F ack   h   o   n   c   h   o soh ack   2   _
0020000

查看secondary extent 上的数据:

[grid@honcho ~]$ dd if=/dev/asm-diskf bs=8K skip=6788 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.009921 seconds, 826 kB/s
0000000 a206 0000 0085 0180 df32 0003 0000 0601
        ack   " nul nul enq nul nul soh   2   _ etx nul nul nul soh ack
0000020 b1f9 0000 0001 0000 365d 0000 df15 0003
          y   1 nul nul soh nul nul nul   ]   6 nul nul nak   _ etx nul
0000040 0000 0000 0002 0032 0080 0180 0003 0000
        nul nul nul nul stx nul   2 nul nul nul nul soh etx nul nul nul
0000060 00d6 0000 069f 00c0 0032 000e 21c6 0000
          V nul nul nul  us ack   @ nul   2 nul  so nul   F   ! nul nul
0000100 df32 0003 0000 0000 0000 0000 0000 0000
          2   _ etx nul nul nul nul nul nul nul nul nul nul nul nul nul
0000120 0000 0000 0000 0000 0000 0000 0000 0000
        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
0000140 0000 0000 0100 01c6 ffff 039e 06c9 032b
        nul nul nul nul nul soh   F soh del del  rs etx   I ack   + etx
0000160 032b 0000 01c6 1f8a 1f7c 1f6e 1f60 1f52
          + etx nul nul   F soh  nl  us   |  us   n  us   `  us   R  us
0000200 1f44 1f36 1f28 1f1a 1f0c 1efe 1ef0 1ee2
          D  us   6  us   (  us sub  us  ff  us   ~  rs   p  rs   b  rs
          ...      ...          ...        ...         ...
          ...      ...          ...        ...         ...
0017760 0302 0ec2 0646 6f68 636e 6f68 0601 df32
        stx etx   B  so   F ack   h   o   n   c   h   o soh ack   2   _
0020000

通过对比可以看出两块盘上dd出来的数据是一样的。
现在首先模拟secondary extent上的数据损坏:
[grid@honcho ~]$ dd if=/dev/zero of=/dev/asm-diskf seek=6788 bs=8K count=1
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 5.5e-05 seconds, 149 MB/s

可以看出现在记录为null:

[grid@honcho ~]$ dd if=/dev/asm-diskf bs=8K skip=6788 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 4.3e-05 seconds, 191 MB/s
0000000 0000 0000 0000 0000 0000 0000 0000 0000
        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0020000

此时依旧可以查到数据:

SQL> alter system flush buffer_cache;
 
System altered.
 
SQL> select * from honcho where id=888;
 
        ID NAME
---------- --------------------
       888 honcho

重启数据库:

SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.
 
Total System Global Area  546992128 bytes
Fixed Size                  2230192 bytes
Variable Size             167774288 bytes
Database Buffers          373293056 bytes
Redo Buffers                3694592 bytes
Database mounted.
Database opened.
SQL> select * from honcho where id=888;
 
        ID NAME
---------- --------------------
       888 honcho

没有任何异常,看secondary extent上是否有数据:

[grid@honcho ~]$ dd if=/dev/asm-diskf bs=8K skip=6789 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 4.9e-05 seconds, 167 MB/s
0000000 0000 0000 0000 0000 0000 0000 0000 0000
        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0020000

此时发现仍然为null,接下来update honcho表:
SQL> update honcho set name='xhz' where id=888;
 
1 row updated.
 
SQL> commit;
 
Commit complete.
 
SQL> select * from honcho where id=888;
 
        ID NAME
---------- --------------------
       888 xhz
 
Commit complete.

再次查看:
[grid@honcho ~]$ dd if=/dev/asm-diskf bs=8K skip=6789 count=1|od -xa

1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 5.2e-05 seconds, 158 MB/s
0000000 0000 0000 0000 0000 0000 0000 0000 0000
        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0020000

依旧为NULL,清理一下buffer cache:
SQL> alter system flush buffer_cache;
 
System altered.
 
SQL> select * from honcho where id=888;
 
        ID NAME
---------- --------------------
       888 xhz

此时发现secondary extent上已经有新数据:

[grid@honcho ~]$ dd if=/dev/asm-diskf bs=8K skip=6789 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.039857 seconds, 206 kB/s
0000000 a206 0000 0085 0180 f0ca 0003 0000 0601
        ack   " nul nul enq nul nul soh   J   p etx nul nul nul soh ack
0000020 8e2e 0000 0001 0000 365d 0000 f0a0 0003
          .  so nul nul soh nul nul nul   ]   6 nul nul  sp   p etx nul
0000040 0000 0000 0002 0032 0080 0180 0003 0000
        nul nul nul nul stx nul   2 nul nul nul nul soh etx nul nul nul
0000060 00d6 0000 069f 00c0 0032 000e 8000 0000
          V nul nul nul  us ack   @ nul   2 nul  so nul nul nul nul nul
0000100 df32 0003 0003 0003 00d8 0000 06f7 00c0
          2   _ etx nul etx nul etx nul   X nul nul nul   w ack   @ nul
0000120 0032 0035 21c6 0552 f0ca 0003 0000 0000
          2 nul   5 nul   F   !   R enq   J   p etx nul nul nul nul nul
0000140 0000 0000 0100 01c6 ffff 039e 0b73 032b
        nul nul nul nul nul soh   F soh del del  rs etx   s  vt   + etx
0000160 087d 0000 01c6 1f8d 1f82 1f77 1f6c 1f61
          }  bs nul nul   F soh  cr  us stx  us   w  us   l  us   a  us
0000200 1f56 1f4b 1f40 1f35 1f2a 1f1f 1f14 1f09
          V  us   K  us   @  us   5  us   *  us  us  us dc4  us  ht  us
          ...      ...          ...        ...         ...
          ...      ...          ...        ...         ...         
0017760 2c7a 0202 c203 460e 7803 7a68 0601 f0ca
          z   , stx stx etx   B  so   F etx   x   h   z soh ack   J   p
0020000

下面接下来模拟primary extent 上数据损坏:

[grid@honcho ~]$ dd if=/dev/zero of=/dev/asm-diske seek=6788 bs=8K count=1
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 6.3e-05 seconds, 130 MB/s
[grid@honcho ~]$ dd if=/dev/asm-diske bs=8K skip=6788 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 4.6e-05 seconds, 178 MB/s
0000000 0000 0000 0000 0000 0000 0000 0000 0000
        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0020000

可以看到primary extent 已经没记录了,现在flush 一下 buffer_cache

SQL> alter system flush buffer_cache;
 
System altered.
 
SQL> select * from honcho where id=888;
 
        ID NAME
---------- --------------------
       888 xhz

从alert日志中可以看到如下信息:
Mon Nov 19 19:09:34 2012
ALTER SYSTEM: Flushing buffer cache
Mon Nov 19 19:09:52 2012
Hex dump of (file 6, block 132) in trace file /u01/app/oracle/diag/rdbms/xhz/xhz/trace/xhz_ora_7588.trc
Corrupt block relative dba: 0x01800085 (file 6, block 132)
Completely zero block found during multiblock buffer read
Reading datafile '+DATA1/xhz/datafile/xhz.256.799758019' for corruption at rdba: 0x01800085 (file 6, block 132)
Read datafile mirror 'DATA1_0000' (file 6, block 132) found same corrupt data (no logical check)
Read datafile mirror 'DATA1_0001' (file 6, block 132) found valid data
Repaired corruption at (file 6, block 132)

从日志中我们看到在查询的时候发现DATA1_0000损坏,于是尝试read secondary extent,最后repaired disk。

[grid@honcho ~]$ dd if=/dev/asm-diske bs=8K skip=6788 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.047051 seconds, 174 kB/s
0000000 a206 0000 0085 0180 f0ca 0003 0000 0601
        ack   " nul nul enq nul nul soh   J   p etx nul nul nul soh ack
0000020 8e2e 0000 0001 0000 365d 0000 f0a0 0003
          .  so nul nul soh nul nul nul   ]   6 nul nul  sp   p etx nul
0000040 0000 0000 0002 0032 0080 0180 0003 0000
        nul nul nul nul stx nul   2 nul nul nul nul soh etx nul nul nul
0000060 00d6 0000 069f 00c0 0032 000e 8000 0000
          V nul nul nul  us ack   @ nul   2 nul  so nul nul nul nul nul
0000100 df32 0003 0003 0003 00d8 0000 06f7 00c0
          2   _ etx nul etx nul etx nul   X nul nul nul   w ack   @ nul
0000120 0032 0035 21c6 0552 f0ca 0003 0000 0000
          2 nul   5 nul   F   !   R enq   J   p etx nul nul nul nul nul
0000140 0000 0000 0100 01c6 ffff 039e 0b73 032b
        nul nul nul nul nul soh   F soh del del  rs etx   s  vt   + etx
0000160 087d 0000 01c6 1f8d 1f82 1f77 1f6c 1f61
          }  bs nul nul   F soh  cr  us stx  us   w  us   l  us   a  us
0000200 1f56 1f4b 1f40 1f35 1f2a 1f1f 1f14 1f09
          V  us   K  us   @  us   5  us   *  us  us  us dc4  us  ht  us
          ...      ...          ...        ...         ...
          ...      ...          ...        ...         ...

0017760 2c7a 0202 c203 460e 7803 7a68 0601 f0ca
          z   , stx stx etx   B  so   F etx   x   h   z soh ack   J   p
0020000

当再次查看primary extent 的时候发现已经有记录了。

下面测试将primary extent上和secondary extent上的块都损坏:

[grid@honcho ~]$ dd if=/dev/zero of=/dev/asm-diske seek=6788 bs=8K count=1
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 8.5e-05 seconds, 96.4 MB/s
[grid@honcho ~]$ dd if=/dev/zero of=/dev/asm-diskf seek=6788 bs=8K count=1
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000117 seconds, 70.0 MB/s

查看id=888的记录:
SQL> alter system flush buffer_cache;
 
System altered.
 
SQL> select * from honcho where id=888;
select * from honcho where id=888
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 6, block # 132)
ORA-01110: data file 6: '+DATA1/xhz/datafile/xhz.256.800304113'

查询其他的记录
SQL> select * from honcho where id=100;
 
        ID NAME
---------- --------------------
       100 honcho

可以看到file# 6,block #132 上的数据已被彻底损坏,非file #6,block# 132的数据依然可以查到。

查看日志:

Sun Nov 19 19:24:39 2012
ALTER SYSTEM: Flushing buffer cache
Hex dump of (file 6, block 132) in trace file /u01/app/oracle/diag/rdbms/xhz/xhz/trace/xhz_ora_1915.trc
Corrupt block relative dba: 0x01800084 (file 6, block 132)
Completely zero block found during multiblock buffer read
Reading datafile '+DATA1/xhz/datafile/xhz.256.800304113' for corruption at rdba: 0x01800084 (file 6, block 132)
Read datafile mirror 'DATA1_0000' (file 6, block 132) found same corrupt data (no logical check)
Read datafile mirror 'DATA1_0001' (file 6, block 132) found same corrupt data (no logical check)
Errors in file /u01/app/oracle/diag/rdbms/xhz/xhz/trace/xhz_ora_1915.trc  (incident=18163):
ORA-01578: ORACLE data block corrupted (file # 6, block # 132)
ORA-01110: data file 6: '+DATA1/xhz/datafile/xhz.256.800304113'
Incident details in: /u01/app/oracle/diag/rdbms/xhz/xhz/incident/incdir_18163/xhz_ora_1915_i18163.trc
Sun Nov 19 19:24:49 2012
Starting background process SMCO
Sun Nov 19 19:24:50 2012
SMCO started with pid=24, OS id=3119
Corrupt Block Found
         TSN = 6, TSNAME = XHZ
         RFN = 6, BLK = 132, RDBA = 25165956
         OBJN = 14017, OBJD = 14017, OBJECT = HONCHO, SUBOBJECT =
         SEGMENT OWNER = SYS, SEGMENT TYPE = Table Segment
Sun Nov 19 19:24:59 2012
Sweep [inc][18163]: completed
Hex dump of (file 6, block 132) in trace file /u01/app/oracle/diag/rdbms/xhz/xhz/incident/incdir_18163/xhz_m000_3156_i18163_a.trc
Corrupt block relative dba: 0x01800084 (file 6, block 132)
Completely zero block found during validation
Trying mirror side DATA1_0000.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0001.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0000.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0001.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0000.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0001.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0000.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0001.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0000.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0001.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
SunNov 19 19:04:59 2012
Dumping diagnostic data in directory=[cdmp_20121125190459], requested by (instance=1, osid=1915), summary=[incident=18163].
Checker run found 1 new persistent data failures

通过以上实验得出的结论和描述相符。但是这里还是有一个小小的缺陷,如果secondary extent 损坏,随后无脏数据,在这段时间内如果primary extent损坏,还是依旧有数据丢失,所以Oracle 有这样一句描述:

When a disk block becomes corrupt for whatever reason, the corruption can affect blocks in primary, secondary, or both extents. However, the probability that the same block becomes corrupt in both primary and secondary extent at the same time is very low.