前天处理了一启RAC ASM 内存泄露问题,系统为公司最后一套RAC one node 版本为11.2.0.3

这套核心RAC one node系统出现大量 log file switch buffer busy waits等待,并伴随大量cursor: pin S wait on X read by other session。

SID USERNAME   MACHINE                EVENT                          PARAM                   W   WT SQL                          ST     LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
 10917 PROD_DATA2 JDBC Thin Client       buffer busy waits              14/907008/8             0  294 /bgx0f86ucnw8v               A     355       7628
 13549 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  172 /520mkxqpf15q8               A     294      20514
  4456 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  175 6rsq14x8a7n5r/dsxmfbstazq7u  A     525      94053
  4521 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  175 6rsq14x8a7n5r/dsxmfbstazq7u  A     508      94052
  6831 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  168 1k8vrykzkvccg/d5dfm29f41s0d  A     739       1400
  3170 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:12:  1397817350/21363/0      0  391 4uj3xtfzq3a91/gky4m8njtfrcd  A     392     536146
  4611 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  181 fqc7yvkzxjzk7/cqb6f9fkfkcsq  A     725        932
 13448 PROD_DATA2 JDBC Thin Client       buffer busy waits              56/645165/1             0  730 fqc7yvkzxjzk7/cqb6f9fkfkcsq  A     730        736
   940 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  181 fqc7yvkzxjzk7/cqb6f9fkfkcsq  A     710        953
  8026 PROD_DATA2 JDBC Thin Client       buffer busy waits              20/279682/8             0  441 3f0thwr0rg23c/c4h5dg81ffxyr  A     441      94043
  9290 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  178 3f0thwr0rg23c/c4h5dg81ffxyr  A     220        473
  2750 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  199 b3dmt8325pmvf/bpk6g015uzqzt  A     417      81687
  3510 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  223 c5n6d774as72b/5ak0xdkuvc65n  A     273     536146
  2323 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  172 c5n6d774as72b/5ak0xdkuvc65n  A     407     536140
  3767 PROD_DATA2 JDBC Thin Client       buffer busy waits              21/35200/8              0  725 5q60jbr6700u7/g2th2agwfkk1t  A     725       1133
 10205 PROD_DATA2 JDBC Thin Client       buffer busy waits              21/35200/8              0  717 5q60jbr6700u7/g2th2agwfkk1t  A     717       1186
 10046 PROD_DATA2 JDBC Thin Client       buffer busy waits              21/35200/8              0  670 5q60jbr6700u7/g2th2agwfkk1t  A     670       1186
 10044 PROD_DATA2 JDBC Thin Client       buffer busy waits              21/35200/8              0  689 5q60jbr6700u7/g2th2agwfkk1t  A     689      24971
  6395 PROD_DATA2 JDBC Thin Client       buffer busy waits              21/35201/8              0  693 5q60jbr6700u7/g2th2agwfkk1t  A     693       1522
  9535 PROD_DATA2 JDBC Thin Client       buffer busy waits              21/35200/8              0  703 5q60jbr6700u7/g2th2agwfkk1t  A     703      42595
  6917 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  175 5q60jbr6700u7/g2th2agwfkk1t  A     709       1221
  2902 PROD_DATA2 JDBC Thin Client       buffer busy waits              21/35200/8              0  700 5q60jbr6700u7/g2th2agwfkk1t  A     700       2146
  9554 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  168 5q60jbr6700u7/g2th2agwfkk1t  A     726       2147
 11662 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:73:  1397817350/21911/0      0  377 cg3urj768yv8g/apun8g5mb9byt  A     378      18995
  9378 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  168 cg3urj768yv8g/7g63srczfgfd6  A     627      19057
  8698 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:73:  1397817350/21911/0      0  551 cg3urj768yv8g/5smgjxsscf91w  A     551      18996
  7575 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:0:   1397817350/21911/0      0  204 cg3urj768yv8g/793hdqw3bn04q  A     204      19035

...
 13028 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:0:   1397817350/21911/0      0  623 cg3urj768yv8g/8s97vpbnuktp2  A     623      19061

...
  6659 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:0:   1397817350/21911/0      0  615 cg3urj768yv8g/8s97vpbnuktp2  A     616      19035


  5187 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:56:  1397817350/21911/0      0  418 cg3urj768yv8g/793hdqw3bn04q  A     418      19069
  4933 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:0:   1397817350/21911/0      0  576 cg3urj768yv8g/9quv2rcpqjxwt  A     576      18977


  4346 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:0:   1397817350/21911/0      0  219 cg3urj768yv8g/8s97vpbnuktp2  A     546        943

....
  4082 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:14:  1397817350/21911/0      0  460 cg3urj768yv8g/520mkxqpf15q8  A     460       1522
  7267 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:73:  1397817350/21911/0      0  415 cg3urj768yv8g/8s97vpbnuktp2  A     415        919
  9532 PROD_DATA2 JDBC Thin Client       DFS lock handle                1398145029/21916/0      0  353 2xw94b38cybhv/4001f9ha480pb  A     354     358380
  9716 PROD_DATA2 JDBC Thin Client       DFS lock handle                1398145029/21916/0      0  264 2xw94b38cybhv/4001f9ha480pb  A     265       2147
   512 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  168 fhd852bb8agv6/gq6xnzt1spkzp  A     654     533810
  7001 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  178 4q0c1a3bsynyw/guwq1fpq9r1dy  A     261     536140
 11348 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  175 d3wp3qvbz5bna/1w0c0jhzmp1hh  A     226       7299
 11164 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  168 9fwybsrc3h1dd/70k37x7hwg0w5  A     701       1185
  3096 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  220 9fwybsrc3h1dd/70k37x7hwg0w5  A     510       1522
  8100 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  181 91r0dubcptymp/dsxmfbstazq7u  A     649      95706
  8192 PROD_DATA2 JDBC Thin Client       buffer busy waits              24/274502/1             0  645 91r0dubcptymp/dsxmfbstazq7u  A     645      94041
  8783 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  175 91r0dubcptymp/dsxmfbstazq7u  A     664      94039
  6667 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  175 91r0dubcptymp/dsxmfbstazq7u  A     685       1272
  5445 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  175 91r0dubcptymp/dsxmfbstazq7u  A     688       1185
    38 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  199 91r0dubcptymp/dsxmfbstazq7u  A     695       1263
  9386 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  181 91r0dubcptymp/dsxmfbstazq7u  A     685      95693
  8868 PROD_DATA2 JDBC Thin Client       buffer busy waits              53/118272/8             0  675 ghy358rghc4ty/3p3z1j33tk594  A     675     486608
  6475 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  168 ghy358rghc4ty/3p3z1j33tk594  A     706     486608
  6899 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  178 ftmka9gnhru7t/1w0c0jhzmp1hh  A     738     442302
  4003 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  175 8a5g4frpyags9/1w0c0jhzmp1hh  A     401     435103

   SID USERNAME   MACHINE                EVENT                          PARAM                   W   WT SQL                          ST     LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
   861 PROD_DATA2 dcb-srv-0343-vm02      buffer busy waits              60/1078781/1            0  504 fjnq76rpzzpdy/a6tzf0tjrwvqh  A     505      20966
 12361 PROD_DATA2 xen-4-211-vm05         buffer busy waits              60/1078781/1            0  402 fjnq76rpzzpdy/a6tzf0tjrwvqh  A     403      20976
  6213 TOAD       JDBC Thin Client       buffer busy waits              41/34704/571            0  542 5kjukxbrv5w3r/dbkpyhyfyvdpu  A     542    1577749
  7513 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  168 cvvzaj3syc76b/1w0c0jhzmp1hh  A     703     502831
  8015 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  175 1tg6ux7vc22zm/32bjxyxc0xwfs  A     463        899
 10817 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  205 1tg6ux7vc22zm/32bjxyxc0xwfs  A     720      18977
 12268 PROD_DATA2 JDBC Thin Client       buffer busy waits              57/1302268/1            0  283 3szwvvzw5aq13/bpaj7frdt63mx  A     284      82802
  5959 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  178 3szwvvzw5aq13/bpaj7frdt63mx  A     583      75654
 10975 PROD_DATA2 JDBC Thin Client       read by other session          27/223916/1             0    2 2b47rbdt23jp8/a4td25ftgtqt8  A      40         51
 10994 PROD_DATA2 JDBC Thin Client       read by other session          27/223916/1             0    2 2b47rbdt23jp8/459f3z9u4fb3u  A      34         34
 10999 PROD_DATA2 JDBC Thin Client       read by other session          27/223916/1             0    2 2b47rbdt23jp8/a4td25ftgtqt8  A      30         30
 11015 PROD_DATA2 JDBC Thin Client       read by other session          27/223916/1             0    2 2b47rbdt23jp8/a4td25ftgtqt8  A       8          9
  4688 PROD_DATA2 JDBC Thin Client       read by other session          20/1806690/1            0    2 4gzs73jtmhyd0/f2n7vrpx4v30v  A       4         10
  4626 PROD_DATA2 JDBC Thin Client       db file sequential read        27/630098/1             0    0 11n7q3tu19vvp/adghpykwhquu2  A       1         23
 10566 PROD_DATA2 JDBC Thin Client       db file sequential read        57/780833/1             0    0 50731ypu4mu31/adghpykwhquu2  A       1         34
 12527 PROD_DATA2 xen-21205-vm02         db file sequential read        28/168522/1             0    0 7szz51xufmuc4/520mkxqpf15q8  A       5         33
  5489 PROD_DATA2 JDBC Thin Client       read by other session          60/931838/1             0    1 1zbqmb9ukkhpd/459f3z9u4fb3u  A      13         13
  5063 PROD_DATA2 JDBC Thin Client       db file sequential read        17/871948/1             0    2 g8jjuy5wt7h92/64yswjj6cdv5j  A      12         16

   SID USERNAME   MACHINE                EVENT                          PARAM                   W   WT SQL                          ST     LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
 10302 PROD_DATA2 JDBC Thin Client       cursor: pin S wait on X        2044424941/278013233    0   44 42zra6dwxqwrd/cnm20xg9c333h  A      44         44
                                                                        07008/21474836480

 11330 PROD_DATA2 JDBC Thin Client       cursor: pin S wait on X        2044424941/278013233    0   15 42zra6dwxqwrd/523rw17fc75fb  A      14         38
                                                                        07008/21474836480
.....
 11698 PROD_DATA2 JDBC Thin Client       cursor: pin S wait on X        2044424941/278013233    0    2 42zra6dwxqwrd/459f3z9u4fb3u  A       1          2
                                                                        07008/21474836480

  6473 PROD_DATA2 JDBC Thin Client       read by other session          6/4044/1                0    0 42zra6dwxqwrd/gzcbp0razp0vn  A      44         46
 10136 PROD_DATA2 JDBC Thin Client       cursor: pin S wait on X        2044424941/278013233    0   25 42zra6dwxqwrd/523rw17fc75fb  A      25         41                                                                        07008/21474836480                                                                        36320/21474836480
  3669 PROD_DATA2 JDBC Thin Client       cursor: pin S wait on X        2179751475/452904301    0    3 2vncw4q0ysrjm/g6btczrdb4fa1  A       3         47
                                                                       36320/21474836480
  4854 PROD_DATA2 JDBC Thin Client       cursor: pin S wait on X        2179751475/452904301    0    3 2vncw4q0ysrjm/g6btczrdb4fa1  A       3         52
                                                                        36320/21474836480
   265 PROD_DATA2 JDBC Thin Client       cursor: pin S wait on X        2179751475/452904301    0    3 2vncw4q0ysrjm/g6btczrdb4fa1  A       3         50
                                                                        36320/21474836480

  9355 PROD_DATA2 JDBC Thin Client       cursor: pin S wait on X        2179751475/452904301    0    3 2vncw4q0ysrjm/g6btczrdb4fa1  A       3         52
 ......
 
                                                                        36320/21474836480

4764 rows selected.

主要原因archive进程无法归档。ASM无法分配新的文件标识。

ASMCMD [+arch/item/archivelog/2012_10_19] > cp thread_1_seq_3.257.797108745 aaaaa.test;         
ASMCMD-8012: can not determine file type for file
ORA-00569: Failed to acquire global enqueue. (DBD ERROR: OCIStmtExecute)

结合ASM trace log:

Wed Jun 05 07:39:52 2013
Dumping diagnostic data in directory=[cdmp_20130605073952], requested by (instance=2, osid=3061 (LMD0)), summary=[incident=412241].
Wed Jun 05 07:45:58 2013
Dumping diagnostic data in directory=[cdmp_20130605074557], requested by (instance=2, osid=3061 (LMD0)), summary=[incident=412242].
Wed Jun 05 07:52:44 2013
Dumping diagnostic data in directory=[cdmp_20130605075244], requested by (instance=2, osid=3061 (LMD0)), summary=[incident=412243].
Wed Jun 05 07:57:58 2013
Dumping diagnostic data in directory=[cdmp_20130605075758], requested by (instance=2, osid=3061 (LMD0)), summary=[incident=412244].
Wed Jun 05 08:03:59 2013
Dumping diagnostic data in directory=[cdmp_20130605080359], requested by (instance=2, osid=3061 (LMD0)), summary=[incident=412245].
Wed Jun 05 08:36:09 2013
SQL> /* ASMCMD */alter diskgroup /*ASMCMD*/ "data" add directory '+data/arch' 
SUCCESS: /* ASMCMD */alter diskgroup /*ASMCMD*/ "data" add directory '+data/arch'
Wed Jun 05 08:42:40 2013
Dumping diagnostic data in directory=[cdmp_20130605084240], requested by (instance=2, osid=3061 (LMD0)), summary=[incident=463685].
Wed Jun 05 08:47:47 2013
Dumping diagnostic data in directory=[cdmp_20130605084746], requested by (instance=2, osid=3061 (LMD0)), summary=[incident=463686].
Wed Jun 05 08:52:18 2013
NOTE: ASM client item_1:item disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /u01/grid/11.2.0/diag/asm/+asm/+ASM1/trace/+ASM1_ora_25588.trc

早上07:39 ASM1 instance 出现问题 导致arch进程无法完成归档 进而导致后续的等待。

(DESCRIPTION=(ADDRESS=(PROTOCOL=beq)(PROGRAM=/u01/oracle/11.2.0/oracle/product/db_1/bin/oracle)(ARGV0=oracleitem_1)(ENVS='ORACLE_HOME=/u01/oracle/11.2.0/oracle/product/db_1,ORACLE_SID=item_1,LD_LIBRARY_PATH=')(ARGS='(DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))'))(CONNECT_DATA=(SID=item_1)))
2013-06-05 07:39:53.587: [ora.item.db][1607948608] {1:21407:34683} [check] InstConnection::connectInt: server not attached
2013-06-05 07:39:53.717: [ USRTHRD][1607948608] {1:21407:34683} InstConnection:~InstConnection: this 1fc95100
2013-06-05 07:39:53.718: [ora.item.db][1607948608] {1:21407:34683} [check] DbAgent::isDgbHandleFO } status = 16646

Wed Jun 05 07:39:52 2013
ARC3: Error 19504 Creating archive log file to '+ARCH'
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance item_1 - Archival Error
ORA-16038: log 7 sequence# 59248 cannot be archived
ORA-19504: failed to create file ""
ORA-00312: online log 7 thread 1: '+DATA/item/onlinelog/group_7.322.797746361'

For ASM2 instance:

Wed Jun 05 07:39:51 2013
Errors in file /u01/grid/11.2.0/diag/asm/+asm/+ASM2/trace/+ASM2_lmd0_3061.trc (incident=412241):
ORA-04031: unable to allocate 3768 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","ges enqueues")
Incident details in: /u01/grid/11.2.0/diag/asm/+asm/+ASM2/incident/incdir_412241/+ASM2_lmd0_3061_i412241.trc

GES resource已经无法在shared_pool 中初始化,对于这个问题还是比较少见的 针对ASM instance的shared_pool 4031,在11.2.0.3之后
oracle的建议值为memory_target =1536MB 针对这个参数oracle也没有放出相关的最佳时间文档,相信在下一个release的时候oracle会改变
memory_target的default value .目前默认为:

SQL> show parameter target

NAME				     TYPE			       VALUE
------------------------------------ --------------------------------- ------------------------------
memory_max_target		     big integer		       1084M
memory_target			     big integer		       1084M
pga_aggregate_target		     big integer		       0
sga_target			     big integer		       0

由于ASM中的memory 同样要负责GES GCS GRD 等资源的分配,memory fragment是一个不可忽视的问题,只是对于这个case而言大家极少的关注到
ASM 端的4031问题,针对于ASM在11.2.0.3(11.2.0.3 changed some default behaviors)的表现,oracle也应该意识到了这个问题。可以参考如下
文档获取一些有用信息。1536M 应该是oracle确定的一个最小安全值 这个也是得到了研发的认可,没有设置此参数的同学需要注意。

另外针对这个RAC 即使架构为RAC ONE NODE oracle在一些resource control方面仍然是常规RAC的行为,而RAC node NODE 也仅仅是减少了DB端的
一些资源争用。

针对这个case 可以做如下工作:

1 针对11gR2 设置 ASM memory_target 到一个安全的值

2 针对长时间running的database machines 需要做定期重启工作。这次出现问题的机器连续跑了220多天。

3 监控asm log 即使ASM 在11g中的稳定性已经得到了很大的加强。