16号早上10点左右发现系统变慢,查看等待事件 发现大量i/o salve wait, load 比值增加,等待语句sql_id为 38wm92jbxu0vf


09:59:30 up 80 days, 8:00, 2 users, load average: 16.02, 4.67, 2.95

SID USERNAME MACHINE EVENT SQL ST LT LOGON_TIME
—— ———- ———————- —————————— —————————- — —— ———-
1753 USER_DATA2 JDBC Thin Client log file sync /cqy9u8yvwhftq A 0 22175
2250 USER_DATA2 JDBC Thin Client SQL*Net more data from client /3s4uqa7n9z9hf I 0 22087
2115 USER_DATA2 JDBC Thin Client SQL*Net more data from client /3s4uqa7n9z9hf I 0 22746
2096 PROD_DATA2 JDBC Thin Client SQL*Net message from client 30hsdmcf9zyhq/34vfybbn2m7bc A 0 27989
835 USER_DATA2 JDBC Thin Client direct path read 38wm92jbxu0vf/3s4uqa7n9z9hf A 0 22194
928 USER_DATA2 JDBC Thin Client i/o slave wait 38wm92jbxu0vf/3s4uqa7n9z9hf A 0 22749
947 USER_DATA2 JDBC Thin Client SQL*Net more data to client 38wm92jbxu0vf/3s4uqa7n9z9hf A 0 22682
957 USER_DATA2 JDBC Thin Client i/o slave wait 38wm92jbxu0vf/3s4uqa7n9z9hf A 0 22754
961 USER_DATA2 JDBC Thin Client i/o slave wait 38wm92jbxu0vf/3s4uqa7n9z9hf A 0 22751
972 USER_DATA2 JDBC Thin Client SQL*Net message from client 38wm92jbxu0vf/3s4uqa7n9z9hf A 0 22172
981 USER_DATA2 JDBC Thin Client i/o slave wait 38wm92jbxu0vf/3s4uqa7n9z9hf A 0 22177
985 USER_DATA2 JDBC Thin Client i/o slave wait 38wm92jbxu0vf/3s4uqa7n9z9hf A 0 22174
991 USER_DATA2 JDBC Thin Client i/o slave wait 38wm92jbxu0vf/3s4uqa7n9z9hf A 0 22172
992 USER_DATA2 JDBC Thin Client i/o slave wait 38wm92jbxu0vf/3s4uqa7n9z9hf A 0 22176
994 USER_DATA2 JDBC Thin Client i/o slave wait

OSW报告显示系统在9:58-10:00之间经历了大量的io wait,从下面的数据可以清楚反应出来

zzz ***Fri Dec 16 09:58:00 CST 2011
avg-cpu: %user %nice %sys %iowait %idle
3.05 0.00 2.62 0.77 93.57
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
dm-8 0.00 0.00 87.96 11.37 1102.34 168.56 551.17 84.28 12.79 0.11 1.07 0.67 6.66
dm-9 0.00 0.00 31.77 0.33 508.36 5.35 254.18 2.68 16.00 0.02 0.53 0.52 1.67
dm-10 0.00 0.00 190.97 13.71 2654.18 101.67 1327.09 50.84 13.46 0.13 0.62 0.48 9.80
dm-11 0.00 0.00 53.18 4.01 775.92 33.44 387.96 16.72 14.15 0.07 1.24 1.07 6.12
dm-12 0.00 0.00 102.34 1.34 1886.29 10.03 943.14 5.02 18.29 0.09 0.83 0.65 6.69
dm-13 0.00 0.00 66.89 0.33 1124.08 0.33 562.04 0.17 16.73 0.07 1.01 0.78 5.22
dm-14 0.00 0.00 1.00 1.00 1.00 1.00 0.50 0.50 1.00 0.00 1.17 1.17 0.23
dm-15 0.00 0.00 102.01 1.34 1880.94 10.03 940.47 5.02 18.30 0.09 0.83 0.65 6.72

zzz ***Fri Dec 16 09:59:30 CST 2011
avg-cpu: %user %nice %sys %iowait %idle
18.32 0.00 60.82 16.95 3.92

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util

dm-8 0.00 0.00 1610.71 0.40 25771.43 3.17 12885.71 1.59 16.00 16.45 10.25 0.59 94.88
dm-9 0.00 0.00 991.27 0.79 15836.51 12.70 7918.25 6.35 15.98 9.37 9.49 0.94 93.41
dm-10 0.00 0.00 4983.73 0.79 79752.38 19.05 39876.19 9.52 16.00 112.20 23.08 0.22 108.45
dm-11 0.00 0.00 1287.70 0.40 20603.17 6.35 10301.59 3.17 16.00 11.40 8.87 0.76 98.41
dm-12 0.00 0.00 2853.17 0.00 45930.16 0.00 22965.08 0.00 16.10 37.76 13.28 0.36 103.73
dm-13 0.00 0.00 1939.68 6.75 30757.14 29.76 15378.57 14.88 15.82 18.77 9.68 0.50 96.67
dm-14 0.00 0.00 1.19 1.19 1.19 1.19 0.60 0.60 1.00 0.02 8.17 8.17 1.94
dm-15 0.00 0.00 2853.17 0.00 45930.16 0.00 22965.08 0.00 16.10 37.80 13.28 0.36 103.89

整体负载在9:59左右飙升,sys%比例增加,idle%接近于0 wait严重 cpu严重不足

CGet Phys Log LogC CPU Redo Execs HParse Parse Enq Load User% Sys% Idle% %Wait Gc_Cr_Req Gc_Cr_Sen Gc_Ccr_Req Gc_Ccr_Sen
12-16 09:57:56 376 7 0 2 921 7 1856 0 1850 11 0 2 2 96 0 82 222 59 471
12-16 09:58:06 110 4 0 0 634 466 3003 23 3002 4 1 1 1 97 0 131 180 340 646
12-16 09:58:17 99 6 0 3 608 25 1465 0 1465 9 1 2 1 97 0 79 102 56 415
12-16 09:58:27 96 5 0 1 541 131 1419 0 1414 3 1 2 1 97 0 109 101 122 705
12-16 09:58:37 148 6 0 2 799 121 5081 5 5067 6 1 2 1 96 0 159 135 138 407
12-16 09:58:47 137 3 0 1 490 83 2138 0 2137 4 1 3 1 95 0 99 191 129 309
12-16 09:58:57 367 5 0 1 825 77 1483 0 1477 12 1 1 1 97 0 131 175 99 348
12-16 09:59:07 147 18 0 1 1248 93 2617 5 2617 5 1 6 9 84 1 114 145 122 554
12-16 09:59:18 358 86 0 6 7844 193 3197 1 3153 7 3 17 43 9 31 96 75 123 681
12-16 09:59:28 412 101 0 0 12639 197 3497 2 3492 4 6 17 54 4 24 113 208 167 397
12-16 09:59:38 489 117 0 2 13986 75 3922 0 3923 4 27 17 55 6 22 115 208 96 451
12-16 09:59:48 695 116 0 1 14102 30 4477 3 4476 5 30 18 63 5 15 80 171 93 154
12-16 09:59:58 443 120 0 2 13974 31 3545 0 3541 13 36 18 60 6 16 63 63 68 455
12-16 10:00:09 2083 117 2 3 14324 1337 6947 36 6949 3 35 28 56 5 11 3628 722 1424 868
12-16 10:00:19 1135 117 28 34 13526 698 9824 9 9855 6 34 23 56 2 19 301 1008 499 675
12-16 10:00:29 1132 107 8 10 14043 1228 12309 2 12314 19 36 26 52 3 19 1238 1137 910 997
12-16 10:00:39 724 73 22 25 9066 2620 11982 5 11920 21 34 6 1 91 1 505 514 1406 1002
12-16 10:00:49 686 11 -60 1 2597 2214 26911 4 26908 9 27 10 3 86 2 412 780 669 1110
12-16 10:00:59 1118 11 0 3 3106 513 23849 6 23847 16 24 14 4 81 1 638 586 650 956
12-16 10:01:10 664 7 0 0 2039 1498 23106 3 23106 38 21 5 2 92 1 541 349 612 843

查看当时的awr报告:

Top 5 Timed Events

Event Waits Time(s) Avg Wait(ms) % Total Call Time Wait Class
direct path read 1,335,925 4,857 4 191.7 User I/O
CPU time   2,260   89.2  
SQL*Net more data to client 4,567,337 380 0 15.0 Network
SQL*Net more data from client 72,356 118 2 4.7 Network
log file sequential read 28,784 61 2 2.4 System I/O

发现大量io等待事件

Time Model Statistics


  • Wait Class


  • Wait Events


  • Background Wait Events


  • Operating System Statistics


  • Service Statistics


  • Service Wait Class Stats

    Back to Top

    Time Model Statistics

    • Total time in database user-calls (DB Time): 2533.6s
    • Statistics including the word “background” measure background process
      time, and so do not contribute to the DB time statistic

    • Ordered by % or DB time desc, Statistic name

    Statistic Name Time (s) % of DB Time
    DB CPU 2,260.24 89.21
    sql execute elapsed time 891.84 35.20
    parse time elapsed 35.13 1.39
    hard parse elapsed time 4.79 0.19
    hard parse (sharing criteria) elapsed time 2.34 0.09
    connection management call elapsed time 1.30 0.05
    sequence load elapsed time 0.73 0.03
    PL/SQL execution elapsed time 0.36 0.01
    hard parse (bind mismatch) elapsed time 0.04 0.00
    failed parse elapsed time 0.01 0.00
    repeated bind elapsed time 0.00 0.00
    DB time 2,533.59  
    background elapsed time 478.83  
    background cpu time 120.28  

    sql execute elapsed time 所占比重极少,db_time被花费在大量的等待上。

    导致system在等待什么,为何sys%突然飙升,大量的i/o slave wait 又从何而来,一切仅仅1分钟左右的事件,我们只能从event_wait着手

    sys@pri> select sql_fulltext from v$sql where sql_id = ’38wm92jbxu0vf’;

    no rows selected

    查看38wm92jbxu0vf并未查到任何sql_text语句

    SQL> @sample sql_text v$sqlarea “sql_id = ’38wm92jbxu0vf'” 1000

    no rows selected

    SQL> — doesn’t look like we’ll ever catch it in the shared pool

    同样没有任何收获,难道在shared_pool中不存在cursor?

    但是我们却从v$session中发现了线索

    sys@pri> select username, sid, serial#, status, sql_id, event, seconds_in_wait
    from v$session
    where username like nvl(‘&username’,username)
    and sid like nvl(‘&sid’,sid)
    and sql_id=’38wm92jbxu0vf’
    order by username, sid, serial#; 2 3 4 5 6
    Enter value for username:
    old 3: where username like nvl(‘&username’,username)
    new 3: where username like nvl(”,username)
    Enter value for sid:
    old 4: and sid like nvl(‘&sid’,sid)
    new 4: and sid like nvl(”,sid)

    USERNAME SID SERIAL# STATUS SQL_ID EVENT SECONDS_IN_WAIT
    ——————– ———- ———- ——– ————- —————————— —————
    USER_DATA2 1664 6 INACTIVE 38wm92jbxu0vf SQL*Net message from client 51004
    USER_DATA2 1665 6 INACTIVE 38wm92jbxu0vf SQL*Net message from client 51002
    USER_DATA2 1666 7 INACTIVE 38wm92jbxu0vf SQL*Net message from client 51004
    USER_DATA2 1668 5 INACTIVE 38wm92jbxu0vf SQL*Net message from client 51002
    USER_DATA2 1669 5 INACTIVE 38wm92jbxu0vf SQL*Net message from client 51002
    USER_DATA2 1671 8 INACTIVE 38wm92jbxu0vf SQL*Net message from client 51001
    USER_DATA2 1672 5 INACTIVE 38wm92jbxu0vf SQL*Net message from client 10549
    USER_DATA2 1673 5 INACTIVE 38wm92jbxu0vf SQL*Net message from client 51001
    USER_DATA2 1674 5 INACTIVE 38wm92jbxu0vf SQL*Net message from client 51002
    USER_DATA2 1676 5 INACTIVE 38wm92jbxu0vf SQL*Net message from client 51001

    根据我们的猜想,session并没有使用 shared_pool中的cursor,而是使用了cached_cursor


    sys@pri> select distinct * from v$open_cursor
    where rownum < 10 and sql_id ='38wm92jbxu0vf'; 2 3 SADDR SID USER_NAME ADDRESS HASH_VALUE SQL_ID SQL_TEXT ---------------- ---------- ------------------------------ ---------------- ---------- ------------- ------------------------------------------------------------ 0000001054723E30 2241 USER_DATA2 000000105B6DECF0 1474102126 38wm92jbxu0vf table_4_9_4ba92_0_0_0 0000001058705460 2135 USER_DATA2 000000105B6DECF0 1474102126 38wm92jbxu0vf table_4_9_4ba92_0_0_0 00000010546A8710 1689 USER_DATA2 000000105B6DECF0 1474102126 38wm92jbxu0vf table_4_9_4ba92_0_0_0 00000010586FE908 2105 USER_DATA2 000000105B6DECF0 1474102126 38wm92jbxu0vf table_4_9_4ba92_0_0_0 00000010586E10B8 1973 USER_DATA2 000000105B6DECF0 1474102126 38wm92jbxu0vf table_4_9_4ba92_0_0_0 000000105F7094A0 2140 USER_DATA2 000000105B6DECF0 1474102126 38wm92jbxu0vf table_4_9_4ba92_0_0_0 00000010586F6838 2069 USER_DATA2 000000105B6DECF0 1474102126 38wm92jbxu0vf table_4_9_4ba92_0_0_0 000000105F6DE5A0 1948 USER_DATA2 000000105B6DECF0 1474102126 38wm92jbxu0vf table_4_9_4ba92_0_0_0 00000010546D6100 1893 USER_DATA2 000000105B6DECF0 1474102126 38wm92jbxu0vf table_4_9_4ba92_0_0_0

    正如我们所见sql_text中包含了 table_4_9_4ba92_0_0_0, 根据tom所提供的信息:asktom.oracle.com

    sys@pri> select owner, object_name, object_type
    from dba_objects
    where object_id = (select to_number(‘&hex_value’,’XXXXXX’) from dual); 2 3
    Enter value for hex_value: 4ba92
    old 3: where object_id = (select to_number(‘&hex_value’,’XXXXXX’) from dual)
    new 3: where object_id = (select to_number(‘4ba92′,’XXXXXX’) from dual)

    OWNER OBJECT_NAME OBJECT_TYPE
    —————————— ——————————————————————————————————————————– ——————-
    USER_DATA2 GROUPON_ACTIVITY TABLE

    查到了GROUPON_ACTIVITY这张表 我们来看看GROUPON_ACTIVITY这张表的结构:

    sys@pri> desc user_data2.GROUPON_ACTIVITY;


    Name Type
    —————————– ————–
    ID NUMBER(18)
    NAME VARCHAR2(900)
    START_TIME DATE
    END_TIME DATE
    PRODUCT_CODE VARCHAR2(10)
    PRICE NUMBER(11,2)
    PEOPLE_LOWER NUMBER(9)
    PEOPLE_UPPER NUMBER(9)
    SELLING_POINT NCLOB
    PROMPT NCLOB
    NOTE VARCHAR2(1000)
    STATUS NUMBER(9)

    包含了两个NCLOB列,验证了tom的判断

    那么我们如何找出引起io的sql语句呢,通过上述分析,我们得知SQL通过access一个lob列产生了一个新的sql_id,那么必然会产生一个PREV_SQL_ID 我们是否可以从这里着手呢?

    sys@pri> select username, sid, serial#, status, sql_id, prev_sql_id, event, seconds_in_wait
    from v$session
    where username like nvl(‘&username’,username)
    and sid like nvl(‘&sid’,sid)
    and sql_id=’38wm92jbxu0vf’
    order by username, sid, serial#; 2 3 4 5 6
    Enter value for username:
    old 3: where username like nvl(‘&username’,username)
    new 3: where username like nvl(”,username)
    Enter value for sid:
    old 4: and sid like nvl(‘&sid’,sid)
    new 4: and sid like nvl(”,sid)

    USERNAME SID SERIAL# STATUS SQL_ID PREV_SQL_ID EVENT SECONDS_IN_WAIT
    —————————— ———- ———- ——– ————- ————- —————————————————————- —————
    USER_DATA2 1664 6 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49242
    USER_DATA2 1665 6 INACTIVE 38wm92jbxu0vf b0d46gq8gg0t0 SQL*Net message from client 49240
    USER_DATA2 1666 7 INACTIVE 38wm92jbxu0vf 8kumczaskxmmk SQL*Net message from client 49242
    USER_DATA2 1668 5 INACTIVE 38wm92jbxu0vf b0d46gq8gg0t0 SQL*Net message from client 49240
    USER_DATA2 1669 5 INACTIVE 38wm92jbxu0vf b0d46gq8gg0t0 SQL*Net message from client 49240
    USER_DATA2 1670 5 INACTIVE 38wm92jbxu0vf 6at2bn8hzgwf3 SQL*Net message from client 339
    USER_DATA2 1671 8 INACTIVE 38wm92jbxu0vf b0d46gq8gg0t0 SQL*Net message from client 49239
    USER_DATA2 1672 5 INACTIVE 38wm92jbxu0vf 6at2bn8hzgwf3 SQL*Net message from client 8787
    USER_DATA2 1673 5 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49239
    USER_DATA2 1674 5 INACTIVE 38wm92jbxu0vf 0hbv174zz7ksa SQL*Net message from client 49240
    USER_DATA2 1676 5 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49239
    USER_DATA2 1677 132 INACTIVE 38wm92jbxu0vf 8kumczaskxmmk SQL*Net message from client 49242
    USER_DATA2 1684 1772 INACTIVE 38wm92jbxu0vf dbgvdpt1rwvm2 SQL*Net message from client 49239
    USER_DATA2 1686 109 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49240
    USER_DATA2 1687 427 INACTIVE 38wm92jbxu0vf 804m3bb9zh20j SQL*Net message from client 3
    USER_DATA2 1688 5791 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49239
    USER_DATA2 1689 2114 INACTIVE 38wm92jbxu0vf 8kumczaskxmmk SQL*Net message from client 49240
    USER_DATA2 1691 9051 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49240
    USER_DATA2 1692 12031 INACTIVE 38wm92jbxu0vf 6ryuz4q7wqnd3 SQL*Net message from client 49240
    USER_DATA2 1693 2288 INACTIVE 38wm92jbxu0vf 8kumczaskxmmk SQL*Net message from client 49242
    USER_DATA2 1694 2087 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49239
    USER_DATA2 1697 5111 INACTIVE 38wm92jbxu0vf b0d46gq8gg0t0 SQL*Net message from client 49240
    USER_DATA2 1698 102 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49239
    USER_DATA2 1700 560 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49242
    USER_DATA2 1701 934 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49239
    USER_DATA2 1702 10471 INACTIVE 38wm92jbxu0vf ar18k5pwfsjdg SQL*Net message from client 2896
    USER_DATA2 1703 4347 INACTIVE 38wm92jbxu0vf 6at2bn8hzgwf3 SQL*Net message from client 49242
    USER_DATA2 1705 98 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49240
    USER_DATA2 1707 2957 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49239
    USER_DATA2 1709 2055 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49240
    USER_DATA2 1710 470 INACTIVE 38wm92jbxu0vf 0hbv174zz7ksa SQL*Net message from client 49242

    我们终于有了发现,通过对PREV_SQL_ID的逐一查找 我们终于抓出了这个SQL:

    SELECT g.*,ga.area_id,
    CASE
    WHEN g.start_time > SYSDATE THEN
    ‘1’
    WHEN g.end_time < SYSDATE THEN '-1 ' ELSE '0' END AS TYPE FROM xx g INNER JOIN yy ga ON g.id = ga.groupon_id WHERE ga.area_id = :1 AND g.i d in (:2, :3, :4, :5, :6, ...)

    跟进17号同一时段的system负载:

    CGet Phys Log LogC CPU Redo Execs HParse Parse Enq Load User% Sys% Idle% %Wait Gc_Cr_Req Gc_Cr_Sen Gc_Ccr_Req Gc_Ccr_Sen

    12-17 09:59:04 2 0 0 2 35 4 157 0 156 1 0 0 1 99 0 14 27 11 51
    12-17 09:59:14 5 0 0 1 51 26 463 0 458 2 0 1 2 97 0 16 76 42 57
    12-17 09:59:25 6 0 0 11 104 21 967 0 977 1 0 1 2 96 0 26 30 25 52
    12-17 09:59:35 26 0 1 9 146 46 1467 1 1471 9 0 3 3 94 0 255 56 67 52
    12-17 09:59:45 8 0 -1 11 126 10 1908 1 1914 1 0 2 3 95 0 14 35 16 86
    12-17 09:59:55 11 0 0 2 115 91 2464 2 2465 0 0 1 3 96 0 33 111 78 53
    12-17 10:00:05 1103 0 0 12 1249 289 3117 4 3125 2 0 10 3 86 0 739 336 794 332
    12-17 10:00:15 1162 0 14 28 2632 1055 3414 5 3357 22 1 7 5 88 1 352 550 634 498
    12-17 10:00:26 565 0 5 7 1302 319 4929 2 4936 3 1 3 4 93 0 175 264 245 239
    12-17 10:00:36 376 0 9 20 622 200 4151 2 4165 10 1 3 3 94 0 175 317 187 315
    12-17 10:00:46 331 0 -28 0 578 313 4337 1 4331 2 1 3 3 94 0 153 270 302 322
    12-17 10:00:56 598 0 0 0 756 256 5204 8 5203 2 1 5 2 93 0 160 280 233 284
    12-17 10:01:06 730 0 0 8 843 628 2955 56 2964 3 1 3 2 95 0 380 291 621 741

    总结

    一张含有nclob字段表的查询语句,由于SQL在读取lob字段的时候采用direct path的方式,并且access LOB字段所产生的sql_id特殊性,迷惑了我们很久,
    同时由于这个APP系统的特殊性,大量的瞬间查询导致了system iowait等待,在awr里反映出大量的direct path read等待,伴随着大量的Network等待,这些都是逐一显现出来的
    同时由于大量的io请求,导致cpu的繁忙,从而导致了idle的急速下降,这些都是接二连三所出现的反映。最后通过cache这些缓存到应用端 从而解决了问题。