16号早上10点左右发现系统变慢,查看等待事件 发现大量i/o salve wait, load 比值增加,等待语句sql_id为 38wm92jbxu0vf
09:59:30 up 80 days, 8:00, 2 users, load average: 16.02, 4.67, 2.95
SID USERNAME MACHINE EVENT SQL ST LT LOGON_TIME
—— ———- ———————- —————————— —————————- — —— ———-
1753 USER_DATA2 JDBC Thin Client log file sync /cqy9u8yvwhftq A 0 22175
2250 USER_DATA2 JDBC Thin Client SQL*Net more data from client /3s4uqa7n9z9hf I 0 22087
2115 USER_DATA2 JDBC Thin Client SQL*Net more data from client /3s4uqa7n9z9hf I 0 22746
2096 PROD_DATA2 JDBC Thin Client SQL*Net message from client 30hsdmcf9zyhq/34vfybbn2m7bc A 0 27989
835 USER_DATA2 JDBC Thin Client direct path read 38wm92jbxu0vf/3s4uqa7n9z9hf A 0 22194
928 USER_DATA2 JDBC Thin Client i/o slave wait 38wm92jbxu0vf/3s4uqa7n9z9hf A 0 22749
947 USER_DATA2 JDBC Thin Client SQL*Net more data to client 38wm92jbxu0vf/3s4uqa7n9z9hf A 0 22682
957 USER_DATA2 JDBC Thin Client i/o slave wait 38wm92jbxu0vf/3s4uqa7n9z9hf A 0 22754
961 USER_DATA2 JDBC Thin Client i/o slave wait 38wm92jbxu0vf/3s4uqa7n9z9hf A 0 22751
972 USER_DATA2 JDBC Thin Client SQL*Net message from client 38wm92jbxu0vf/3s4uqa7n9z9hf A 0 22172
981 USER_DATA2 JDBC Thin Client i/o slave wait 38wm92jbxu0vf/3s4uqa7n9z9hf A 0 22177
985 USER_DATA2 JDBC Thin Client i/o slave wait 38wm92jbxu0vf/3s4uqa7n9z9hf A 0 22174
991 USER_DATA2 JDBC Thin Client i/o slave wait 38wm92jbxu0vf/3s4uqa7n9z9hf A 0 22172
992 USER_DATA2 JDBC Thin Client i/o slave wait 38wm92jbxu0vf/3s4uqa7n9z9hf A 0 22176
994 USER_DATA2 JDBC Thin Client i/o slave wait
OSW报告显示系统在9:58-10:00之间经历了大量的io wait,从下面的数据可以清楚反应出来
zzz ***Fri Dec 16 09:58:00 CST 2011
avg-cpu: %user %nice %sys %iowait %idle
3.05 0.00 2.62 0.77 93.57
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
dm-8 0.00 0.00 87.96 11.37 1102.34 168.56 551.17 84.28 12.79 0.11 1.07 0.67 6.66
dm-9 0.00 0.00 31.77 0.33 508.36 5.35 254.18 2.68 16.00 0.02 0.53 0.52 1.67
dm-10 0.00 0.00 190.97 13.71 2654.18 101.67 1327.09 50.84 13.46 0.13 0.62 0.48 9.80
dm-11 0.00 0.00 53.18 4.01 775.92 33.44 387.96 16.72 14.15 0.07 1.24 1.07 6.12
dm-12 0.00 0.00 102.34 1.34 1886.29 10.03 943.14 5.02 18.29 0.09 0.83 0.65 6.69
dm-13 0.00 0.00 66.89 0.33 1124.08 0.33 562.04 0.17 16.73 0.07 1.01 0.78 5.22
dm-14 0.00 0.00 1.00 1.00 1.00 1.00 0.50 0.50 1.00 0.00 1.17 1.17 0.23
dm-15 0.00 0.00 102.01 1.34 1880.94 10.03 940.47 5.02 18.30 0.09 0.83 0.65 6.72
zzz ***Fri Dec 16 09:59:30 CST 2011
avg-cpu: %user %nice %sys %iowait %idle
18.32 0.00 60.82 16.95 3.92
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
dm-8 0.00 0.00 1610.71 0.40 25771.43 3.17 12885.71 1.59 16.00 16.45 10.25 0.59 94.88
dm-9 0.00 0.00 991.27 0.79 15836.51 12.70 7918.25 6.35 15.98 9.37 9.49 0.94 93.41
dm-10 0.00 0.00 4983.73 0.79 79752.38 19.05 39876.19 9.52 16.00 112.20 23.08 0.22 108.45
dm-11 0.00 0.00 1287.70 0.40 20603.17 6.35 10301.59 3.17 16.00 11.40 8.87 0.76 98.41
dm-12 0.00 0.00 2853.17 0.00 45930.16 0.00 22965.08 0.00 16.10 37.76 13.28 0.36 103.73
dm-13 0.00 0.00 1939.68 6.75 30757.14 29.76 15378.57 14.88 15.82 18.77 9.68 0.50 96.67
dm-14 0.00 0.00 1.19 1.19 1.19 1.19 0.60 0.60 1.00 0.02 8.17 8.17 1.94
dm-15 0.00 0.00 2853.17 0.00 45930.16 0.00 22965.08 0.00 16.10 37.80 13.28 0.36 103.89
整体负载在9:59左右飙升,sys%比例增加,idle%接近于0 wait严重 cpu严重不足
CGet Phys Log LogC CPU Redo Execs HParse Parse Enq Load User% Sys% Idle% %Wait Gc_Cr_Req Gc_Cr_Sen Gc_Ccr_Req Gc_Ccr_Sen
12-16 09:57:56 376 7 0 2 921 7 1856 0 1850 11 0 2 2 96 0 82 222 59 471
12-16 09:58:06 110 4 0 0 634 466 3003 23 3002 4 1 1 1 97 0 131 180 340 646
12-16 09:58:17 99 6 0 3 608 25 1465 0 1465 9 1 2 1 97 0 79 102 56 415
12-16 09:58:27 96 5 0 1 541 131 1419 0 1414 3 1 2 1 97 0 109 101 122 705
12-16 09:58:37 148 6 0 2 799 121 5081 5 5067 6 1 2 1 96 0 159 135 138 407
12-16 09:58:47 137 3 0 1 490 83 2138 0 2137 4 1 3 1 95 0 99 191 129 309
12-16 09:58:57 367 5 0 1 825 77 1483 0 1477 12 1 1 1 97 0 131 175 99 348
12-16 09:59:07 147 18 0 1 1248 93 2617 5 2617 5 1 6 9 84 1 114 145 122 554
12-16 09:59:18 358 86 0 6 7844 193 3197 1 3153 7 3 17 43 9 31 96 75 123 681
12-16 09:59:28 412 101 0 0 12639 197 3497 2 3492 4 6 17 54 4 24 113 208 167 397
12-16 09:59:38 489 117 0 2 13986 75 3922 0 3923 4 27 17 55 6 22 115 208 96 451
12-16 09:59:48 695 116 0 1 14102 30 4477 3 4476 5 30 18 63 5 15 80 171 93 154
12-16 09:59:58 443 120 0 2 13974 31 3545 0 3541 13 36 18 60 6 16 63 63 68 455
12-16 10:00:09 2083 117 2 3 14324 1337 6947 36 6949 3 35 28 56 5 11 3628 722 1424 868
12-16 10:00:19 1135 117 28 34 13526 698 9824 9 9855 6 34 23 56 2 19 301 1008 499 675
12-16 10:00:29 1132 107 8 10 14043 1228 12309 2 12314 19 36 26 52 3 19 1238 1137 910 997
12-16 10:00:39 724 73 22 25 9066 2620 11982 5 11920 21 34 6 1 91 1 505 514 1406 1002
12-16 10:00:49 686 11 -60 1 2597 2214 26911 4 26908 9 27 10 3 86 2 412 780 669 1110
12-16 10:00:59 1118 11 0 3 3106 513 23849 6 23847 16 24 14 4 81 1 638 586 650 956
12-16 10:01:10 664 7 0 0 2039 1498 23106 3 23106 38 21 5 2 92 1 541 349 612 843
查看当时的awr报告:
Top 5 Timed Events
Event | Waits | Time(s) | Avg Wait(ms) | % Total Call Time | Wait Class |
---|---|---|---|---|---|
direct path read | 1,335,925 | 4,857 | 4 | 191.7 | User I/O |
CPU time | 2,260 | 89.2 | |||
SQL*Net more data to client | 4,567,337 | 380 | 0 | 15.0 | Network |
SQL*Net more data from client | 72,356 | 118 | 2 | 4.7 | Network |
log file sequential read | 28,784 | 61 | 2 | 2.4 | System I/O |
发现大量io等待事件
Time Model Statistics
Wait Class
Wait Events
Background Wait Events
Operating System Statistics
Service Statistics
Service Wait Class Stats
Time Model Statistics
- Total time in database user-calls (DB Time): 2533.6s
- Statistics including the word “background” measure background process
time, and so do not contribute to the DB time statistic - Ordered by % or DB time desc, Statistic name
Statistic Name | Time (s) | % of DB Time |
---|---|---|
DB CPU | 2,260.24 | 89.21 |
sql execute elapsed time | 891.84 | 35.20 |
parse time elapsed | 35.13 | 1.39 |
hard parse elapsed time | 4.79 | 0.19 |
hard parse (sharing criteria) elapsed time | 2.34 | 0.09 |
connection management call elapsed time | 1.30 | 0.05 |
sequence load elapsed time | 0.73 | 0.03 |
PL/SQL execution elapsed time | 0.36 | 0.01 |
hard parse (bind mismatch) elapsed time | 0.04 | 0.00 |
failed parse elapsed time | 0.01 | 0.00 |
repeated bind elapsed time | 0.00 | 0.00 |
DB time | 2,533.59 | |
background elapsed time | 478.83 | |
background cpu time | 120.28 |
sql execute elapsed time 所占比重极少,db_time被花费在大量的等待上。
导致system在等待什么,为何sys%突然飙升,大量的i/o slave wait 又从何而来,一切仅仅1分钟左右的事件,我们只能从event_wait着手
sys@pri> select sql_fulltext from v$sql where sql_id = ’38wm92jbxu0vf’;
no rows selected
查看38wm92jbxu0vf并未查到任何sql_text语句
SQL> @sample sql_text v$sqlarea “sql_id = ’38wm92jbxu0vf'” 1000
no rows selected
SQL> — doesn’t look like we’ll ever catch it in the shared pool
同样没有任何收获,难道在shared_pool中不存在cursor?
但是我们却从v$session中发现了线索
sys@pri> select username, sid, serial#, status, sql_id, event, seconds_in_wait
from v$session
where username like nvl(‘&username’,username)
and sid like nvl(‘&sid’,sid)
and sql_id=’38wm92jbxu0vf’
order by username, sid, serial#; 2 3 4 5 6
Enter value for username:
old 3: where username like nvl(‘&username’,username)
new 3: where username like nvl(”,username)
Enter value for sid:
old 4: and sid like nvl(‘&sid’,sid)
new 4: and sid like nvl(”,sid)
USERNAME SID SERIAL# STATUS SQL_ID EVENT SECONDS_IN_WAIT
——————– ———- ———- ——– ————- —————————— —————
USER_DATA2 1664 6 INACTIVE 38wm92jbxu0vf SQL*Net message from client 51004
USER_DATA2 1665 6 INACTIVE 38wm92jbxu0vf SQL*Net message from client 51002
USER_DATA2 1666 7 INACTIVE 38wm92jbxu0vf SQL*Net message from client 51004
USER_DATA2 1668 5 INACTIVE 38wm92jbxu0vf SQL*Net message from client 51002
USER_DATA2 1669 5 INACTIVE 38wm92jbxu0vf SQL*Net message from client 51002
USER_DATA2 1671 8 INACTIVE 38wm92jbxu0vf SQL*Net message from client 51001
USER_DATA2 1672 5 INACTIVE 38wm92jbxu0vf SQL*Net message from client 10549
USER_DATA2 1673 5 INACTIVE 38wm92jbxu0vf SQL*Net message from client 51001
USER_DATA2 1674 5 INACTIVE 38wm92jbxu0vf SQL*Net message from client 51002
USER_DATA2 1676 5 INACTIVE 38wm92jbxu0vf SQL*Net message from client 51001
根据我们的猜想,session并没有使用 shared_pool中的cursor,而是使用了cached_cursor
sys@pri> select distinct * from v$open_cursor
where rownum < 10
and sql_id ='38wm92jbxu0vf'; 2 3
SADDR SID USER_NAME ADDRESS HASH_VALUE SQL_ID SQL_TEXT
---------------- ---------- ------------------------------ ---------------- ---------- ------------- ------------------------------------------------------------
0000001054723E30 2241 USER_DATA2 000000105B6DECF0 1474102126 38wm92jbxu0vf table_4_9_4ba92_0_0_0
0000001058705460 2135 USER_DATA2 000000105B6DECF0 1474102126 38wm92jbxu0vf table_4_9_4ba92_0_0_0
00000010546A8710 1689 USER_DATA2 000000105B6DECF0 1474102126 38wm92jbxu0vf table_4_9_4ba92_0_0_0
00000010586FE908 2105 USER_DATA2 000000105B6DECF0 1474102126 38wm92jbxu0vf table_4_9_4ba92_0_0_0
00000010586E10B8 1973 USER_DATA2 000000105B6DECF0 1474102126 38wm92jbxu0vf table_4_9_4ba92_0_0_0
000000105F7094A0 2140 USER_DATA2 000000105B6DECF0 1474102126 38wm92jbxu0vf table_4_9_4ba92_0_0_0
00000010586F6838 2069 USER_DATA2 000000105B6DECF0 1474102126 38wm92jbxu0vf table_4_9_4ba92_0_0_0
000000105F6DE5A0 1948 USER_DATA2 000000105B6DECF0 1474102126 38wm92jbxu0vf table_4_9_4ba92_0_0_0
00000010546D6100 1893 USER_DATA2 000000105B6DECF0 1474102126 38wm92jbxu0vf table_4_9_4ba92_0_0_0
正如我们所见sql_text中包含了 table_4_9_4ba92_0_0_0, 根据tom所提供的信息:asktom.oracle.com
sys@pri> select owner, object_name, object_type
from dba_objects
where object_id = (select to_number(‘&hex_value’,’XXXXXX’) from dual); 2 3
Enter value for hex_value: 4ba92
old 3: where object_id = (select to_number(‘&hex_value’,’XXXXXX’) from dual)
new 3: where object_id = (select to_number(‘4ba92′,’XXXXXX’) from dual)
OWNER OBJECT_NAME OBJECT_TYPE
—————————— ——————————————————————————————————————————– ——————-
USER_DATA2 GROUPON_ACTIVITY TABLE
查到了GROUPON_ACTIVITY这张表 我们来看看GROUPON_ACTIVITY这张表的结构:
sys@pri> desc user_data2.GROUPON_ACTIVITY;
Name Type
—————————– ————–
ID NUMBER(18)
NAME VARCHAR2(900)
START_TIME DATE
END_TIME DATE
PRODUCT_CODE VARCHAR2(10)
PRICE NUMBER(11,2)
PEOPLE_LOWER NUMBER(9)
PEOPLE_UPPER NUMBER(9)
SELLING_POINT NCLOB
PROMPT NCLOB
NOTE VARCHAR2(1000)
STATUS NUMBER(9)
包含了两个NCLOB列,验证了tom的判断
那么我们如何找出引起io的sql语句呢,通过上述分析,我们得知SQL通过access一个lob列产生了一个新的sql_id,那么必然会产生一个PREV_SQL_ID 我们是否可以从这里着手呢?
sys@pri> select username, sid, serial#, status, sql_id, prev_sql_id, event, seconds_in_wait
from v$session
where username like nvl(‘&username’,username)
and sid like nvl(‘&sid’,sid)
and sql_id=’38wm92jbxu0vf’
order by username, sid, serial#; 2 3 4 5 6
Enter value for username:
old 3: where username like nvl(‘&username’,username)
new 3: where username like nvl(”,username)
Enter value for sid:
old 4: and sid like nvl(‘&sid’,sid)
new 4: and sid like nvl(”,sid)
USERNAME SID SERIAL# STATUS SQL_ID PREV_SQL_ID EVENT SECONDS_IN_WAIT
—————————— ———- ———- ——– ————- ————- —————————————————————- —————
USER_DATA2 1664 6 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49242
USER_DATA2 1665 6 INACTIVE 38wm92jbxu0vf b0d46gq8gg0t0 SQL*Net message from client 49240
USER_DATA2 1666 7 INACTIVE 38wm92jbxu0vf 8kumczaskxmmk SQL*Net message from client 49242
USER_DATA2 1668 5 INACTIVE 38wm92jbxu0vf b0d46gq8gg0t0 SQL*Net message from client 49240
USER_DATA2 1669 5 INACTIVE 38wm92jbxu0vf b0d46gq8gg0t0 SQL*Net message from client 49240
USER_DATA2 1670 5 INACTIVE 38wm92jbxu0vf 6at2bn8hzgwf3 SQL*Net message from client 339
USER_DATA2 1671 8 INACTIVE 38wm92jbxu0vf b0d46gq8gg0t0 SQL*Net message from client 49239
USER_DATA2 1672 5 INACTIVE 38wm92jbxu0vf 6at2bn8hzgwf3 SQL*Net message from client 8787
USER_DATA2 1673 5 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49239
USER_DATA2 1674 5 INACTIVE 38wm92jbxu0vf 0hbv174zz7ksa SQL*Net message from client 49240
USER_DATA2 1676 5 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49239
USER_DATA2 1677 132 INACTIVE 38wm92jbxu0vf 8kumczaskxmmk SQL*Net message from client 49242
USER_DATA2 1684 1772 INACTIVE 38wm92jbxu0vf dbgvdpt1rwvm2 SQL*Net message from client 49239
USER_DATA2 1686 109 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49240
USER_DATA2 1687 427 INACTIVE 38wm92jbxu0vf 804m3bb9zh20j SQL*Net message from client 3
USER_DATA2 1688 5791 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49239
USER_DATA2 1689 2114 INACTIVE 38wm92jbxu0vf 8kumczaskxmmk SQL*Net message from client 49240
USER_DATA2 1691 9051 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49240
USER_DATA2 1692 12031 INACTIVE 38wm92jbxu0vf 6ryuz4q7wqnd3 SQL*Net message from client 49240
USER_DATA2 1693 2288 INACTIVE 38wm92jbxu0vf 8kumczaskxmmk SQL*Net message from client 49242
USER_DATA2 1694 2087 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49239
USER_DATA2 1697 5111 INACTIVE 38wm92jbxu0vf b0d46gq8gg0t0 SQL*Net message from client 49240
USER_DATA2 1698 102 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49239
USER_DATA2 1700 560 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49242
USER_DATA2 1701 934 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49239
USER_DATA2 1702 10471 INACTIVE 38wm92jbxu0vf ar18k5pwfsjdg SQL*Net message from client 2896
USER_DATA2 1703 4347 INACTIVE 38wm92jbxu0vf 6at2bn8hzgwf3 SQL*Net message from client 49242
USER_DATA2 1705 98 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49240
USER_DATA2 1707 2957 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49239
USER_DATA2 1709 2055 INACTIVE 38wm92jbxu0vf db6qwu9sytmf7 SQL*Net message from client 49240
USER_DATA2 1710 470 INACTIVE 38wm92jbxu0vf 0hbv174zz7ksa SQL*Net message from client 49242
我们终于有了发现,通过对PREV_SQL_ID的逐一查找 我们终于抓出了这个SQL:
SELECT g.*,ga.area_id,
CASE
WHEN g.start_time > SYSDATE THEN
‘1’
WHEN g.end_time < SYSDATE THEN
'-1
'
ELSE
'0'
END AS TYPE
FROM xx g
INNER JOIN yy ga
ON g.id = ga.groupon_id
WHERE ga.area_id = :1
AND g.i d in (:2,
:3,
:4,
:5,
:6,
...)
跟进17号同一时段的system负载:
CGet Phys Log LogC CPU Redo Execs HParse Parse Enq Load User% Sys% Idle% %Wait Gc_Cr_Req Gc_Cr_Sen Gc_Ccr_Req Gc_Ccr_Sen
12-17 09:59:04 2 0 0 2 35 4 157 0 156 1 0 0 1 99 0 14 27 11 51
12-17 09:59:14 5 0 0 1 51 26 463 0 458 2 0 1 2 97 0 16 76 42 57
12-17 09:59:25 6 0 0 11 104 21 967 0 977 1 0 1 2 96 0 26 30 25 52
12-17 09:59:35 26 0 1 9 146 46 1467 1 1471 9 0 3 3 94 0 255 56 67 52
12-17 09:59:45 8 0 -1 11 126 10 1908 1 1914 1 0 2 3 95 0 14 35 16 86
12-17 09:59:55 11 0 0 2 115 91 2464 2 2465 0 0 1 3 96 0 33 111 78 53
12-17 10:00:05 1103 0 0 12 1249 289 3117 4 3125 2 0 10 3 86 0 739 336 794 332
12-17 10:00:15 1162 0 14 28 2632 1055 3414 5 3357 22 1 7 5 88 1 352 550 634 498
12-17 10:00:26 565 0 5 7 1302 319 4929 2 4936 3 1 3 4 93 0 175 264 245 239
12-17 10:00:36 376 0 9 20 622 200 4151 2 4165 10 1 3 3 94 0 175 317 187 315
12-17 10:00:46 331 0 -28 0 578 313 4337 1 4331 2 1 3 3 94 0 153 270 302 322
12-17 10:00:56 598 0 0 0 756 256 5204 8 5203 2 1 5 2 93 0 160 280 233 284
12-17 10:01:06 730 0 0 8 843 628 2955 56 2964 3 1 3 2 95 0 380 291 621 741
总结:
一张含有nclob字段表的查询语句,由于SQL在读取lob字段的时候采用direct path的方式,并且access LOB字段所产生的sql_id特殊性,迷惑了我们很久,
同时由于这个APP系统的特殊性,大量的瞬间查询导致了system iowait等待,在awr里反映出大量的direct path read等待,伴随着大量的Network等待,这些都是逐一显现出来的
同时由于大量的io请求,导致cpu的繁忙,从而导致了idle的急速下降,这些都是接二连三所出现的反映。最后通过cache这些缓存到应用端 从而解决了问题。