今天下午遇到的一个诡异CASE,现象为系统诡异的死亡,目前判定为sqluldr2工具导致 (同事使用sqluldr2 用具 执行了sqluldr2_linux64_10204.bin -help 命令)
抛出的错误如下 (/var/log/message)
Sep 11 09:00:54 db18 setroubleshoot: SELinux is preventing sqluldr2_linux6 from loading /home/oracle/product/10205/db1/lib/libnnz10.so which requires text relocation. For complete SELinux messages. run sealert -l 012919a7-837b-42af-9509-367d9d743808 Sep 11 09:07:08 db18 last message repeated 2 times Sep 11 09:08:47 db18 setroubleshoot: SELinux is preventing sqluldr2_linux6 from loading /home/oracle/product/10205/db1/lib/libnnz10.so which requires text relocation. For complete SELinux messages. run sealert -l 012919a7-837b-42af-9509-367d9d743808 Sep 11 09:09:54 db18 setroubleshoot: SELinux is preventing sqluldr2_linux6 from loading /home/oracle/product/10205/db1/lib/libnnz10.so which requires text relocation. For complete SELinux messages. run sealert -l 012919a7-837b-42af-9509-367d9d743808 Sep 11 09:10:07 db18 setroubleshoot: SELinux is preventing sqluldr2_linux6 from loading /home/oracle/product/10205/db1/lib/libnnz10.so which requires text relocation. For complete SELinux messages. run sealert -l 012919a7-837b-42af-9509-367d9d743808 Sep 11 09:10:46 db18 kernel: tnslsnr[19787]: segfault at 0000000000000000 rip 00002b92c7d5bebe rsp 00007fff0a1fea90 error 4 Sep 11 10:49:59 db18 auditd[8397]: Audit daemon rotating log files Sep 11 10:53:41 db18 setroubleshoot: SELinux is preventing sqluldr2_linux6 from loading /home/oracle/product/10205/db1/lib/libnnz10.so which requires text relocation. For complete SELinux messages. run sealert -l 012919a7-837b-42af-9509-367d9d743808 Sep 11 10:54:23 db18 setroubleshoot: SELinux is preventing sqluldr2_linux6 from loading /home/oracle/product/10205/db1/lib/libnnz10.so which requires text relocation. For complete SELinux messages. run sealert -l 012919a7-837b-42af-9509-367d9d743808 Sep 11 12:50:44 db18 setroubleshoot: SELinux is preventing sqluldr2_linux6 from loading /home/oracle/product/10205/db1/lib/libnnz10.so which requires text relocation. For complete SELinux messages. run sealert -l 012919a7-837b-42af-9509-367d9d743808 Sep 11 12:54:37 db18 setroubleshoot: SELinux is preventing sqluldr2_linux6 from loading /home/oracle/product/10205/db1/lib/libnnz10.so which requires text relocation. For complete SELinux messages. run sealert -l 012919a7-837b-42af-9509-367d9d743808 Sep 11 13:56:56 db18 nrpe[15359]: Host 10.4.0.238 is not allowed to talk to us! Sep 11 13:57:08 db18 nrpe[15425]: Host 10.4.0.238 is not allowed to talk to us! Sep 11 14:17:49 db18 setroubleshoot: SELinux is preventing sqluldr2_linux6 from loading /home/oracle/product/10205/db1/lib/libnnz10.so which requires text relocation. For complete SELinux messages. run sealert -l 012919a7-837b-42af-9509-367d9d743808
当时系统的SELinux为enable,阻止了sqluldr2对于oracle库文件的操作,多次尝试之后在09:10:46系统出现了异常,当时仅仅表现为listener的异常关闭,自此系统开始了”慢性”死忙
Last login: Tue Sep 10 14:59:41 2013 from 10.4.0.232 [oracle@db18 ~]$ lsnrctl status LSNRCTL for Linux: Version 10.2.0.5.0 - Production on 11-SEP-2013 09:18:47 Copyright (c) 1991, 2010, Oracle. All rights reserved. Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=0.0.0.0)(PORT=1522))) TNS-12541: TNS:no listener TNS-12560: TNS:protocol adapter error TNS-00511: No listener Linux Error: 111: Connection refused [oracle@db18 ~]$ [oracle@db18 ~]$ [oracle@db18 ~]$ ora active 09:18:54 up 702 days, 14:57, 2 users, load average: 3.06, 2.08, 0.93 no rows selected [oracle@db18 ~]$ lsnrctl start LSNRCTL for Linux: Version 10.2.0.5.0 - Production on 11-SEP-2013 09:19:16 Copyright (c) 1991, 2010, Oracle. All rights reserved. Starting /home/oracle/product/10205/db1/bin/tnslsnr: please wait... TNSLSNR for Linux: Version 10.2.0.5.0 - Production System parameter file is /home/oracle/product/10205/db1/network/admin/listener.ora Log messages written to /home/oracle/product/10205/db1/network/log/listener.log Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=0.0.0.0)(PORT=1522))) Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=0.0.0.0)(PORT=1522))) STATUS of the LISTENER ------------------------ Alias LISTENER Version TNSLSNR for Linux: Version 10.2.0.5.0 - Production Start Date 11-SEP-2013 09:19:16 Uptime 0 days 0 hr. 0 min. 0 sec Trace Level off Security ON: Local OS Authentication SNMP OFF Listener Parameter File /home/oracle/product/10205/db1/network/admin/listener.ora Listener Log File /home/oracle/product/10205/db1/network/log/listener.log Listening Endpoints Summary... (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=0.0.0.0)(PORT=1522))) Services Summary... Service "misc" has 1 instance(s). Instance "misc", status UNKNOWN, has 1 handler(s) for this service... The command completed successfully [oracle@db18 ~]$ [oracle@db18 ~]$ ora active 09:19:22 up 702 days, 14:57, 2 users, load average: 3.09, 2.18, 1.00
12号下午出现的问题,很多process占用大量的CPU资源,包括background process,多个进程处于running假死阶段,sys占用大量系统资源。
database alert log如下:
Errors in file /data/oracle/admin/misc/udump/misc_ora_28401.trc: ORA-27300: OS system dependent operation:invalid_process_id failed with status: 0 ORA-27301: OS failure message: Error 0 ORA-27302: failure occurred at: skgpalive1
分析问题,查看ora background process进程状态 发现如下异常
[oracle@db18 arch]$ ps -ef | grep -i ora_ oracle 24019 1 0 May18 ? 01:49:30 ora_pmon_misc oracle 24023 1 0 May18 ? 00:33:02 ora_psp0_misc oracle 24027 1 0 May18 ? 00:34:32 ora_mman_misc oracle 24031 1 0 May18 ? 00:59:23 ora_dbw0_misc oracle 24033 1 0 May18 ? 01:02:13 ora_dbw1_misc oracle 24035 1 0 May18 ? 00:58:57 ora_dbw2_misc oracle 24037 1 0 May18 ? 00:59:24 ora_dbw3_misc oracle 24039 1 0 May18 ? 00:59:06 ora_dbw4_misc oracle 24043 1 0 May18 ? 00:59:35 ora_dbw5_misc oracle 24047 1 0 May18 ? 04:28:36 ora_lgwr_misc oracle 24049 1 0 May18 ? 09:41:15 ora_ckpt_misc oracle 24053 1 0 May18 ? 00:36:44 ora_smon_misc oracle 24057 1 0 May18 ? 00:28:39 ora_reco_misc oracle 24061 1 1 May18 ? 1-06:42:09 ora_mmon_misc oracle 24065 1 0 May18 ? 22:12:24 ora_mmnl_misc oracle 24347 1 0 May18 ? 00:37:10 ora_arc0_misc oracle 24351 1 0 May18 ? 00:43:10 ora_arc1_misc oracle 24355 1 0 May18 ? 01:42:59 ora_lns1_misc oracle 24543 1 0 May18 ? 00:28:31 ora_qmnc_misc oracle 24780 1 0 May18 ? 00:10:05 ora_q000_misc oracle 24969 1 0 May18 ? 00:09:25 ora_q001_misc oracle 26077 25306 0 15:31 pts/8 00:00:00 tail -f misc_ora_21763.trc oracle 27634 22109 0 15:44 pts/6 00:00:00 grep -i ora_ oracle 27867 24023 98 Sep11 ? 1-06:03:47 ora_psp0_misc ------------刷新时间为Sep11 oracle 27868 27867 98 Sep11 ? 1-06:03:38 ora_psp0_misc ------------为衍生进程
PSP进程出现异常状态,对于PSP0进程的解释如下:
PSP stands for Process SPawner and this process has the job of creating and managing other Oracle processes. The initial set of background processes are spawned by the startup process. After the instance is started, PSP0 is responsible for spawning any required background processes.
当时系统的状态为无法checkpoint 无法 syn log ,包括alter database set 等命令也出现hang死状态
[oracle@db18 ~]$ ora active 15:57:03 up 703 days, 21:35, 12 users, load average: 24.11, 26.99, 50.27 SID USERNAME MACHINE EVENT PARAM W WT SQL ST LT LOGON_TIME ------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ---------- 3601 LOG_DATA2 JDBC Thin Client log file sync 10795/2810006158/0 0 0 /frw83qw2fvsa1 K 0 1088732 3607 LOG_DATA2 JDBC Thin Client log file sync 10795/2810006175/0 0 0 /frw83qw2fvsa1 K 0 3857 3611 LOG_DATA2 JDBC Thin Client log file sync 10795/2810006137/0 0 0 /frw83qw2fvsa1 K 0 1088741 3616 LOG_DATA2 JDBC Thin Client log file sync 10795/2810006169/0 0 0 /frw83qw2fvsa1 K 0 3895 3634 LOG_DATA2 JDBC Thin Client log file sync 10795/2810006140/0 0 0 /frw83qw2fvsa1 K 0 1088735 3635 LOG_DATA2 JDBC Thin Client log file sync 10795/2810006173/0 0 0 /frw83qw2fvsa1 K 0 3889 3646 LOG_DATA2 JDBC Thin Client log file sync 10795/2810006209/0 0 0 /frw83qw2fvsa1 A 0 1023 3650 LOG_DATA2 JDBC Thin Client log file sync 10795/2810006174/0 0 0 /frw83qw2fvsa1 K 0 3882 3661 LOG_DATA2 JDBC Thin Client log file sync 10795/2810006152/0 0 0 /frw83qw2fvsa1 K 0 5346 3664 LOG_DATA2 JDBC Thin Client log file sync 10795/2810006138/0 0 0 /frw83qw2fvsa1 K 0 877657 3668 LOG_DATA2 JDBC Thin Client log file sync 10795/2810006200/0 0 0 /frw83qw2fvsa1 A 0 1027 3671 LOG_DATA2 JDBC Thin Client log file sync 10795/2810006198/0 0 0 /frw83qw2fvsa1 A 0 1027 3676 LOG_DATA2 JDBC Thin Client log file sync 10795/2810006184/0 0 0 /6p07vqnwqrsyw A 0 1032 3691 LOG_DATA2 JDBC Thin Client log file sync 10795/2810006210/0 0 0 /frw83qw2fvsa1 A 0 1011 3696 LOG_DATA2 JDBC Thin Client log file sync 10795/2810006141/0 0 0 /frw83qw2fvsa1 K 0 5346 3698 SYS sqlplus@db18 reliable message 70088494424/70158516 0 0 / A 0 390
表现为ckpt lgwr dbwn进程全部死亡 .. oradebug dump systemstat 速度极其缓慢 oradebug dump errorstack hang —- sys 占用大量资源
awk -f ass109.awk systemdump.txt > systemdump.log —并没有发现任何异常等待以及blocker信息
[oracle@db18 ~]$ tail -f -n 50 systemdump.log Cmd: Select 296:waiting for 'SQL*Net message from client' Cmd: Select 297:waiting for 'SQL*Net message from client' Cmd: Select 298:waiting for 'SQL*Net message from client' Cmd: PL/SQL Execute 299:waiting for 'SQL*Net message from client' Cmd: PL/SQL Execute 300:waiting for 'SQL*Net message from client' Cmd: PL/SQL Execute 302:waiting for 'SQL*Net message from client' Cmd: Insert 444:waiting for 'SQL*Net message from client' Cmd: Select 532:waiting for 'SQL*Net message from client' Cmd: Select 536:waiting for 'SQL*Net message from client' Cmd: Select 538:waiting for 'SQL*Net message from client' Cmd: Select NO BLOCKING PROCESSES FOUND System State 2 ~~~~~~~~~~~~~~~~ 1: 2: waiting for 'pmon timer' 3: waiting for 'rdbms ipc message' 4: last wait for 'rdbms ipc message' 5: waiting for 'rdbms ipc message' 6: waiting for 'rdbms ipc message' 7: waiting for 'rdbms ipc message' 8: waiting for 'rdbms ipc message' 9: waiting for 'rdbms ipc message' 10: waiting for 'rdbms ipc message' 11: waiting for 'rdbms ipc message' 12: waiting for 'rdbms ipc message' 13: waiting for 'rdbms ipc message' 14: waiting for 'smon timer' 15: last wait for 'os thread startup' 16: last wait for 'rdbms ipc message' 17: waiting for 'SQL*Net message from client' Cmd: Select 18: NO BLOCKING PROCESSES FOUND 171760 Lines Processed.
其现象为系统的一个”缓慢”的死亡行为 (通过SAR 定位到一个折点时间)
sar -f /var/log/sa/sar11 ---- sep 11 09:00:01 AM all 0.01 0.00 0.01 0.00 0.00 99.97 09:10:01 AM all 0.03 0.00 0.02 0.00 0.00 99.95 -----> (09:10:46 db18 kernel: tnslsnr[19787]: segfault at 0000000000000000 rip 00002b92c7d5bebe rsp 00007fff0a1fea90 error 4) 09:20:01 AM all 2.18 0.00 2.16 0.00 0.00 95.66 09:30:01 AM all 3.15 0.00 3.13 0.00 0.00 93.71 09:40:01 AM all 3.15 0.00 3.13 0.00 0.00 93.72 09:50:01 AM all 3.15 0.00 3.13 0.00 0.00 93.72 03:00:01 PM all 3.99 0.00 4.39 0.00 0.00 91.62 03:10:01 PM all 4.04 0.00 4.32 0.00 0.00 91.64 03:20:01 PM all 4.47 0.00 4.87 0.00 0.00 90.66 03:30:01 PM all 4.98 0.00 5.47 0.00 0.00 89.55 03:40:01 PM all 4.99 0.00 5.45 0.00 0.00 89.56 03:50:01 PM all 4.99 0.00 5.45 0.00 0.00 89.56 04:00:01 PM all 4.99 0.00 5.45 0.00 0.00 89.56 04:10:01 PM all 4.94 0.00 5.50 0.00 0.00 89.56 04:20:01 PM all 4.82 0.00 5.62 0.00 0.00 89.56 10:20:01 PM all 10.72 0.00 12.22 0.00 0.00 77.06 10:30:01 PM all 10.73 0.00 12.21 0.00 0.00 77.06 10:40:01 PM all 10.69 0.00 12.25 0.00 0.00 77.06 10:50:01 PM all 10.67 0.00 12.27 0.00 0.00 77.06 11:00:01 PM all 10.52 0.00 12.41 0.00 0.00 77.06 11:10:01 PM all 10.42 0.00 12.52 0.00 0.00 77.06 11:20:01 PM all 10.38 0.00 12.56 0.00 0.00 77.06 sar -f /var/log/sa/sar12 ---- sep 12 12:00:01 AM CPU %user %nice %system %iowait %steal %idle 12:10:01 AM all 11.23 0.00 11.72 0.10 0.00 76.95 12:20:02 AM all 11.29 0.00 11.65 0.10 0.00 76.96 12:30:01 AM all 11.30 0.00 11.63 0.09 0.00 76.96 12:40:01 AM all 11.71 0.00 12.36 0.10 0.00 75.83 12:50:01 AM all 12.14 0.00 12.88 0.10 0.00 74.87 01:00:01 AM all 12.08 0.00 12.94 0.10 0.00 74.88 01:10:01 AM all 12.24 0.00 12.79 0.10 0.00 74.88 01:20:01 AM all 12.25 0.00 12.77 0.10 0.00 74.88 10:00:01 AM all 12.08 0.00 37.94 0.10 0.00 49.88 10:10:01 AM all 12.07 0.00 37.95 0.09 0.00 49.88 10:20:01 AM all 12.09 0.00 37.94 0.11 0.00 49.87 10:30:01 AM all 12.10 0.00 37.93 0.10 0.00 49.88 10:40:01 AM all 12.02 0.00 38.00 0.11 0.00 49.87 10:50:01 AM all 12.06 0.00 37.96 0.10 0.00 49.88 11:00:01 AM all 12.37 0.00 37.65 0.10 0.00 49.88 11:10:01 AM all 12.32 0.00 37.70 0.10 0.00 49.88 11:20:01 AM all 12.50 0.00 37.52 0.10 0.00 49.88 11:30:01 AM all 12.59 0.00 37.43 0.10 0.00 49.88 11:40:01 AM all 12.77 0.00 37.25 0.11 0.00 49.87 01:50:01 PM all 13.41 0.00 40.77 0.10 0.00 45.71 02:00:01 PM all 13.41 0.00 40.78 0.11 0.00 45.71 02:10:01 PM all 13.41 0.00 40.78 0.10 0.00 45.71 02:20:01 PM all 13.17 0.00 63.22 0.05 0.00 23.55 02:30:02 PM all 13.21 0.00 86.78 0.00 0.00 0.00 02:40:01 PM all 13.29 0.00 86.71 0.00 0.00 0.00 02:50:01 PM all 13.42 0.00 86.58 0.00 0.00 0.00 03:00:01 PM all 13.58 0.00 86.42 0.00 0.00 0.00 03:10:01 PM all 13.38 0.00 86.62 0.00 0.00 0.00 03:20:01 PM all 13.25 0.00 86.75 0.00 0.00 0.00 03:30:01 PM all 13.19 0.00 86.80 0.00 0.00 0.00 03:40:01 PM all 12.95 0.00 86.25 0.00 0.00 0.79
假想:sqluldr2 与 Selinux的冲突导致了oracle so文件的异常,表现为segfault–listener的异常关闭,进而系统缓慢的”死亡” (建议不要在生产系统直接使用sqluldr2而使用TNS远程连接)