Got a message from a user of a not-much-used dev database that their SQL Developer connection was timing out. Here’s a walkthrough of how I troubleshot (troubleshooted?) the problem and traced the root cause to a kernel issue.
I could connect as sys:
[oracle@MYDB ~]$ sqlplus "/ as sysdba"
Connected.
but connecting as scott just hung, so I checked session waits:
SET LINES 130
col username for a5
col machine for a10
col event for a25
col sid for 99999
col ser for 99999
col blkr for 99999
col fblkr for 99999
SELECT s.sid, s.username, s.machine, p.pname, w.event, w.p1, w.p2, w.p3,
       w.seconds_in_wait secs, s.blocking_session blkr, s.final_blocking_session fblkr
  FROM v$session_wait w, v$session s, v$process p
 WHERE s.sid = w.sid
   AND s.paddr = p.addr
   AND w.event <> 'SQL*Net message from client'
   AND w.event <> 'rdbms ipc message'
   --and s.username is not null
 ORDER BY s.sid;

   SID USERN MACHINE                     PNAME EVENT                                      P1         P2         P3       SECS   BLKR  FBLKR
------ ----- --------------------------- ----- ---------------------------------- ---------- ---------- ---------- ---------- ------ ------
     4       MYDB                        SMON  buffer busy waits                           4        144         19     309172    125    121
    15 SYS   MYDB                              enq: WL - contention               1464598532 2803362147  867667795     287114    243    243
    16 SYS   MYDB                              enq: WL - contention               1464598532 2803362147  867667795     200715    243    243
    17       MYDB                        J009  buffer busy waits                           8      14403          1     308745    252    121
    19       MYDB                        J001  log file switch (archiving needed)          0          0          0        125    121    121
    20       NROSHAK-CA                        enq: SQ - contention               1397817350        444          0     104863    259    121
    22       Julians-MacBook-Pro-2.local       enq: SQ - contention               1397817350        444          0     169427    259    121
   243       MYDB                        ARC0  log file sequential read                  436          0          0     311285
I see lots of sessions stuck waiting, all ultimately blocked by SID 121 (LGWR):
   SID USERN MACHINE                     PNAME EVENT                                      P1         P2         P3       SECS   BLKR  FBLKR
------ ----- --------------------------- ----- ---------------------------------- ---------- ---------- ---------- ---------- ------ ------
   259       Julians-MacBook-Pro-2.local       buffer busy waits                           8       1081          1     169800    383    121
   383       MYDB                        J015  log file switch (archiving needed)          0          0          0       4121    121    121
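As a quick sanity check (this query is my own addition, not part of the original notes), you can count how many sessions ultimately roll up to each final blocker:

SQL> SELECT final_blocking_session, COUNT(*)
  2    FROM v$session
  3   WHERE final_blocking_session IS NOT NULL
  4   GROUP BY final_blocking_session
  5   ORDER BY 2 DESC;

Here almost everything pointed back to SID 121.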
but LGWR itself is on its idle wait (rdbms ipc message) with 0 seconds in wait, so it seems to be progressing…
   SID USERN MACHINE    PNAME EVENT                              P1         P2         P3       SECS   BLKR  FBLKR
------ ----- ---------- ----- ------------------------- ---------- ---------- ---------- ---------- ------ ------
   121       MYDB       LGWR  rdbms ipc message                119          0          0          0
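To double-check that LGWR is actually cycling through its idle wait rather than sitting in one long wait, its last ten waits are visible in v$session_wait_history; a sketch (my addition, wait_time is in centiseconds):

SQL> SELECT seq#, event, wait_time
  2    FROM v$session_wait_history
  3   WHERE sid = 121
  4   ORDER BY seq#;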
Let’s dump LGWR’s processstate to check what it’s been up to recently:
SQL> SELECT pid, spid FROM v$process WHERE pname='LGWR'
  2  /

       PID SPID
---------- ------------------------
        13 19518

SQL> oradebug setospid 19518
Oracle pid: 13, Unix process pid: 19518, image: oracle@MYDB (LGWR)
SQL> oradebug dump processstate 258
Statement processed.
in my session trace “MYDB_ora_5078.trc” :
*** 2015-02-17 09:10:22.730
Processing Oradebug command 'setospid 19518'

*** 2015-02-17 09:10:22.731
Oradebug command 'setospid 19518' console output:
Oracle pid: 13, Unix process pid: 19518, image: oracle@MYDB (LGWR)

*** 2015-02-17 09:10:40.863
Processing Oradebug command 'dump processstate 258'

*** 2015-02-17 09:10:40.863
Issuing ORADEBUG command (#13) 'dump processstate 258' to process 'Unix process pid: 19518, image: oracle@MYDB (LGWR)'

*** 2015-02-17 09:10:40.878
Oradebug command 'dump processstate 258' console output:
<none>
in “MYDB_lgwr_19518.trc” :
*** 2015-02-17 09:10:40.863
Received ORADEBUG command (#13) 'dump processstate 258' from process '5078'
===================================================
PROCESS STATE
-------------
Process global information:
     process: 0x114b4f738, call: 0x1131afc18, xact: (nil), curses: 0x114e1c148, usrses: 0x114e1c148
     in_exception_handler: no
...
    ----------------------------------------
    SO: 0x114e1c148, type: 4, owner: 0x114b4f738, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0x114b4f738, name=session, file=ksu.h LINE:13957, pg=0 conuid=1
    (session) sid: 121 ser: 25306 trans: (nil), creator: 0x114b4f738
              flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
              flags2: (0x409) -/-/INC
              DID: 0001-000D-00000009, short-term DID: 0000-0000-00000000
              txn branch: (nil)
              con_id/con_uid/con_name: 1/1/CDB$ROOT
              con_logonuid: 1 con_logonid: 1
              edition#: 0              user#/name: 0/SYS
              oct: 0, prv: 0, sql: (nil), psql: (nil)
              stats: 0x106b13d18, PX stats: 0xcebd750
              ksuxds FALSE at location: 0
              service name: SYS$BACKGROUND
              Current Wait Stack:
               0: waiting for 'rdbms ipc message'
                  timeout=0x122, =0x0, =0x0
                  wait_id=48249743 seq_num=15984 snap_id=1
                  wait times: snap=0.047099 sec, exc=0.047099 sec, total=0.047099 sec
                  wait times: max=2.900000 sec, heur=0.047099 sec
                  wait counts: calls=1 os=1
                  in_wait=1 iflags=0x5a8
              There are 27 sessions blocked by this session.
              Dumping one waiter:
                inst: 1, sid: 363, ser: 10326
                wait event: 'log file switch (archiving needed)'
… all of LGWR’s recent waits are idle waits or quick control file reads.
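As an aside, a lighter-weight check on a suspect background process is oradebug short_stack, which prints the process’s current call stack without a full process state dump. A sketch, against the same LGWR ospid as above:

SQL> oradebug setospid 19518
SQL> oradebug short_stack

Taking a few of these in a row shows whether the stack keeps changing (the process is progressing) or is frozen on the same call.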
OK, so LGWR looks fine. Let’s try forcing a log switch by archiving all!
SQL> ALTER SYSTEM ARCHIVE LOG ALL
  2  /
... hangs ...
Looking at the alert log, I see another user actually tried this earlier:
Tue Feb 17 01:01:10 2015
ALTER SYSTEM ARCHIVE LOG
That was 8 hours ago; it looks like we’ve been trying to archive the logs for a long time!
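The 'log file switch (archiving needed)' waits mean the online redo log groups still need archiving, which is easy to confirm from v$log (my own check, not in the original notes):

SQL> SELECT group#, sequence#, archived, status FROM v$log ORDER BY sequence#;

If every group shows ARCHIVED = NO, the current group can never switch, and every session that needs to generate redo will eventually hang, which matches what users were seeing.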
Let’s take a second look at the archivers; two in particular are stuck:
   SID    SER USERN MACHINE    PNAME EVENT                             P1         P2         P3       SECS   BLKR  FBLKR
------ ------ ----- ---------- ----- ------------------------- ---------- ---------- ---------- ---------- ------ ------
   128  42802       MYDB       ARC3  log file sequential read         439          0          0     312961
   243   9464       MYDB       ARC0  log file sequential read         436          0          0     313400
It looks like these two archiver processes have been hung for days. There is plenty of space in the archive destination, yet they are stuck on 'log file sequential read'.
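To back up the “plenty of space” claim, the archive destination can be checked like this (a sketch using standard views; adjust for your destination setup):

SQL> SELECT dest_id, status, destination, error FROM v$archive_dest WHERE status <> 'INACTIVE';
SQL> SELECT name, space_limit, space_used, space_reclaimable FROM v$recovery_file_dest;

In this case it only confirmed what I already knew: the destination was valid and far from full, so space was not the problem.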
SQL> ALTER SYSTEM KILL SESSION '243,9464';
ALTER SYSTEM KILL SESSION '243,9464'
*
ERROR at line 1:
ORA-00031: session marked for kill
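ORA-00031 just means the session has been flagged for kill; PMON still has to clean it up, and here it never managed to. When a plain KILL SESSION stalls like this, the IMMEDIATE variants are worth a try (shown as a sketch with the same sid,serial# as above; they made no difference in this case):

SQL> ALTER SYSTEM KILL SESSION '243,9464' IMMEDIATE;
SQL> ALTER SYSTEM DISCONNECT SESSION '243,9464' IMMEDIATE;

When even those do nothing, the remaining option is killing the server process at the OS level, which is where I went next.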
OK, let’s try to kill arc0 and arc3 at the OS level:
[oracle@MYDB ~]$ ps -ef | grep -i arc
oracle    1506     1  0 Jan13 ?        00:15:45 ora_arc0_MYDB
oracle    1514     1  0 Jan13 ?        00:15:04 ora_arc2_MYDB
oracle    5569  5052  0 09:22 pts/1    00:00:00 grep -i arc
oracle   19612     1  0 Feb13 ?        00:00:00 ora_arc0_MYDB
oracle   19616     1  0 Feb13 ?        00:00:10 ora_arc1_MYDB
oracle   19620     1  0 Feb13 ?        00:00:04 ora_arc2_MYDB
oracle   19624     1  0 Feb13 ?        00:00:00 ora_arc3_MYDB
[oracle@MYDB ~]$ kill -9 19612 19624
But that didn’t work; they’re still there!
[oracle@MYDB ~]$ ps -leaf | grep -i arc
0 D oracle    1506     1  0  80   0 - 877413 ext4_f Jan13 ?        00:15:45 ora_arc0_MYDB
0 D oracle    1514     1  0  80   0 - 877413 ext4_f Jan13 ?        00:15:04 ora_arc2_MYDB
0 S oracle    5589  5052  0  80   0 -  25825 pipe_w 09:24 pts/1    00:00:00 grep -i arc
0 D oracle   19612     1  0  80   0 - 873285 ext4_f Feb13 ?        00:00:00 ora_arc0_MYDB
0 S oracle   19616     1  0  80   0 - 871237 semtim Feb13 ?        00:00:10 ora_arc1_MYDB
0 S oracle   19620     1  0  80   0 - 877381 semtim Feb13 ?        00:00:04 ora_arc2_MYDB
0 D oracle   19624     1  0  80   0 - 873285 ext4_f Feb13 ?        00:00:00 ora_arc3_MYDB
Uh oh – the archive processes are in ‘D’ state, aka uninterruptible sleep, so the kill command did nothing and the archivers are still stuck. It also looks like we have leftover archiver processes from the last time the DB was up (there shouldn’t be two ora_arc0_MYDB and two ora_arc2_MYDB processes).
Hmm, let’s go in as root and try to clean up…
[opc@MYDB ~]$ sudo su root
[root@MYDB opc]#
[root@MYDB opc]# ps -leaf | grep arc
0 D oracle    1506     1  0  80   0 - 877413 ext4_f Jan13 ?        00:15:45 ora_arc0_MYDB
0 D oracle    1514     1  0  80   0 - 877413 ext4_f Jan13 ?        00:15:04 ora_arc2_MYDB
0 S root      5689  5663  0  80   0 -  25825 pipe_w 09:30 pts/2    00:00:00 grep arc
0 D oracle   19612     1  0  80   0 - 873285 ext4_f Feb13 ?        00:00:00 ora_arc0_MYDB
0 S oracle   19616     1  0  80   0 - 871237 semtim Feb13 ?        00:00:10 ora_arc1_MYDB
0 S oracle   19620     1  0  80   0 - 877381 semtim Feb13 ?        00:00:04 ora_arc2_MYDB
0 D oracle   19624     1  0  80   0 - 873285 ext4_f Feb13 ?        00:00:00 ora_arc3_MYDB
[root@MYDB opc]# kill -9 1506 1514
[root@MYDB opc]# ps -leaf | grep arc
0 D oracle    1506     1  0  80   0 - 877413 ext4_f Jan13 ?        00:15:45 ora_arc0_MYDB
0 D oracle    1514     1  0  80   0 - 877413 ext4_f Jan13 ?        00:15:04 ora_arc2_MYDB
...
That didn’t do anything.
[root@MYDB opc]# kill -11 1506 1514
That also fails to work.
[root@MYDB opc]# strace -p 1506
Process 1506 attached - interrupt to quit
… nothing.
Some articles on ‘D’ state (uninterruptible sleep):
http://blog.kevac.org/2013/02/uninterruptible-sleep-d-state.html
http://lwn.net/Articles/288056/
A helpful ps command from the blog.kevac.org article:
[root@MYDB opc]# ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 | grep arc
    1  1506 oracle   Ds    0.0 ora_arc0_MYDB   ext4_file_write_iter
    1  1514 oracle   Ds    0.0 ora_arc2_MYDB   ext4_file_write_iter
    1 19612 oracle   Ds    0.0 ora_arc0_MYDB   ext4_file_write_iter
    1 19616 oracle   Ss    0.0 ora_arc1_MYDB   semtimedop
    1 19620 oracle   Ss    0.0 ora_arc2_MYDB   semtimedop
    1 19624 oracle   Ds    0.0 ora_arc3_MYDB   ext4_file_write_iter
The wchan of the procs in ‘D’ state is “ext4_file_write_iter”: that is the kernel function they are sleeping in, which tells us they are all stuck inside the ext4 filesystem write path.
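If you want more than the single wchan symbol, most Linux kernels also expose the full kernel stack of a sleeping process in /proc/&lt;pid&gt;/stack (root only, and only if the kernel was built with stack tracing); a sketch:

[root@MYDB opc]# cat /proc/1506/stack

That gives essentially the same call trace that shows up in dmesg below.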
Let’s check dmesg for any related messages:
INFO: task ora_arc2_MYDB:1514 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ora_arc2_MYDB   D ffff8803bfc12180     0  1514      1 0x00000080
 ffff8803a51e7cf8 0000000000000286 ffff8803a51e7cc8 ffffffff00000000
 0000000000012180 ffff8803a51e7fd8 ffff8803a51e6010 0000000000012180
 ffff8803a51e7fd8 0000000000012180 ffffffff81791020 ffff8803a51e4300
Call Trace:
 [<ffffffff8150dfaf>] schedule+0x3f/0x60
 [<ffffffffa0063fa5>] ext4_file_write_iter+0x195/0x1e0 [ext4]
 [<ffffffff81091440>] ? wake_up_bit+0x40/0x40
 [<ffffffff8151000e>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
 [<ffffffffa0064073>] ext4_file_write+0x83/0xa0 [ext4]
 [<ffffffffa0063ff0>] ? ext4_file_write_iter+0x1e0/0x1e0 [ext4]
 [<ffffffff811b4a5c>] aio_rw_vect_retry+0x7c/0x1e0
 [<ffffffff811b49e0>] ? lookup_ioctx+0x90/0x90
 [<ffffffff811b6660>] aio_run_iocb+0x70/0x1a0
 [<ffffffff811b7188>] io_submit_one+0x188/0x260
 [<ffffffff811b7356>] do_io_submit+0xf6/0x1c0
 [<ffffffff811b7430>] sys_io_submit+0x10/0x20
 [<ffffffff81518442>] system_call_fastpath+0x16/0x1b
In any case, the only way to get rid of these procs stuck in ‘D’ state is to abort the instance and reboot the server:
SQL> shutdown abort
[root@MYDB opc]# shutdown -r now
The next place to investigate is at the OS level: the archivers were stuck in ‘D’ state trying to write to an ext4 file, so this is a kernel problem and not a database problem!
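Before chasing it further, it’s worth recording exactly which kernel and ext4 module the box is running so they can be matched against known bugs; something like this (my addition, output omitted):

[root@MYDB opc]# uname -r
[root@MYDB opc]# rpm -q kernel-uek        # if this is Oracle Linux running UEK
[root@MYDB opc]# modinfo ext4 | head -3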
The stack in dmesg looks a little bit like this kernel bug:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg826761.html