Troubleshooting walkthrough

Got a message from a user of a little-used dev database that their SQL Developer connection was timing out. Here’s a walkthrough of how I troubleshot (troubleshooted?) the issue, only to find that the root cause was a kernel issue rather than the database.

I could connect as sys:

[oracle@MYDB ~]$ sqlplus "/ as sysdba"
Connected.

but connecting as scott just hung. So I check session waits:

set lines 130
col username for a5
col machine for a10
col event for a25
col sid for 99999
col ser for 99999
col blkr for 99999
col fblkr for 99999

SELECT s.sid, s.username, s.machine, p.pname,
       w.event, w.p1, w.p2, w.p3,
       w.seconds_in_wait secs, s.blocking_session blkr, s.final_blocking_session fblkr
  FROM v$session_wait w, v$session s, v$process p
 WHERE s.sid = w.sid AND s.paddr = p.addr
   AND w.event <> 'SQL*Net message from client'
   AND w.event <> 'rdbms ipc message'
   --and s.username is not null
 ORDER BY s.sid;
 
   SID USERN MACHINE    PNAME EVENT                                  P1         P2         P3       SECS   BLKR  FBLKR
------ ----- ---------- ----- ------------------------------ ---------- ---------- ---------- ---------- ------ ------
     4       MYDB       SMON  buffer busy waits                       4        144         19     309172    125    121
    15 SYS   MYDB             enq: WL - contention           1464598532 2803362147  867667795     287114    243    243
    16 SYS   MYDB             enq: WL - contention           1464598532 2803362147  867667795     200715    243    243
    17       MYDB       J009  buffer busy waits                       8      14403          1     308745    252    121
    19       MYDB       J001  log file switch (archiving nee          0          0          0        125    121    121
                              ded)
 
    20       NROSHAK-CA       enq: SQ - contention           1397817350        444          0     104863    259    121
    22       Julians-Ma       enq: SQ - contention           1397817350        444          0     169427    259    121
             cBook-Pro-
             2.local
   243       MYDB       ARC0  log file sequential read              436          0          0     311285

I see lots of sessions stuck waiting with LGWR (SID 121) as their final blocker:

   SID USERN MACHINE    PNAME EVENT                                  P1         P2         P3       SECS   BLKR  FBLKR
------ ----- ---------- ----- ------------------------------ ---------- ---------- ---------- ---------- ------ ------
   259       Julians-Ma       buffer busy waits                       8       1081          1     169800    383    121
             cBook-Pro-
             2.local
 
   SID USERN MACHINE    PNAME EVENT                                  P1         P2         P3       SECS   BLKR  FBLKR
------ ----- ---------- ----- ------------------------------ ---------- ---------- ---------- ---------- ------ ------
   383       MYDB       J015  log file switch (archiving nee          0          0          0       4121    121    121
                              ded)

But LGWR itself is on the idle wait 'rdbms ipc message' with 0 seconds in wait, so it seems to be progressing…

   SID USERN MACHINE    PNAME EVENT                                  P1         P2         P3       SECS   BLKR  FBLKR
------ ----- ---------- ----- ------------------------------ ---------- ---------- ---------- ---------- ------ ------
   121       MYDB       LGWR  rdbms ipc message                     119          0          0          0
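
Just to double-check that SID 121 really is the common final blocker, a quick aggregate over v$session helps; this is only a sketch, using the same final_blocking_session column as the query above (available in 11.2+):

SELECT final_blocking_session fblkr, COUNT(*) waiters
  FROM v$session
 WHERE final_blocking_session IS NOT NULL
 GROUP BY final_blocking_session
 ORDER BY waiters DESC;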

Let’s dump LGWR’s processstate to check what it’s been up to recently:

SQL> SELECT pid, spid FROM v$process WHERE pname='LGWR'
  2  /
 
       PID SPID
---------- ------------------------
        13 19518
 
SQL> oradebug setospid 19518
Oracle pid: 13, Unix process pid: 19518, image: oracle@MYDB (LGWR)
SQL> oradebug dump processstate 258
Statement processed.
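
A handy extra step: oradebug tracefile_name prints the trace file of the process it is attached to, which here should point at the LGWR trace referenced below.

SQL> oradebug tracefile_name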

In my session trace, “MYDB_ora_5078.trc”:

*** 2015-02-17 09:10:22.730
Processing Oradebug command 'setospid 19518'
 
*** 2015-02-17 09:10:22.731
Oradebug command 'setospid 19518' console output:
Oracle pid: 13, Unix process pid: 19518, image: oracle@MYDB (LGWR)
 
*** 2015-02-17 09:10:40.863
Processing Oradebug command 'dump processstate 258'
 
*** 2015-02-17 09:10:40.863
Issuing ORADEBUG command (#13) 'dump processstate 258' to process 'Unix process pid: 19518, image: oracle@MYDB (LGWR)'
 
*** 2015-02-17 09:10:40.878
Oradebug command 'dump processstate 258' console output: <none>

And in the LGWR trace, “MYDB_lgwr_19518.trc”:

*** 2015-02-17 09:10:40.863
Received ORADEBUG command (#13) 'dump processstate 258' from process '5078'
===================================================
PROCESS STATE
-------------
Process global information:
     process: 0x114b4f738, call: 0x1131afc18, xact: (nil), curses: 0x114e1c148, usrses: 0x114e1c148
     in_exception_handler: no
...
    ----------------------------------------
    SO: 0x114e1c148, type: 4, owner: 0x114b4f738, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0x114b4f738, name=session, file=ksu.h LINE:13957, pg=0 conuid=1
    (session) sid: 121 ser: 25306 trans: (nil), creator: 0x114b4f738
              flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
              flags2: (0x409) -/-/INC
              DID: 0001-000D-00000009, short-term DID: 0000-0000-00000000
              txn branch: (nil)
              con_id/con_uid/con_name: 1/1/CDB$ROOT
              con_logonuid: 1 con_logonid: 1
              edition#: 0              user#/name: 0/SYS
              oct: 0, prv: 0, sql: (nil), psql: (nil)
              stats: 0x106b13d18, PX stats: 0xcebd750
    ksuxds FALSE at location: 0
    service name: SYS$BACKGROUND
    Current Wait Stack:
     0: waiting for 'rdbms ipc message'
        timeout=0x122, =0x0, =0x0
        wait_id=48249743 seq_num=15984 snap_id=1
        wait times: snap=0.047099 sec, exc=0.047099 sec, total=0.047099 sec
        wait times: max=2.900000 sec, heur=0.047099 sec
        wait counts: calls=1 os=1
        in_wait=1 iflags=0x5a8
    There are 27 sessions blocked by this session.
    Dumping one waiter:
      inst: 1, sid: 363, ser: 10326
      wait event: 'log file switch (archiving needed)'

… all of LGWR’s recent waits are either idle waits or quick checks of the control file.
OK, so LGWR itself looks fine. Let’s try forcing a log switch by archiving everything with ALTER SYSTEM ARCHIVE LOG ALL!

SQL> ALTER SYSTEM ARCHIVE LOG ALL
  2  /
... hangs ....

Looking at the alert log, I see another user actually tried this earlier:

Tue Feb 17 01:01:10 2015
ALTER SYSTEM ARCHIVE LOG
[[EOL]]

That was 8 hours ago; it looks like we’ve been trying to archive a log for a long time!
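
(If you’d rather not scroll through the alert log, grepping for the statement works too; the path below assumes the standard ADR layout under $ORACLE_BASE and is only a sketch.)

[oracle@MYDB ~]$ grep -B1 "ARCHIVE LOG" $ORACLE_BASE/diag/rdbms/*/MYDB/trace/alert_MYDB.log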

Let’s take a second look at the archivers; two in particular are stuck:

   SID    SER USERN MACHINE    PNAME EVENT                             P1         P2         P3       SECS   BLKR  FBLKR
------ ------ ----- ---------- ----- ------------------------- ---------- ---------- ---------- ---------- ------ ------
   128  42802       MYDB       ARC3  log file sequential read         439          0          0     312961
   243   9464       MYDB       ARC0  log file sequential read         436          0          0     313400

Looks like these two archiver processes are hung. There is plenty of space in the archive destination, but for some reason both are stuck on 'log file sequential read'.
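
(For the record, “plenty of space” came from a quick look at the archive destination; a sketch of that check, assuming the destination is the fast recovery area — otherwise df -h on the destination tells the same story:)

SELECT name,
       ROUND(space_limit/1024/1024/1024, 1)       limit_gb,
       ROUND(space_used/1024/1024/1024, 1)        used_gb,
       ROUND(space_reclaimable/1024/1024/1024, 1) reclaimable_gb
  FROM v$recovery_file_dest;

Since space isn’t the problem, let’s try killing the stuck archiver sessions from inside the database first: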

SQL> ALTER SYSTEM KILL SESSION '243,9464';
ALTER SYSTEM KILL SESSION '243,9464'
*
ERROR at line 1:
ORA-00031: session marked for kill
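
ORA-00031 just means the session has been flagged to be cleaned up once its current operation finishes, which clearly isn’t going to happen while it’s stuck on I/O. (If you want the archivers’ OS pids straight from the database, the same v$process lookup used for LGWR above works; output omitted here, since the ps listing below shows them anyway.)

SELECT pname, spid
  FROM v$process
 WHERE pname IN ('ARC0', 'ARC3');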

OK, let’s try killing arc0 and arc3 at the OS level:

[oracle@MYDB ~]$ ps -ef | grep -i arc
oracle    1506     1  0 Jan13 ?        00:15:45 ora_arc0_MYDB
oracle    1514     1  0 Jan13 ?        00:15:04 ora_arc2_MYDB
oracle    5569  5052  0 09:22 pts/1    00:00:00 grep -i arc
oracle   19612     1  0 Feb13 ?        00:00:00 ora_arc0_MYDB
oracle   19616     1  0 Feb13 ?        00:00:10 ora_arc1_MYDB
oracle   19620     1  0 Feb13 ?        00:00:04 ora_arc2_MYDB
oracle   19624     1  0 Feb13 ?        00:00:00 ora_arc3_MYDB
 
[oracle@MYDB ~]$ kill -9 19612 19624

But that didn’t work; they’re still there!

[oracle@MYDB ~]$ ps -leaf | grep -i arc
0 D oracle    1506     1  0  80   0 - 877413 ext4_f Jan13 ?       00:15:45 ora_arc0_MYDB
0 D oracle    1514     1  0  80   0 - 877413 ext4_f Jan13 ?       00:15:04 ora_arc2_MYDB
0 S oracle    5589  5052  0  80   0 - 25825 pipe_w 09:24 pts/1    00:00:00 grep -i arc
0 D oracle   19612     1  0  80   0 - 873285 ext4_f Feb13 ?       00:00:00 ora_arc0_MYDB
0 S oracle   19616     1  0  80   0 - 871237 semtim Feb13 ?       00:00:10 ora_arc1_MYDB
0 S oracle   19620     1  0  80   0 - 877381 semtim Feb13 ?       00:00:04 ora_arc2_MYDB
0 D oracle   19624     1  0  80   0 - 873285 ext4_f Feb13 ?       00:00:00 ora_arc3_MYDB

Uh oh – the archiver processes are in ‘D’ state, i.e. uninterruptible sleep, so the kill did nothing and we still have archivers stuck in ‘D’ state. It also looks like we have leftover archiver processes from the LAST time the database was up: there shouldn’t be two ora_arc0 and two ora_arc2 processes for MYDB, yet pids 1506 and 1514 from Jan 13 are still around alongside the current set from Feb 13.

Hmm, let’s go in as root and try to clean up…

[opc@MYDB ~]$ sudo su root
[root@MYDB opc]#
 
[root@MYDB opc]# ps -leaf | grep arc
0 D oracle    1506     1  0  80   0 - 877413 ext4_f Jan13 ?       00:15:45 ora_arc0_MYDB
0 D oracle    1514     1  0  80   0 - 877413 ext4_f Jan13 ?       00:15:04 ora_arc2_MYDB
0 S root      5689  5663  0  80   0 - 25825 pipe_w 09:30 pts/2    00:00:00 grep arc
0 D oracle   19612     1  0  80   0 - 873285 ext4_f Feb13 ?       00:00:00 ora_arc0_MYDB
0 S oracle   19616     1  0  80   0 - 871237 semtim Feb13 ?       00:00:10 ora_arc1_MYDB
0 S oracle   19620     1  0  80   0 - 877381 semtim Feb13 ?       00:00:04 ora_arc2_MYDB
0 D oracle   19624     1  0  80   0 - 873285 ext4_f Feb13 ?       00:00:00 ora_arc3_MYDB
 
[root@MYDB opc]# kill -9 1506 1514
[root@MYDB opc]# ps -leaf | grep arc
 
0 D oracle    1506     1  0  80   0 - 877413 ext4_f Jan13 ?       00:15:45 ora_arc0_MYDB
0 D oracle    1514     1  0  80   0 - 877413 ext4_f Jan13 ?       00:15:04 ora_arc2_MYDB
...

That didn’t do anything.

[root@MYDB opc]# kill -11 1506 1514

Sending SIGSEGV (signal 11) also fails to do anything.

[root@MYDB opc]# strace -p 1506
Process 1506 attached - interrupt to quit

… nothing. The process never comes back from its call into the kernel, so strace has nothing to show.

Some articles on ‘D’ state (uninterruptible sleep):
http://blog.kevac.org/2013/02/uninterruptible-sleep-d-state.html
http://lwn.net/Articles/288056/

A helpful ps command from the blog.kevac.org article:

[root@MYDB opc]# ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 | grep arc
    1  1506 oracle   Ds    0.0 ora_arc0_MYDB   ext4_file_write_iter
    1  1514 oracle   Ds    0.0 ora_arc2_MYDB   ext4_file_write_iter
    1 19612 oracle   Ds    0.0 ora_arc0_MYDB   ext4_file_write_iter
    1 19616 oracle   Ss    0.0 ora_arc1_MYDB   semtimedop
    1 19620 oracle   Ss    0.0 ora_arc2_MYDB   semtimedop
    1 19624 oracle   Ds    0.0 ora_arc3_MYDB   ext4_file_write_iter

The wchan of the processes in ‘D’ state is “ext4_file_write_iter”, which tells us they are all parked inside the kernel’s ext4 write path, waiting on the filesystem.
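
(For an even closer look at where they are parked, /proc/<pid>/stack dumps the kernel stack directly on kernels built with stack-trace support; it may need root.)

[root@MYDB opc]# cat /proc/1506/stack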

Let’s check dmesg for any related messages:

INFO: task ora_arc2_MYDB:1514 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ora_arc2_MYDB   D ffff8803bfc12180     0  1514      1 0x00000080
 ffff8803a51e7cf8 0000000000000286 ffff8803a51e7cc8 ffffffff00000000
 0000000000012180 ffff8803a51e7fd8 ffff8803a51e6010 0000000000012180
 ffff8803a51e7fd8 0000000000012180 ffffffff81791020 ffff8803a51e4300
Call Trace:
 [<ffffffff8150dfaf>] schedule+0x3f/0x60
 [<ffffffffa0063fa5>] ext4_file_write_iter+0x195/0x1e0 [ext4]
 [<ffffffff81091440>] ? wake_up_bit+0x40/0x40
 [<ffffffff8151000e>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
 [<ffffffffa0064073>] ext4_file_write+0x83/0xa0 [ext4]
 [<ffffffffa0063ff0>] ? ext4_file_write_iter+0x1e0/0x1e0 [ext4]
 [<ffffffff811b4a5c>] aio_rw_vect_retry+0x7c/0x1e0
 [<ffffffff811b49e0>] ? lookup_ioctx+0x90/0x90
 [<ffffffff811b6660>] aio_run_iocb+0x70/0x1a0
 [<ffffffff811b7188>] io_submit_one+0x188/0x260
 [<ffffffff811b7356>] do_io_submit+0xf6/0x1c0
 [<ffffffff811b7430>] sys_io_submit+0x10/0x20
 [<ffffffff81518442>] system_call_fastpath+0x16/0x1b

In any case, the only way to get rid of processes stuck in ‘D’ state is to bounce the whole server:

SQL> shutdown abort
[root@MYDB opc]# shutdown -r now

The next place to investigate is at the OS level: the archivers were stuck in ‘D’ state trying to write to a file on ext4, so this is a kernel problem, not a database problem!

The stack in dmesg looks a little like this kernel bug:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg826761.html
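
If you want to match it up against a report like that, the first thing to pin down is the exact kernel build, e.g.:

[root@MYDB opc]# uname -r
[root@MYDB opc]# rpm -q kernel-uek    # assuming Oracle Linux with the UEK kernel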
