728x90

Environment

  • Red Hat Enterprise Linux release 8.7 (Ootpa)
  • 4.18.0-425.10.1.el8_7.x86_64
  • EXT4 filesystem without a journal.

Issue

  • IBM Informix Dynamic Server oninit process hanging, resulting in unresponsive/hung database connections due to deadlock on ext4 filesystem mounted without a journal.
  • IBM Informix Dynamic Server oninit process blocked with the following call traces.
[428617.156430] INFO: task oninit:4327 blocked for more than 120 seconds.
[428617.156431]       Not tainted 4.18.0-425.10.1.el8_7.x86_64 #1
[428617.156433] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[428617.156433] task:oninit          state:D stack:    0 pid: 4327 ppid:  4300 flags:0x00004084
[428617.156436] Call Trace:
[428617.156437]  __schedule+0x2d1/0x860
[428617.156439]  ? blk_flush_plug_list+0xd7/0x100
[428617.156441]  schedule+0x35/0xa0
[428617.156443]  ext4_unwritten_wait+0x87/0xc0 [ext4]
[428617.156460]  ? finish_wait+0x80/0x80
[428617.156462]  ext4_file_write_iter+0x396/0x3e0 [ext4]
[428617.156478]  aio_write+0xf6/0x1c0
[428617.156480]  ? blk_mq_sched_insert_requests+0x6c/0xf0
[428617.156483]  ? io_submit_one+0x7e/0x3c0
[428617.156484]  ? kmem_cache_alloc+0x13f/0x280
[428617.156486]  io_submit_one+0x131/0x3c0
[428617.156488]  __x64_sys_io_submit+0xa2/0x180
[428617.156489]  ? syscall_trace_enter+0x1ff/0x2d0
[428617.156492]  do_syscall_64+0x5b/0x1b0
[428617.156494]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[428617.156497] RIP: 0033:0x7fd1d50409bd
[428617.156499] Code: Unable to access opcode bytes at RIP 0x7fd1d5040993.
[428617.156500] RSP: 002b:00000000531c4da8 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
[428617.156501] RAX: ffffffffffffffda RBX: 00007fd1d698ab78 RCX: 00007fd1d50409bd
[428617.156502] RDX: 00000000531c4de8 RSI: 0000000000000001 RDI: 00007fd1c4593000
[428617.156503] RBP: 00007fd1c4593000 R08: 00000001358df800 R09: 000000005667b8d8
[428617.156504] R10: ffffffffffffffe0 R11: 0000000000000246 R12: 0000000000000001
[428617.156505] R13: 0000000000000000 R14: 00000000531c4de8 R15: 000000005667b850
[..]
[428617.158050] Kernel panic - not syncing: hung_task: blocked tasks
[428617.158052] CPU: 7 PID: 126 Comm: khungtaskd Kdump: loaded Not tainted 4.18.0-425.10.1.el8_7.x86_64 #1
[428617.158053] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.18227214.B64.2106252220 06/25/2021
[428617.158054] Call Trace:
[428617.158055]  dump_stack+0x41/0x60
[428617.158058]  panic+0xe7/0x2ac
[428617.158062]  watchdog+0x242/0x2f0
[428617.158065]  ? hungtask_pm_notify+0x50/0x50
[428617.158068]  kthread+0x10b/0x130
[428617.158070]  ? set_kthread_struct+0x50/0x50
[428617.158073]  ret_from_fork+0x1f/0x40

Resolution

  • This issue is currently being tracked in Red Hat private Bug ID #2179063
    [Private] Bug 2179063 - Database process hanging due to deadlock on ext4 filesystem mounted without a journal.

Root Cause

  • The symptoms are indicative of a known issue that has been addressed by upstream commit: 3eaf9cc62f447
$ git show 3eaf9cc62f447
commit 3eaf9cc62f447a742b26fa601993e94406aa1ea1
Author: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Date:   Tue Nov 5 23:02:23 2019 +1100

    ext4: update ext4_sync_file() to not use __generic_file_fsync()

    When the filesystem is created without a journal, we eventually call
    into __generic_file_fsync() in order to write out all the modified
    in-core data to the permanent storage device. This function happens to
    try and obtain an inode_lock() while synchronizing the files buffer
    and it's associated metadata.

    Generally, this is fine, however it becomes a problem when there is
    higher level code that has already obtained an inode_lock() as this
    leads to a recursive lock situation. This case is especially true when
    porting across direct I/O to iomap infrastructure as we obtain an
    inode_lock() early on in the I/O within ext4_dio_write_iter() and hold
    it until the I/O has been completed. Consequently, to not run into
    this specific issue, we move away from calling into
    __generic_file_fsync() and perform the necessary synchronization tasks
    within ext4_sync_file().

    Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
    Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Link: https://lore.kernel.org/r/3495f35ef67f2021b567e28e6f59222e583689b8.1572949325.git.mbobrowski@mbobrowski.org
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 5508baa11bb6..e10206e7f4bb 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -80,6 +80,43 @@ static int ext4_sync_parent(struct inode *inode)
        return ret;
 }

+static int ext4_fsync_nojournal(struct inode *inode, bool datasync,
+                               bool *needs_barrier)
+{
+       int ret, err;
+
+       ret = sync_mapping_buffers(inode->i_mapping);
+       if (!(inode->i_state & I_DIRTY_ALL))
+               return ret;
+       if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
+               return ret;
+
+       err = sync_inode_metadata(inode, 1);
+       if (!ret)
+               ret = err;
+
+       if (!ret)
+               ret = ext4_sync_parent(inode);
+       if (test_opt(inode->i_sb, BARRIER))
+               *needs_barrier = true;
+
+       return ret;
+}
+
+static int ext4_fsync_journal(struct inode *inode, bool datasync,
+                            bool *needs_barrier)
+{
+       struct ext4_inode_info *ei = EXT4_I(inode);
+       journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+       tid_t commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
+
+       if (journal->j_flags & JBD2_BARRIER &&
+           !jbd2_trans_will_send_data_barrier(journal, commit_tid))
+               *needs_barrier = true;
+
+       return jbd2_complete_transaction(journal, commit_tid);
+}
+
 /*
  * akpm: A new design for ext4_sync_file().
  *
@@ -91,17 +128,14 @@ static int ext4_sync_parent(struct inode *inode)
  * What we do is just kick off a commit and wait on it.  This will snapshot the
  * inode to disk.
  */
-
 int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 {
-       struct inode *inode = file->f_mapping->host;
-       struct ext4_inode_info *ei = EXT4_I(inode);
-       journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
        int ret = 0, err;
-       tid_t commit_tid;
        bool needs_barrier = false;
+       struct inode *inode = file->f_mapping->host;
+       struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);

-       if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+       if (unlikely(ext4_forced_shutdown(sbi)))
                return -EIO;

        J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -111,23 +145,15 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
        if (sb_rdonly(inode->i_sb)) {
                /* Make sure that we read updated s_mount_flags value */
                smp_rmb();
-               if (EXT4_SB(inode->i_sb)->s_mount_flags & EXT4_MF_FS_ABORTED)
+               if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED)
                        ret = -EROFS;
                goto out;
        }

-       if (!journal) {
-               ret = __generic_file_fsync(file, start, end, datasync);
-               if (!ret)
-                       ret = ext4_sync_parent(inode);
-               if (test_opt(inode->i_sb, BARRIER))
-                       goto issue_flush;
-               goto out;
-       }
-
        ret = file_write_and_wait_range(file, start, end);
        if (ret)
                return ret;
+
        /*
         * data=writeback,ordered:
         *  The caller's filemap_fdatawrite()/wait will sync the data.
@@ -142,18 +168,14 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
         *  (they were dirtied by commit).  But that's OK - the blocks are
         *  safe in-journal, which is all fsync() needs to ensure.
         */
-       if (ext4_should_journal_data(inode)) {
+       if (!sbi->s_journal)
+               ret = ext4_fsync_nojournal(inode, datasync, &needs_barrier);
+       else if (ext4_should_journal_data(inode))
                ret = ext4_force_commit(inode->i_sb);
-               goto out;
-       }
+       else
+               ret = ext4_fsync_journal(inode, datasync, &needs_barrier);

-       commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
-       if (journal->j_flags & JBD2_BARRIER &&
-           !jbd2_trans_will_send_data_barrier(journal, commit_tid))
-               needs_barrier = true;
-       ret = jbd2_complete_transaction(journal, commit_tid);
        if (needs_barrier) {
-       issue_flush:
                err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
                if (!ret)
                        ret = err;
</tytso@mit.edu></jack@suse.cz></riteshh@linux.ibm.com></mbobrowski@mbobrowski.org></mbobrowski@mbobrowski.org>

Diagnostic Steps

  • A crash dump (VMcore) captured by enabling sysctl hung_task_panic indicates that there is a deadlock between the oninit task PID 4327 waiting for ext4_inode_info->i_unwritten to reach 0 and a rescuer worker thread dio/dm-8 PID 4585 waiting for the rw_semaphore lock held by the "oninit" task PID 4327 waiting for "ext4_inode_info->i_unwritten" to reach 0.

Kernel Crash Dump Analysis:

  • System Information:
crash> sys | grep -e RELEASE -e PANIC
     RELEASE: 4.18.0-425.10.1.el8_7.x86_64
       PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"

crash> sys -i | grep DMI_SYS_VENDOR: -A 1
         DMI_SYS_VENDOR: VMware, Inc.
       DMI_PRODUCT_NAME: VMware7,1
  • Kernel Ring Buffer:
crash> log
[..]
[428617.156430] INFO: task oninit:4327 blocked for more than 120 seconds.
[428617.156431]       Not tainted 4.18.0-425.10.1.el8_7.x86_64 #1
[428617.156433] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[428617.156433] task:oninit          state:D stack:    0 pid: 4327 ppid:  4300 flags:0x00004084
[428617.156436] Call Trace:
[428617.156437]  __schedule+0x2d1/0x860
[428617.156439]  ? blk_flush_plug_list+0xd7/0x100
[428617.156441]  schedule+0x35/0xa0
[428617.156443]  ext4_unwritten_wait+0x87/0xc0 [ext4]
[428617.156460]  ? finish_wait+0x80/0x80
[428617.156462]  ext4_file_write_iter+0x396/0x3e0 [ext4]
[428617.156478]  aio_write+0xf6/0x1c0
[428617.156480]  ? blk_mq_sched_insert_requests+0x6c/0xf0
[428617.156483]  ? io_submit_one+0x7e/0x3c0
[428617.156484]  ? kmem_cache_alloc+0x13f/0x280
[428617.156486]  io_submit_one+0x131/0x3c0
[428617.156488]  __x64_sys_io_submit+0xa2/0x180
[428617.156489]  ? syscall_trace_enter+0x1ff/0x2d0
[428617.156492]  do_syscall_64+0x5b/0x1b0
[428617.156494]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[428617.156497] RIP: 0033:0x7fd1d50409bd
[428617.156499] Code: Unable to access opcode bytes at RIP 0x7fd1d5040993.
[428617.156500] RSP: 002b:00000000531c4da8 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
[428617.156501] RAX: ffffffffffffffda RBX: 00007fd1d698ab78 RCX: 00007fd1d50409bd
[428617.156502] RDX: 00000000531c4de8 RSI: 0000000000000001 RDI: 00007fd1c4593000
[428617.156503] RBP: 00007fd1c4593000 R08: 00000001358df800 R09: 000000005667b8d8
[428617.156504] R10: ffffffffffffffe0 R11: 0000000000000246 R12: 0000000000000001
[428617.156505] R13: 0000000000000000 R14: 00000000531c4de8 R15: 000000005667b850
[..]
[428617.158050] Kernel panic - not syncing: hung_task: blocked tasks
[428617.158052] CPU: 7 PID: 126 Comm: khungtaskd Kdump: loaded Not tainted 4.18.0-425.10.1.el8_7.x86_64 #1
[428617.158053] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.18227214.B64.2106252220 06/25/2021
[428617.158054] Call Trace:
[428617.158055]  dump_stack+0x41/0x60
[428617.158058]  panic+0xe7/0x2ac
[428617.158062]  watchdog+0x242/0x2f0
[428617.158065]  ? hungtask_pm_notify+0x50/0x50
[428617.158068]  kthread+0x10b/0x130
[428617.158070]  ? set_kthread_struct+0x50/0x50
[428617.158073]  ret_from_fork+0x1f/0x40
[..]
  • The oninit task PID: 4327 has been in the uninterruptible (UN) state for 2 minutes and 35 seconds.
crash> set 4327
    PID: 4327
COMMAND: "oninit"
   TASK: ffff9fc84c518000  [THREAD_INFO: ffff9fc84c518000]
    CPU: 9
  STATE: TASK_UNINTERRUPTIBLE 

crash> px ((struct file *)((struct task_struct *)0xffff9fc84c518000)->mm->exe_file)->f_path.dentry
$4 = (struct dentry *) 0xffff9fc8fc257e00

crash> files -d 0xffff9fc8fc257e00
     DENTRY           INODE           SUPERBLK     TYPE PATH
ffff9fc8fc257e00 ffff9fc8fc178120 ffff9fc85014f000 REG  /users4/ifxdb_14.10.FC8_WE/bin/oninit

crash> ps -m 4327
[0 00:02:35.486] [UN]  PID: 4327     TASK: ffff9fc84c518000  CPU: 9    COMMAND: "oninit"
  • Backtrace of the oninit task PID: 4327
crash> bt
PID: 4327     TASK: ffff9fc84c518000  CPU: 9    COMMAND: "oninit"
 #0 [ffffb2fd13a77bf0] __schedule at ffffffffbb9eb121
 #1 [ffffb2fd13a77c80] schedule at ffffffffbb9eb6e5
 #2 [ffffb2fd13a77c90] ext4_unwritten_wait at ffffffffc08b7317 [ext4]
 #3 [ffffb2fd13a77cd8] ext4_file_write_iter at ffffffffc08b7b46 [ext4]
 #4 [ffffb2fd13a77d38] aio_write at ffffffffbb3b9e16
 #5 [ffffb2fd13a77e40] io_submit_one at ffffffffbb3ba1a1
 #6 [ffffb2fd13a77eb8] __x64_sys_io_submit at ffffffffbb3ba7b2
 #7 [ffffb2fd13a77f38] do_syscall_64 at ffffffffbb0043ab
 #8 [ffffb2fd13a77f50] entry_SYSCALL_64_after_hwframe at ffffffffbba000a9
    RIP: 00007fd1d50409bd  RSP: 00000000531c4da8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 00007fd1d698ab78  RCX: 00007fd1d50409bd
    RDX: 00000000531c4de8  RSI: 0000000000000001  RDI: 00007fd1c4593000
    RBP: 00007fd1c4593000   R8: 00000001358df800   R9: 000000005667b8d8
    R10: ffffffffffffffe0  R11: 0000000000000246  R12: 0000000000000001
    R13: 0000000000000000  R14: 00000000531c4de8  R15: 000000005667b850
    ORIG_RAX: 00000000000000d1  CS: 0033  SS: 002b
  • The oninit task PID: 4327 was waiting for an ext4_inode_info->i_unwritten to reach 0 for an ext4 inode:
crash> dis -rl ffffffffc08b7317 | tail -n 5
/usr/src/debug/kernel-4.18.0-425.10.1.el8_7/linux-4.18.0-425.10.1.el8_7.x86_64/fs/ext4/file.c: 111
0xffffffffc08b730b <ext4_unwritten_wait+123>:   call   0xffffffffbb1447b0 <init_wait_entry>
0xffffffffc08b7310 <ext4_unwritten_wait+128>:   jmp    0xffffffffc08b7317 <ext4_unwritten_wait+135>
0xffffffffc08b7312 <ext4_unwritten_wait+130>:   call   0xffffffffbb9eb6b0 <schedule>
0xffffffffc08b7317 <ext4_unwritten_wait+135>:   mov    $0x2,%edx

Kernel Source: fs/ext4/file.c
*****************************
218 static ssize_t
219 ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
220 {
221         struct inode *inode = file_inode(iocb->ki_filp);
222         int o_direct = iocb->ki_flags & IOCB_DIRECT;
223         int unaligned_aio = 0;
224         int overwrite = 0;
225         ssize_t ret;
...
273         /*
274          * Unaligned direct AIO must be the only IO in flight. Otherwise
275          * overlapping aligned IO after unaligned might result in data
276          * corruption.
277          */
278         if (ret == -EIOCBQUEUED && unaligned_aio)
279                 ext4_unwritten_wait(inode);
...

Kernel Source: fs/ext4/file.c
*****************************
107 static void ext4_unwritten_wait(struct inode *inode)
108 {
109         wait_queue_head_t *wq = ext4_ioend_wq(inode);
110 
111         wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_unwritten) == 0));

crash> px ext4_unwritten_wait
ext4_unwritten_wait = $2 = 
 {void (struct inode *)} 0xffffffffc08b7290 <ext4_unwritten_wait>

crash> dis -r ffffffffc08b7b46 | tail -n 3
0xffffffffc08b7b3e <ext4_file_write_iter+910>:  mov    %rbp,%rdi
0xffffffffc08b7b41 <ext4_file_write_iter+913>:  call   0xffffffffc08b7290 <ext4_unwritten_wait>
0xffffffffc08b7b46 <ext4_file_write_iter+918>:  mov    %r14,%rdi

crash> bt -f | grep ext4_file_write_iter
 #3 [ffffb2fd13a77cd8] ext4_file_write_iter at ffffffffc08b7b46 [ext4]

crash> px (0xffffb2fd13a77cd8-0x8)
$5 = 0xffffb2fd13a77cd0

crash> rd 0xffffb2fd13a77cd0
ffffb2fd13a77cd0:  ffff9fca42897788                    .w.B....

o struct inode * == 0xffff9fca42897788

crash> struct ext4_inode_info.vfs_inode  -ox
struct ext4_inode_info {
   [0xe8] struct inode vfs_inode;
}

crash> px (0xffff9fca42897788-0xe8)
$6 = 0xffff9fca428976a0

o struct ext4_inode_info * == 0xffff9fca428976a0

crash> struct ext4_inode_info.i_unwritten 0xffff9fca428976a0
  i_unwritten = {
    counter = 1
  },

crash> struct inode.i_dentry 0xffff9fca42897788
    i_dentry = {
      first = 0xffff9fc85a8cc8f0
    },

crash> px (0xffff9fc85a8cc8f0-0xb0)
$7 = 0xffff9fc85a8cc840

o struct dentry * == 0xffff9fc85a8cc840

crash> files -d 0xffff9fc85a8cc840
     DENTRY           INODE           SUPERBLK     TYPE PATH
ffff9fc85a8cc840 ffff9fca42897788 ffff9fc84bf55800 REG  /informix/data/data3dbs_1
  • The ext4 file system mounted without journalling:
crash> mount | grep -e NAME -e ffff9fc84bf55800
     MOUNT           SUPERBLK     TYPE   DEVNAME   DIRNAME
ffff9fc87136d080 ffff9fc84bf55800 ext4   /dev/mapper/rhel-informix /informix  <<< - - -

crash> px ((struct ext4_sb_info *)((struct super_block *)0xffff9fc84bf55800)->s_fs_info)->s_journal
$8 = (struct journal_s *) 0x0

crash> struct inode.i_sb 0xffff9fca42897788
  i_sb = 0xffff9fc84bf55800,

crash> struct super_block.s_bdev 0xffff9fc84bf55800
  s_bdev = 0xffff9fc880c32d00,

crash> struct block_device.bd_disk 0xffff9fc880c32d00
  bd_disk = 0xffff9fc8706a9000,

crash> struct gendisk.disk_name 0xffff9fc8706a9000
  disk_name = "dm-8\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",

crash> dmshow --lvs | grep -e NAME -e dm-8
LV DM-X DEV   LV NAME                                   VG NAME                                  OPEN COUNT        LV SIZE (MiB)     PV NAME
dm-8          informix                                  rhel                                              1            512000.00     sda3

crash> dev -d | grep -e NAME -e dm-8
MAJOR GENDISK            NAME       REQUEST_QUEUE      TOTAL ASYNC  SYNC 
  253 ffff9fc8706a9000   dm-8       ffff9fc848c21fb0       0     0     0 
  • Let's verify the rescuer worker thread dio/dm-8 PID: 4585
crash> set 4585
    PID: 4585
COMMAND: "dio/dm-8"
   TASK: ffff9fca0d9ea800  [THREAD_INFO: ffff9fca0d9ea800]
    CPU: 4
  STATE: TASK_UNINTERRUPTIBLE 
  • It has been in the uninterruptible (UN) state for 2 minutes and 35 seconds.
crash> ps -m 4585
[0 00:02:35.436] [UN]  PID: 4585     TASK: ffff9fca0d9ea800  CPU: 4    COMMAND: "dio/dm-8"
  • Backtrace of the dio/dm-8 thread PID: 4585
crash> bt
PID: 4585     TASK: ffff9fca0d9ea800  CPU: 4    COMMAND: "dio/dm-8"
 #0 [ffffb2fd13bd7c98] __schedule at ffffffffbb9eb121
 #1 [ffffb2fd13bd7d28] schedule at ffffffffbb9eb6e5
 #2 [ffffb2fd13bd7d38] rwsem_down_write_slowpath at ffffffffbb15503c
 #3 [ffffb2fd13bd7dc0] __generic_file_fsync at ffffffffbb389f92
 #4 [ffffb2fd13bd7df0] ext4_sync_file at ffffffffc08b9310 [ext4]
 #5 [ffffb2fd13bd7e30] dio_complete at ffffffffbb3a5353
 #6 [ffffb2fd13bd7e68] process_one_work at ffffffffbb110767
 #7 [ffffb2fd13bd7ea8] rescuer_thread at ffffffffbb110b1b
 #8 [ffffb2fd13bd7f10] kthread at ffffffffbb117f8b
 #9 [ffffb2fd13bd7f50] ret_from_fork at ffffffffbba0024f
  • It was waiting for the rw_semaphore (inode->i_rwsem) lock.
crash> dis -rl ffffffffbb389f92 | tail -n 6
/usr/src/debug/kernel-4.18.0-425.10.1.el8_7/linux-4.18.0-425.10.1.el8_7.x86_64/./include/linux/fs.h: 787
0xffffffffbb389f83 <__generic_file_fsync+51>:   lea    0xa8(%rbp),%r14
0xffffffffbb389f8a <__generic_file_fsync+58>:   mov    %r14,%rdi
0xffffffffbb389f8d <__generic_file_fsync+61>:   call   0xffffffffbb9edd50 <down_write>
/usr/src/debug/kernel-4.18.0-425.10.1.el8_7/linux-4.18.0-425.10.1.el8_7.x86_64/fs/libfs.c: 1004
0xffffffffbb389f92 <__generic_file_fsync+66>:   mov    0x30(%rbp),%rdi

Kernel Source: fs/libfs.c
*************************
 992 int __generic_file_fsync(struct file *file, loff_t start, loff_t end,
 993                                  int datasync)
 994 {
 995         struct inode *inode = file->f_mapping->host;
 996         int err;
 997         int ret;
 998 
 999         err = file_write_and_wait_range(file, start, end);
1000         if (err)
1001                 return err;
1002 
1003         inode_lock(inode);

Kernel Source: include/linux/fs.h
*********************************
785 static inline void inode_lock(struct inode *inode)
786 {
787         down_write(&inode->i_rwsem);

crash> dis -r ffffffffc08b9310 | tail -n 3
0xffffffffc08b9308 <ext4_sync_file+456>:    mov    %rbx,%rdi
0xffffffffc08b930b <ext4_sync_file+459>:    call   0xffffffffbb389f50 <__generic_file_fsync>
0xffffffffc08b9310 <ext4_sync_file+464>:    mov    %eax,%r13d

crash> dis -r ffffffffbb389f92 | head -n 7
0xffffffffbb389f50 <__generic_file_fsync>:  nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffffbb389f55 <__generic_file_fsync+5>:    push   %r14
0xffffffffbb389f57 <__generic_file_fsync+7>:    push   %r13
0xffffffffbb389f59 <__generic_file_fsync+9>:    push   %r12
0xffffffffbb389f5b <__generic_file_fsync+11>:   mov    %ecx,%r12d
0xffffffffbb389f5e <__generic_file_fsync+14>:   push   %rbp
0xffffffffbb389f5f <__generic_file_fsync+15>:   push   %rbx

crash> bt -f | grep __generic_file_fsync -A 4
 #3 [ffffb2fd13bd7dc0] __generic_file_fsync at ffffffffbb389f92
    ffffb2fd13bd7dc8: ffff9fc90b9dde00 ffff9fca42897788 
    ffffb2fd13bd7dd8: 0000000000000001 0000000000000800 
    ffffb2fd13bd7de8: 000000008b6517ff ffffffffc08b9310 
 #4 [ffffb2fd13bd7df0] ext4_sync_file at ffffffffc08b9310 [ext4]

o struct file * == 0xffff9fc90b9dde00

crash> px ((struct file *)0xffff9fc90b9dde00)->f_mapping.host
$9 = (struct inode *) 0xffff9fca42897788

o struct inode * == 0xffff9fca42897788

crash> struct inode.i_dentry 0xffff9fca42897788
    i_dentry = {
      first = 0xffff9fc85a8cc8f0
    },

crash> px (0xffff9fc85a8cc8f0-0xb0)
$10 = 0xffff9fc85a8cc840

o struct dentry * == 0xffff9fc85a8cc840

crash> files -d 0xffff9fc85a8cc840
     DENTRY           INODE           SUPERBLK     TYPE PATH
ffff9fc85a8cc840 ffff9fca42897788 ffff9fc84bf55800 REG  /informix/data/data3dbs_1

crash> struct inode.i_rwsem 0xffff9fca42897788 -ox
struct inode {
  [ffff9fca42897830] struct rw_semaphore i_rwsem;
}

o struct rw_semaphore * == 0xffff9fca42897830

crash> struct rw_semaphore.wait_list 0xffff9fca42897830 -ox
struct rw_semaphore {
  [ffff9fca42897838] struct list_head wait_list;
}
  • The waitlist of the rw_semaphore has 6 tasks.
crash> list -H 0xffff9fca42897838
ffffb2fd066abd90
ffffb2fd065dfd90
ffffb2fd194efd90
ffffb2fd17a47d90
ffffb2fd1b2dfd90
ffffb2fd13bd7d60
  • List of the task waiting for the rw_semaphore:
crash> list -H 0xffff9fca42897838 -s rwsem_waiter.task | grep task | tr -d , | awk '{print $3}' > rw_task

crash> ps -m < rw_task
[0 00:02:35.467] [UN]  PID: 3613198  TASK: ffff9fdee0a50000  CPU: 4    COMMAND: "kworker/4:5"
[0 00:02:35.467] [UN]  PID: 3613196  TASK: ffff9fe291838000  CPU: 4    COMMAND: "kworker/4:3"
[0 00:02:35.467] [UN]  PID: 3606261  TASK: ffff9fd99a8aa800  CPU: 4    COMMAND: "kworker/4:7"
[0 00:02:35.467] [UN]  PID: 3609395  TASK: ffff9fe68de85000  CPU: 4    COMMAND: "kworker/4:1"
[0 00:02:35.467] [UN]  PID: 3612070  TASK: ffff9fd2ae145000  CPU: 4    COMMAND: "kworker/4:0"
[0 00:02:35.436] [UN]  PID: 4585     TASK: ffff9fca0d9ea800  CPU: 4    COMMAND: "dio/dm-8"

crash> list -H 0xffff9fca42897838 -o rwsem_waiter.list -s rwsem_waiter.type | grep type | awk '{print $3}' | sort | uniq -c
      6 RWSEM_WAITING_FOR_WRITE,
  • The rw_semaphore (inode->i_rwsem) was held by the oninit task PID: 4327
crash> px ((struct rw_semaphore *)0xffff9fca42897830)->owner
$11 = {
  counter = 0xffff9fc84c518000
}

o struct task_struct * == 0xffff9fc84c518000

crash> task_struct.pid 0xffff9fc84c518000
  pid = 4327,

crash> ps 4327
      PID    PPID  CPU       TASK        ST  %MEM      VSZ      RSS  COMM
     4327    4300   9  ffff9fc84c518000  UN   0.0 29594148    25704  oninit

crash> set 4327
    PID: 4327
COMMAND: "oninit"
   TASK: ffff9fc84c518000  [THREAD_INFO: ffff9fc84c518000]
    CPU: 9
  STATE: TASK_UNINTERRUPTIBLE 
  • It has acquired the rw_semaphore in the ext4_file_write_iter() function.
crash> bt
PID: 4327     TASK: ffff9fc84c518000  CPU: 9    COMMAND: "oninit"
 #0 [ffffb2fd13a77bf0] __schedule at ffffffffbb9eb121
 #1 [ffffb2fd13a77c80] schedule at ffffffffbb9eb6e5
 #2 [ffffb2fd13a77c90] ext4_unwritten_wait at ffffffffc08b7317 [ext4]
 #3 [ffffb2fd13a77cd8] ext4_file_write_iter at ffffffffc08b7b46 [ext4]
 #4 [ffffb2fd13a77d38] aio_write at ffffffffbb3b9e16
 #5 [ffffb2fd13a77e40] io_submit_one at ffffffffbb3ba1a1
 #6 [ffffb2fd13a77eb8] __x64_sys_io_submit at ffffffffbb3ba7b2
 #7 [ffffb2fd13a77f38] do_syscall_64 at ffffffffbb0043ab
 #8 [ffffb2fd13a77f50] entry_SYSCALL_64_after_hwframe at ffffffffbba000a9
    RIP: 00007fd1d50409bd  RSP: 00000000531c4da8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 00007fd1d698ab78  RCX: 00007fd1d50409bd
    RDX: 00000000531c4de8  RSI: 0000000000000001  RDI: 00007fd1c4593000
    RBP: 00007fd1c4593000   R8: 00000001358df800   R9: 000000005667b8d8
    R10: ffffffffffffffe0  R11: 0000000000000246  R12: 0000000000000001
    R13: 0000000000000000  R14: 00000000531c4de8  R15: 000000005667b850
    ORIG_RAX: 00000000000000d1  CS: 0033  SS: 002b
  • The oninit task PID: 4327 and dio/dm-8 thread PID: 4585 were deadlocked.
 

https://access.redhat.com/solutions/7006119

 

 

728x90

+ Recent posts