Environment
- Red Hat Enterprise Linux release 8.7 (Ootpa)
- 4.18.0-425.10.1.el8_7.x86_64
- EXT4 filesystem without a journal.
Issue
- IBM Informix Dynamic Server oninit process hanging, resulting in unresponsive/hung database connections due to deadlock on ext4 filesystem mounted without a journal.
- IBM Informix Dynamic Server oninit process blocked with the following call traces.
[428617.156430] INFO: task oninit:4327 blocked for more than 120 seconds.
[428617.156431] Not tainted 4.18.0-425.10.1.el8_7.x86_64 #1
[428617.156433] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[428617.156433] task:oninit state:D stack: 0 pid: 4327 ppid: 4300 flags:0x00004084
[428617.156436] Call Trace:
[428617.156437] __schedule+0x2d1/0x860
[428617.156439] ? blk_flush_plug_list+0xd7/0x100
[428617.156441] schedule+0x35/0xa0
[428617.156443] ext4_unwritten_wait+0x87/0xc0 [ext4]
[428617.156460] ? finish_wait+0x80/0x80
[428617.156462] ext4_file_write_iter+0x396/0x3e0 [ext4]
[428617.156478] aio_write+0xf6/0x1c0
[428617.156480] ? blk_mq_sched_insert_requests+0x6c/0xf0
[428617.156483] ? io_submit_one+0x7e/0x3c0
[428617.156484] ? kmem_cache_alloc+0x13f/0x280
[428617.156486] io_submit_one+0x131/0x3c0
[428617.156488] __x64_sys_io_submit+0xa2/0x180
[428617.156489] ? syscall_trace_enter+0x1ff/0x2d0
[428617.156492] do_syscall_64+0x5b/0x1b0
[428617.156494] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[428617.156497] RIP: 0033:0x7fd1d50409bd
[428617.156499] Code: Unable to access opcode bytes at RIP 0x7fd1d5040993.
[428617.156500] RSP: 002b:00000000531c4da8 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
[428617.156501] RAX: ffffffffffffffda RBX: 00007fd1d698ab78 RCX: 00007fd1d50409bd
[428617.156502] RDX: 00000000531c4de8 RSI: 0000000000000001 RDI: 00007fd1c4593000
[428617.156503] RBP: 00007fd1c4593000 R08: 00000001358df800 R09: 000000005667b8d8
[428617.156504] R10: ffffffffffffffe0 R11: 0000000000000246 R12: 0000000000000001
[428617.156505] R13: 0000000000000000 R14: 00000000531c4de8 R15: 000000005667b850
[..]
[428617.158050] Kernel panic - not syncing: hung_task: blocked tasks
[428617.158052] CPU: 7 PID: 126 Comm: khungtaskd Kdump: loaded Not tainted 4.18.0-425.10.1.el8_7.x86_64 #1
[428617.158053] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.18227214.B64.2106252220 06/25/2021
[428617.158054] Call Trace:
[428617.158055] dump_stack+0x41/0x60
[428617.158058] panic+0xe7/0x2ac
[428617.158062] watchdog+0x242/0x2f0
[428617.158065] ? hungtask_pm_notify+0x50/0x50
[428617.158068] kthread+0x10b/0x130
[428617.158070] ? set_kthread_struct+0x50/0x50
[428617.158073] ret_from_fork+0x1f/0x40
Resolution
- This issue is currently being tracked in Red Hat private Bug ID #2179063
[Private] Bug 2179063 - Database process hanging due to deadlock on ext4 filesystem mounted without a journal.
Root Cause
- The symptoms are indicative of a known issue that has been addressed by upstream commit: 3eaf9cc62f447
$ git show 3eaf9cc62f447
commit 3eaf9cc62f447a742b26fa601993e94406aa1ea1
Author: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Date: Tue Nov 5 23:02:23 2019 +1100
ext4: update ext4_sync_file() to not use __generic_file_fsync()
When the filesystem is created without a journal, we eventually call
into __generic_file_fsync() in order to write out all the modified
in-core data to the permanent storage device. This function happens to
try and obtain an inode_lock() while synchronizing the files buffer
and it's associated metadata.
Generally, this is fine, however it becomes a problem when there is
higher level code that has already obtained an inode_lock() as this
leads to a recursive lock situation. This case is especially true when
porting across direct I/O to iomap infrastructure as we obtain an
inode_lock() early on in the I/O within ext4_dio_write_iter() and hold
it until the I/O has been completed. Consequently, to not run into
this specific issue, we move away from calling into
__generic_file_fsync() and perform the necessary synchronization tasks
within ext4_sync_file().
Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/3495f35ef67f2021b567e28e6f59222e583689b8.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 5508baa11bb6..e10206e7f4bb 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -80,6 +80,43 @@ static int ext4_sync_parent(struct inode *inode)
return ret;
}
+static int ext4_fsync_nojournal(struct inode *inode, bool datasync,
+ bool *needs_barrier)
+{
+ int ret, err;
+
+ ret = sync_mapping_buffers(inode->i_mapping);
+ if (!(inode->i_state & I_DIRTY_ALL))
+ return ret;
+ if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
+ return ret;
+
+ err = sync_inode_metadata(inode, 1);
+ if (!ret)
+ ret = err;
+
+ if (!ret)
+ ret = ext4_sync_parent(inode);
+ if (test_opt(inode->i_sb, BARRIER))
+ *needs_barrier = true;
+
+ return ret;
+}
+
+static int ext4_fsync_journal(struct inode *inode, bool datasync,
+ bool *needs_barrier)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+ tid_t commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
+
+ if (journal->j_flags & JBD2_BARRIER &&
+ !jbd2_trans_will_send_data_barrier(journal, commit_tid))
+ *needs_barrier = true;
+
+ return jbd2_complete_transaction(journal, commit_tid);
+}
+
/*
* akpm: A new design for ext4_sync_file().
*
@@ -91,17 +128,14 @@ static int ext4_sync_parent(struct inode *inode)
* What we do is just kick off a commit and wait on it. This will snapshot the
* inode to disk.
*/
-
int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
{
- struct inode *inode = file->f_mapping->host;
- struct ext4_inode_info *ei = EXT4_I(inode);
- journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
int ret = 0, err;
- tid_t commit_tid;
bool needs_barrier = false;
+ struct inode *inode = file->f_mapping->host;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
- if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+ if (unlikely(ext4_forced_shutdown(sbi)))
return -EIO;
J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -111,23 +145,15 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
if (sb_rdonly(inode->i_sb)) {
/* Make sure that we read updated s_mount_flags value */
smp_rmb();
- if (EXT4_SB(inode->i_sb)->s_mount_flags & EXT4_MF_FS_ABORTED)
+ if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED)
ret = -EROFS;
goto out;
}
- if (!journal) {
- ret = __generic_file_fsync(file, start, end, datasync);
- if (!ret)
- ret = ext4_sync_parent(inode);
- if (test_opt(inode->i_sb, BARRIER))
- goto issue_flush;
- goto out;
- }
-
ret = file_write_and_wait_range(file, start, end);
if (ret)
return ret;
+
/*
* data=writeback,ordered:
* The caller's filemap_fdatawrite()/wait will sync the data.
@@ -142,18 +168,14 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
* (they were dirtied by commit). But that's OK - the blocks are
* safe in-journal, which is all fsync() needs to ensure.
*/
- if (ext4_should_journal_data(inode)) {
+ if (!sbi->s_journal)
+ ret = ext4_fsync_nojournal(inode, datasync, &needs_barrier);
+ else if (ext4_should_journal_data(inode))
ret = ext4_force_commit(inode->i_sb);
- goto out;
- }
+ else
+ ret = ext4_fsync_journal(inode, datasync, &needs_barrier);
- commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
- if (journal->j_flags & JBD2_BARRIER &&
- !jbd2_trans_will_send_data_barrier(journal, commit_tid))
- needs_barrier = true;
- ret = jbd2_complete_transaction(journal, commit_tid);
if (needs_barrier) {
- issue_flush:
err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
if (!ret)
ret = err;
</tytso@mit.edu></jack@suse.cz></riteshh@linux.ibm.com></mbobrowski@mbobrowski.org></mbobrowski@mbobrowski.org>
Diagnostic Steps
- A crash dump (VMcore) captured by enabling sysctl hung_task_panic indicates that there is a deadlock between the oninit task PID 4327 waiting for ext4_inode_info->i_unwritten to reach 0 and a rescuer worker thread dio/dm-8 PID 4585 waiting for the rw_semaphore lock held by the "oninit" task PID 4327 waiting for "ext4_inode_info->i_unwritten" to reach 0.
Kernel Crash Dump Analysis:
crash> sys | grep -e RELEASE -e PANIC
RELEASE: 4.18.0-425.10.1.el8_7.x86_64
PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
crash> sys -i | grep DMI_SYS_VENDOR: -A 1
DMI_SYS_VENDOR: VMware, Inc.
DMI_PRODUCT_NAME: VMware7,1
crash> log
[..]
[428617.156430] INFO: task oninit:4327 blocked for more than 120 seconds.
[428617.156431] Not tainted 4.18.0-425.10.1.el8_7.x86_64 #1
[428617.156433] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[428617.156433] task:oninit state:D stack: 0 pid: 4327 ppid: 4300 flags:0x00004084
[428617.156436] Call Trace:
[428617.156437] __schedule+0x2d1/0x860
[428617.156439] ? blk_flush_plug_list+0xd7/0x100
[428617.156441] schedule+0x35/0xa0
[428617.156443] ext4_unwritten_wait+0x87/0xc0 [ext4]
[428617.156460] ? finish_wait+0x80/0x80
[428617.156462] ext4_file_write_iter+0x396/0x3e0 [ext4]
[428617.156478] aio_write+0xf6/0x1c0
[428617.156480] ? blk_mq_sched_insert_requests+0x6c/0xf0
[428617.156483] ? io_submit_one+0x7e/0x3c0
[428617.156484] ? kmem_cache_alloc+0x13f/0x280
[428617.156486] io_submit_one+0x131/0x3c0
[428617.156488] __x64_sys_io_submit+0xa2/0x180
[428617.156489] ? syscall_trace_enter+0x1ff/0x2d0
[428617.156492] do_syscall_64+0x5b/0x1b0
[428617.156494] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[428617.156497] RIP: 0033:0x7fd1d50409bd
[428617.156499] Code: Unable to access opcode bytes at RIP 0x7fd1d5040993.
[428617.156500] RSP: 002b:00000000531c4da8 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
[428617.156501] RAX: ffffffffffffffda RBX: 00007fd1d698ab78 RCX: 00007fd1d50409bd
[428617.156502] RDX: 00000000531c4de8 RSI: 0000000000000001 RDI: 00007fd1c4593000
[428617.156503] RBP: 00007fd1c4593000 R08: 00000001358df800 R09: 000000005667b8d8
[428617.156504] R10: ffffffffffffffe0 R11: 0000000000000246 R12: 0000000000000001
[428617.156505] R13: 0000000000000000 R14: 00000000531c4de8 R15: 000000005667b850
[..]
[428617.158050] Kernel panic - not syncing: hung_task: blocked tasks
[428617.158052] CPU: 7 PID: 126 Comm: khungtaskd Kdump: loaded Not tainted 4.18.0-425.10.1.el8_7.x86_64 #1
[428617.158053] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.18227214.B64.2106252220 06/25/2021
[428617.158054] Call Trace:
[428617.158055] dump_stack+0x41/0x60
[428617.158058] panic+0xe7/0x2ac
[428617.158062] watchdog+0x242/0x2f0
[428617.158065] ? hungtask_pm_notify+0x50/0x50
[428617.158068] kthread+0x10b/0x130
[428617.158070] ? set_kthread_struct+0x50/0x50
[428617.158073] ret_from_fork+0x1f/0x40
[..]
- The oninit task PID: 4327 has been in the uninterruptible (UN) state for 2 minutes and 35 seconds.
crash> set 4327
PID: 4327
COMMAND: "oninit"
TASK: ffff9fc84c518000 [THREAD_INFO: ffff9fc84c518000]
CPU: 9
STATE: TASK_UNINTERRUPTIBLE
crash> px ((struct file *)((struct task_struct *)0xffff9fc84c518000)->mm->exe_file)->f_path.dentry
$4 = (struct dentry *) 0xffff9fc8fc257e00
crash> files -d 0xffff9fc8fc257e00
DENTRY INODE SUPERBLK TYPE PATH
ffff9fc8fc257e00 ffff9fc8fc178120 ffff9fc85014f000 REG /users4/ifxdb_14.10.FC8_WE/bin/oninit
crash> ps -m 4327
[0 00:02:35.486] [UN] PID: 4327 TASK: ffff9fc84c518000 CPU: 9 COMMAND: "oninit"
- Backtrace of the oninit task PID: 4327
crash> bt
PID: 4327 TASK: ffff9fc84c518000 CPU: 9 COMMAND: "oninit"
#0 [ffffb2fd13a77bf0] __schedule at ffffffffbb9eb121
#1 [ffffb2fd13a77c80] schedule at ffffffffbb9eb6e5
#2 [ffffb2fd13a77c90] ext4_unwritten_wait at ffffffffc08b7317 [ext4]
#3 [ffffb2fd13a77cd8] ext4_file_write_iter at ffffffffc08b7b46 [ext4]
#4 [ffffb2fd13a77d38] aio_write at ffffffffbb3b9e16
#5 [ffffb2fd13a77e40] io_submit_one at ffffffffbb3ba1a1
#6 [ffffb2fd13a77eb8] __x64_sys_io_submit at ffffffffbb3ba7b2
#7 [ffffb2fd13a77f38] do_syscall_64 at ffffffffbb0043ab
#8 [ffffb2fd13a77f50] entry_SYSCALL_64_after_hwframe at ffffffffbba000a9
RIP: 00007fd1d50409bd RSP: 00000000531c4da8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 00007fd1d698ab78 RCX: 00007fd1d50409bd
RDX: 00000000531c4de8 RSI: 0000000000000001 RDI: 00007fd1c4593000
RBP: 00007fd1c4593000 R8: 00000001358df800 R9: 000000005667b8d8
R10: ffffffffffffffe0 R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000000000 R14: 00000000531c4de8 R15: 000000005667b850
ORIG_RAX: 00000000000000d1 CS: 0033 SS: 002b
- The oninit task PID: 4327 was waiting for an ext4_inode_info->i_unwritten to reach 0 for an ext4 inode:
crash> dis -rl ffffffffc08b7317 | tail -n 5
/usr/src/debug/kernel-4.18.0-425.10.1.el8_7/linux-4.18.0-425.10.1.el8_7.x86_64/fs/ext4/file.c: 111
0xffffffffc08b730b <ext4_unwritten_wait+123>: call 0xffffffffbb1447b0 <init_wait_entry>
0xffffffffc08b7310 <ext4_unwritten_wait+128>: jmp 0xffffffffc08b7317 <ext4_unwritten_wait+135>
0xffffffffc08b7312 <ext4_unwritten_wait+130>: call 0xffffffffbb9eb6b0 <schedule>
0xffffffffc08b7317 <ext4_unwritten_wait+135>: mov $0x2,%edx
Kernel Source: fs/ext4/file.c
*****************************
218 static ssize_t
219 ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
220 {
221 struct inode *inode = file_inode(iocb->ki_filp);
222 int o_direct = iocb->ki_flags & IOCB_DIRECT;
223 int unaligned_aio = 0;
224 int overwrite = 0;
225 ssize_t ret;
...
273 /*
274 * Unaligned direct AIO must be the only IO in flight. Otherwise
275 * overlapping aligned IO after unaligned might result in data
276 * corruption.
277 */
278 if (ret == -EIOCBQUEUED && unaligned_aio)
279 ext4_unwritten_wait(inode);
...
Kernel Source: fs/ext4/file.c
*****************************
107 static void ext4_unwritten_wait(struct inode *inode)
108 {
109 wait_queue_head_t *wq = ext4_ioend_wq(inode);
110
111 wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_unwritten) == 0));
crash> px ext4_unwritten_wait
ext4_unwritten_wait = $2 =
{void (struct inode *)} 0xffffffffc08b7290 <ext4_unwritten_wait>
crash> dis -r ffffffffc08b7b46 | tail -n 3
0xffffffffc08b7b3e <ext4_file_write_iter+910>: mov %rbp,%rdi
0xffffffffc08b7b41 <ext4_file_write_iter+913>: call 0xffffffffc08b7290 <ext4_unwritten_wait>
0xffffffffc08b7b46 <ext4_file_write_iter+918>: mov %r14,%rdi
crash> bt -f | grep ext4_file_write_iter
#3 [ffffb2fd13a77cd8] ext4_file_write_iter at ffffffffc08b7b46 [ext4]
crash> px (0xffffb2fd13a77cd8-0x8)
$5 = 0xffffb2fd13a77cd0
crash> rd 0xffffb2fd13a77cd0
ffffb2fd13a77cd0: ffff9fca42897788 .w.B....
o struct inode * == 0xffff9fca42897788
crash> struct ext4_inode_info.vfs_inode -ox
struct ext4_inode_info {
[0xe8] struct inode vfs_inode;
}
crash> px (0xffff9fca42897788-0xe8)
$6 = 0xffff9fca428976a0
o struct ext4_inode_info * == 0xffff9fca428976a0
crash> struct ext4_inode_info.i_unwritten 0xffff9fca428976a0
i_unwritten = {
counter = 1
},
crash> struct inode.i_dentry 0xffff9fca42897788
i_dentry = {
first = 0xffff9fc85a8cc8f0
},
crash> px (0xffff9fc85a8cc8f0-0xb0)
$7 = 0xffff9fc85a8cc840
o struct dentry * == 0xffff9fc85a8cc840
crash> files -d 0xffff9fc85a8cc840
DENTRY INODE SUPERBLK TYPE PATH
ffff9fc85a8cc840 ffff9fca42897788 ffff9fc84bf55800 REG /informix/data/data3dbs_1
- The ext4 file system mounted without journalling:
crash> mount | grep -e NAME -e ffff9fc84bf55800
MOUNT SUPERBLK TYPE DEVNAME DIRNAME
ffff9fc87136d080 ffff9fc84bf55800 ext4 /dev/mapper/rhel-informix /informix <<< - - -
crash> px ((struct ext4_sb_info *)((struct super_block *)0xffff9fc84bf55800)->s_fs_info)->s_journal
$8 = (struct journal_s *) 0x0
crash> struct inode.i_sb 0xffff9fca42897788
i_sb = 0xffff9fc84bf55800,
crash> struct super_block.s_bdev 0xffff9fc84bf55800
s_bdev = 0xffff9fc880c32d00,
crash> struct block_device.bd_disk 0xffff9fc880c32d00
bd_disk = 0xffff9fc8706a9000,
crash> struct gendisk.disk_name 0xffff9fc8706a9000
disk_name = "dm-8\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
crash> dmshow --lvs | grep -e NAME -e dm-8
LV DM-X DEV LV NAME VG NAME OPEN COUNT LV SIZE (MiB) PV NAME
dm-8 informix rhel 1 512000.00 sda3
crash> dev -d | grep -e NAME -e dm-8
MAJOR GENDISK NAME REQUEST_QUEUE TOTAL ASYNC SYNC
253 ffff9fc8706a9000 dm-8 ffff9fc848c21fb0 0 0 0
- Let's verify the rescuer worker thread dio/dm-8 PID: 4585
crash> set 4585
PID: 4585
COMMAND: "dio/dm-8"
TASK: ffff9fca0d9ea800 [THREAD_INFO: ffff9fca0d9ea800]
CPU: 4
STATE: TASK_UNINTERRUPTIBLE
- It has been in the uninterruptible (UN) state for 2 minutes and 35 seconds.
crash> ps -m 4585
[0 00:02:35.436] [UN] PID: 4585 TASK: ffff9fca0d9ea800 CPU: 4 COMMAND: "dio/dm-8"
- Backtrace of the dio/dm-8 thread PID: 4585
crash> bt
PID: 4585 TASK: ffff9fca0d9ea800 CPU: 4 COMMAND: "dio/dm-8"
#0 [ffffb2fd13bd7c98] __schedule at ffffffffbb9eb121
#1 [ffffb2fd13bd7d28] schedule at ffffffffbb9eb6e5
#2 [ffffb2fd13bd7d38] rwsem_down_write_slowpath at ffffffffbb15503c
#3 [ffffb2fd13bd7dc0] __generic_file_fsync at ffffffffbb389f92
#4 [ffffb2fd13bd7df0] ext4_sync_file at ffffffffc08b9310 [ext4]
#5 [ffffb2fd13bd7e30] dio_complete at ffffffffbb3a5353
#6 [ffffb2fd13bd7e68] process_one_work at ffffffffbb110767
#7 [ffffb2fd13bd7ea8] rescuer_thread at ffffffffbb110b1b
#8 [ffffb2fd13bd7f10] kthread at ffffffffbb117f8b
#9 [ffffb2fd13bd7f50] ret_from_fork at ffffffffbba0024f
- It was waiting for the rw_semaphore (inode->i_rwsem) lock.
crash> dis -rl ffffffffbb389f92 | tail -n 6
/usr/src/debug/kernel-4.18.0-425.10.1.el8_7/linux-4.18.0-425.10.1.el8_7.x86_64/./include/linux/fs.h: 787
0xffffffffbb389f83 <__generic_file_fsync+51>: lea 0xa8(%rbp),%r14
0xffffffffbb389f8a <__generic_file_fsync+58>: mov %r14,%rdi
0xffffffffbb389f8d <__generic_file_fsync+61>: call 0xffffffffbb9edd50 <down_write>
/usr/src/debug/kernel-4.18.0-425.10.1.el8_7/linux-4.18.0-425.10.1.el8_7.x86_64/fs/libfs.c: 1004
0xffffffffbb389f92 <__generic_file_fsync+66>: mov 0x30(%rbp),%rdi
Kernel Source: fs/libfs.c
*************************
992 int __generic_file_fsync(struct file *file, loff_t start, loff_t end,
993 int datasync)
994 {
995 struct inode *inode = file->f_mapping->host;
996 int err;
997 int ret;
998
999 err = file_write_and_wait_range(file, start, end);
1000 if (err)
1001 return err;
1002
1003 inode_lock(inode);
Kernel Source: include/linux/fs.h
*********************************
785 static inline void inode_lock(struct inode *inode)
786 {
787 down_write(&inode->i_rwsem);
crash> dis -r ffffffffc08b9310 | tail -n 3
0xffffffffc08b9308 <ext4_sync_file+456>: mov %rbx,%rdi
0xffffffffc08b930b <ext4_sync_file+459>: call 0xffffffffbb389f50 <__generic_file_fsync>
0xffffffffc08b9310 <ext4_sync_file+464>: mov %eax,%r13d
crash> dis -r ffffffffbb389f92 | head -n 7
0xffffffffbb389f50 <__generic_file_fsync>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffffbb389f55 <__generic_file_fsync+5>: push %r14
0xffffffffbb389f57 <__generic_file_fsync+7>: push %r13
0xffffffffbb389f59 <__generic_file_fsync+9>: push %r12
0xffffffffbb389f5b <__generic_file_fsync+11>: mov %ecx,%r12d
0xffffffffbb389f5e <__generic_file_fsync+14>: push %rbp
0xffffffffbb389f5f <__generic_file_fsync+15>: push %rbx
crash> bt -f | grep __generic_file_fsync -A 4
#3 [ffffb2fd13bd7dc0] __generic_file_fsync at ffffffffbb389f92
ffffb2fd13bd7dc8: ffff9fc90b9dde00 ffff9fca42897788
ffffb2fd13bd7dd8: 0000000000000001 0000000000000800
ffffb2fd13bd7de8: 000000008b6517ff ffffffffc08b9310
#4 [ffffb2fd13bd7df0] ext4_sync_file at ffffffffc08b9310 [ext4]
o struct file * == 0xffff9fc90b9dde00
crash> px ((struct file *)0xffff9fc90b9dde00)->f_mapping.host
$9 = (struct inode *) 0xffff9fca42897788
o struct inode * == 0xffff9fca42897788
crash> struct inode.i_dentry 0xffff9fca42897788
i_dentry = {
first = 0xffff9fc85a8cc8f0
},
crash> px (0xffff9fc85a8cc8f0-0xb0)
$10 = 0xffff9fc85a8cc840
o struct dentry * == 0xffff9fc85a8cc840
crash> files -d 0xffff9fc85a8cc840
DENTRY INODE SUPERBLK TYPE PATH
ffff9fc85a8cc840 ffff9fca42897788 ffff9fc84bf55800 REG /informix/data/data3dbs_1
crash> struct inode.i_rwsem 0xffff9fca42897788 -ox
struct inode {
[ffff9fca42897830] struct rw_semaphore i_rwsem;
}
o struct rw_semaphore * == 0xffff9fca42897830
crash> struct rw_semaphore.wait_list 0xffff9fca42897830 -ox
struct rw_semaphore {
[ffff9fca42897838] struct list_head wait_list;
}
- The waitlist of the rw_semaphore has 6 tasks.
crash> list -H 0xffff9fca42897838
ffffb2fd066abd90
ffffb2fd065dfd90
ffffb2fd194efd90
ffffb2fd17a47d90
ffffb2fd1b2dfd90
ffffb2fd13bd7d60
- List of the task waiting for the rw_semaphore:
crash> list -H 0xffff9fca42897838 -s rwsem_waiter.task | grep task | tr -d , | awk '{print $3}' > rw_task
crash> ps -m < rw_task
[0 00:02:35.467] [UN] PID: 3613198 TASK: ffff9fdee0a50000 CPU: 4 COMMAND: "kworker/4:5"
[0 00:02:35.467] [UN] PID: 3613196 TASK: ffff9fe291838000 CPU: 4 COMMAND: "kworker/4:3"
[0 00:02:35.467] [UN] PID: 3606261 TASK: ffff9fd99a8aa800 CPU: 4 COMMAND: "kworker/4:7"
[0 00:02:35.467] [UN] PID: 3609395 TASK: ffff9fe68de85000 CPU: 4 COMMAND: "kworker/4:1"
[0 00:02:35.467] [UN] PID: 3612070 TASK: ffff9fd2ae145000 CPU: 4 COMMAND: "kworker/4:0"
[0 00:02:35.436] [UN] PID: 4585 TASK: ffff9fca0d9ea800 CPU: 4 COMMAND: "dio/dm-8"
crash> list -H 0xffff9fca42897838 -o rwsem_waiter.list -s rwsem_waiter.type | grep type | awk '{print $3}' | sort | uniq -c
6 RWSEM_WAITING_FOR_WRITE,
- The rw_semaphore (inode->i_rwsem) was held by the oninit task PID: 4327
crash> px ((struct rw_semaphore *)0xffff9fca42897830)->owner
$11 = {
counter = 0xffff9fc84c518000
}
o struct task_struct * == 0xffff9fc84c518000
crash> task_struct.pid 0xffff9fc84c518000
pid = 4327,
crash> ps 4327
PID PPID CPU TASK ST %MEM VSZ RSS COMM
4327 4300 9 ffff9fc84c518000 UN 0.0 29594148 25704 oninit
crash> set 4327
PID: 4327
COMMAND: "oninit"
TASK: ffff9fc84c518000 [THREAD_INFO: ffff9fc84c518000]
CPU: 9
STATE: TASK_UNINTERRUPTIBLE
- It has acquired the rw_semaphore in the ext4_file_write_iter() function.
crash> bt
PID: 4327 TASK: ffff9fc84c518000 CPU: 9 COMMAND: "oninit"
#0 [ffffb2fd13a77bf0] __schedule at ffffffffbb9eb121
#1 [ffffb2fd13a77c80] schedule at ffffffffbb9eb6e5
#2 [ffffb2fd13a77c90] ext4_unwritten_wait at ffffffffc08b7317 [ext4]
#3 [ffffb2fd13a77cd8] ext4_file_write_iter at ffffffffc08b7b46 [ext4]
#4 [ffffb2fd13a77d38] aio_write at ffffffffbb3b9e16
#5 [ffffb2fd13a77e40] io_submit_one at ffffffffbb3ba1a1
#6 [ffffb2fd13a77eb8] __x64_sys_io_submit at ffffffffbb3ba7b2
#7 [ffffb2fd13a77f38] do_syscall_64 at ffffffffbb0043ab
#8 [ffffb2fd13a77f50] entry_SYSCALL_64_after_hwframe at ffffffffbba000a9
RIP: 00007fd1d50409bd RSP: 00000000531c4da8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 00007fd1d698ab78 RCX: 00007fd1d50409bd
RDX: 00000000531c4de8 RSI: 0000000000000001 RDI: 00007fd1c4593000
RBP: 00007fd1c4593000 R8: 00000001358df800 R9: 000000005667b8d8
R10: ffffffffffffffe0 R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000000000 R14: 00000000531c4de8 R15: 000000005667b850
ORIG_RAX: 00000000000000d1 CS: 0033 SS: 002b
- The oninit task PID: 4327 and dio/dm-8 thread PID: 4585 were deadlocked.
https://access.redhat.com/solutions/7006119