728x90

Environment

  • Red Hat Enterprise Linux 7 (7.7 and newer)
  • kernel-3.10.0-1062.el7 and newer.
  • Oracle Redo Logs on XFS file systems.
  • Redo Log using direct AIO via io_submit().

Issue

  • Customers who use file systems for the Oracle Redo Log, and in this specific case XFS have seen performance regressions when moving from 7.2 to 7.8.
  • Very slow io_submit syscalls lead to poor Oracle DB performance when the Oracle log writer process uses XFS filesystems, without any visible delays in iostat on the disks used by the filesystem.

Resolution

  • Moving the Redo Log destination on to EXT4 is the easiest way to resolve this for now.
  • If XFS is required, then the Redo Log block size needs to be at least equal to the filesystem blocksize. This can mean either increasing the Redo Log block size to 4k , to match the default XFS blocksize, or rebuilding the XFS filesystem with a 512b block size.
    • For instructions on changing the Redo Log block size you will need to review the database manuals and consult the database vendor (Oracle).
    • Rebuilding the XFS filesystem with a 512b blocksize can be done with the mkfs.xfs options -b size=512 -m crc=0
  • The commit introducing the delays addressed possible corruption related issues. There is no planned change in the XFS code of Red Hat Enterprise Linux 7 to improve the performance of unaligned dio writes.

Root Cause

Based on the stack traces collected while the issue was appearing, the delay is caused because of waiting in inode_dio_wait caused by the submission of direct I/O write requests that were not aligned to filesystem blocks. In such a scenario, the following upstream patch, serializes requests (more thoroughly than before) in order to avoid data corruption.

This patch has been backported to Red Hat Enterprise Linux 7, in kernel-3.10.0-1062.el7, with commit:

ce7cfcb2740 [fs] xfs: serialize unaligned dio writes against all other dio writes

XFS applies more strict serialization constraints to unaligned
direct writes to accommodate things like direct I/O layer zeroing,
unwritten extent conversion, etc. Unaligned submissions acquire the
exclusive iolock and wait for in-flight dio to complete to ensure
multiple submissions do not race on the same block and cause data
corruption.

This generally works in the case of an aligned dio followed by an
unaligned dio, but the serialization is lost if I/Os occur in the
opposite order. If an unaligned write is submitted first and
immediately followed by an overlapping, aligned write, the latter
submits without the typical unaligned serialization barriers because
there is no indication of an unaligned dio still in-flight. This can
lead to unpredictable results.

To provide proper unaligned dio serialization, require that such
direct writes are always the only dio allowed in-flight at one time
for a particular inode. We already acquire the exclusive iolock and
drain pending dio before submitting the unaligned dio. Wait once
more after the dio submission to hold the iolock across the I/O and
prevent further submissions until the unaligned I/O completes. This
is heavy handed, but consistent with the current pre-submission
serialization for unaligned direct writes.

Diagnostic Steps

Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.

  • Database logs frequently report long delays from the Redo Log writer, similar to the following:
  • Warning: log write elapsed time 158ms, size 35041KB
    
  • Collecting strace from the Redo Log writer process reveals io_submit calls taking long to complete, matching the logs from the database. Strace options -yy -T -tt -f provide useful information on the files being accessed and on timing. Strace will need to be attached to the Redo Log writer process with the -p option.
    In the output of strace, offsets and bytes written in each io_submit call can be seen. These are not multiples of the XFS filesystem block size (writes are unaligned to the filesystem block size and because of commit ce7cfcb2740 serialized to avoid data corruption).
    A sample of such an io_submit call as captured by strace is the following:This sample is from an XFS filesystem with the default 4k blocksize. We can observe that the offsets are not multiples of 4096 - therefore IO is not aligned to filesystem blocks.
  • 347496 13:39:59.853377 io_submit(140055665958912, 66, [
        {data=..., pwrite, fildes=262, str="..."..., nbytes=1048576, offset=5254478848},
        {data=..., pwrite, fildes=262, str="..."..., nbytes=285696, offset=5255527424},
        {data=..., pwrite, fildes=262, str="..."..., nbytes=1048576, offset=5255813120},
        {data=..., pwrite, fildes=262, str="..."..., nbytes=195072, offset=5256861696},
        {data=..., pwrite, fildes=262, str="..."..., nbytes=1048576, offset=5257056768},
        {data=..., pwrite, fildes=262, str="..."..., nbytes=843264, offset=5258105344},
        {data=..., pwrite, fildes=262, str="..."..., nbytes=1048576, offset=5258948608},
        {data=..., pwrite, fildes=262, str="..."..., nbytes=1048576, offset=5259997184},
        {data=..., pwrite, fildes=262, str="..."..., nbytes=407040, offset=5261045760},
        ...
    ...]) = 66 <0.120894>
    
  • Using the io_submit.stp systemtap script (attached at the bottom of the article), capture the slow io_submit and look for the dio_wait calls. In order to use the io_submit.stp script, the system needs to be prepared according to: What is SystemTap and how to use it?
    • Save the attached script as io_submit.stp. A copy of the script is also included in the systemtap-client package and can be found in /usr/share/systemtap/examples/io/io_submit.stp, once the package is installed. It can also be found in the upstream pages of systemtap
    • Compile and run the script using stap -g -d xfs -d kernel io_submit.stp
    • Run long enough to capture a few events.
    • Interrupt with CTRL+c. The script will only output data after stopping with CTRL+c.
    Example output from the io_submit.stp script. This output reveals that 218943+7630 io_submit calls have been waiting in inode_dio_wait called from xfs_file_dio_aio_write. This is the vast majority of calls. This code path matches the changes introduced by commit ce7cfcb2740 [fs] xfs: serialize unaligned dio writes against all other dio writes:
  • WARNING: Missing unwind data for a module, rerun with 'stap -d (unknown; retry with -DDEBUG_UNWIND)'
    WARNING: too many pending (warning) messages
    Ready!
    WARNING: Missing unwind data for a module, rerun with 'stap -d kernel'
    ^C218943: 0xffffffffae185d60 : schedule+0x0/0x70 [kernel]
     0xffffffffadc6ab49 : inode_dio_wait+0xd9/0x100 [kernel]
     0xffffffffc070c698 : xfs_file_dio_aio_write+0x2a8/0x390 [xfs]
     0xffffffffc070cb42 : xfs_file_aio_write+0x102/0x1b0 [xfs]
     0xffffffffadca4e13 : do_io_submit+0x3e3/0x8a0 [kernel]
     0xffffffffadca52e0 : sys_io_submit+0x10/0x20 [kernel]
     0xffffffffae18b459 : kretprobe_trampoline+0x0/0x57 [kernel]
     0x7fffffff7ef0 : 0x7fffffff7ef0
    7630: 0xffffffffae185d60 : schedule+0x0/0x70 [kernel]
     0xffffffffadc6ab49 : inode_dio_wait+0xd9/0x100 [kernel]
     0xffffffffc070c710 : xfs_file_dio_aio_write+0x320/0x390 [xfs]
     0xffffffffc070cb42 : xfs_file_aio_write+0x102/0x1b0 [xfs]
     0xffffffffadca4e13 : do_io_submit+0x3e3/0x8a0 [kernel]
     0xffffffffadca52e0 : sys_io_submit+0x10/0x20 [kernel]
     0xffffffffae18b459 : kretprobe_trampoline+0x0/0x57 [kernel]
     0x7fffffff7ef0 : 0x7fffffff7ef0
    511: 0xffffffffae185d60 : schedule+0x0/0x70 [kernel]
     0xffffffffae1874c5 : rwsem_down_write_failed+0x215/0x3c0 [kernel]
     0xffffffffadd96677 : call_rwsem_down_write_failed+0x17/0x30 [kernel]
     0xffff9bfd7d3bfd10 : 0xffff9bfd7d3bfd10
     0xffffffffc0719070 : xfs_ilock+0xa0/0x120 [xfs]
     0xffffffffc070c4a7 : xfs_file_dio_aio_write+0xb7/0x390 [xfs]
     0xffffffffc070cb42 : xfs_file_aio_write+0x102/0x1b0 [xfs]
     0xffffffffadca4e13 : do_io_submit+0x3e3/0x8a0 [kernel]
     0xffffffffadca52e0 : sys_io_submit+0x10/0x20 [kernel]
     0xffffffffae18b459 : kretprobe_trampoline+0x0/0x57 [kernel]
    110: 0xffffffffae185d60 : schedule+0x0/0x70 [kernel]
     0xffffffffadc6ab49 : inode_dio_wait+0xd9/0x100 [kernel]
     0xffffffffc070c698 : xfs_file_dio_aio_write+0x2a8/0x390 [xfs]
     0xffffffffc070cb42 : xfs_file_aio_write+0x102/0x1b0 [xfs]
     0xffffffffadca4e13 : do_io_submit+0x3e3/0x8a0 [kernel]
     0xffffffffadca52e0 : sys_io_submit+0x10/0x20 [kernel]
     0xffffffffae18b459 : kretprobe_trampoline+0x0/0x57 [kernel]
     0x7fffffff7790 : 0x7fffffff7790
    9: 0xffffffffae185d60 : schedule+0x0/0x70 [kernel]
     0xffffffffadc6ab49 : inode_dio_wait+0xd9/0x100 [kernel]
     0xffffffffc070c710 : xfs_file_dio_aio_write+0x320/0x390 [xfs]
     0xffffffffc070cb42 : xfs_file_aio_write+0x102/0x1b0 [xfs]
     0xffffffffadca4e13 : do_io_submit+0x3e3/0x8a0 [kernel]
     0xffffffffadca52e0 : sys_io_submit+0x10/0x20 [kernel]
     0xffffffffae18b459 : kretprobe_trampoline+0x0/0x57 [kernel]
     0x7fffffff7790 : 0x7fffffff7790
    5: 0xffffffffae185d60 : schedule+0x0/0x70 [kernel]
     0xffffffffae187775 : rwsem_down_read_failed+0x105/0x1c0 [kernel]
     0xffffffffadd96648 : call_rwsem_down_read_failed+0x18/0x30 [kernel]
     0xffff9bfd7d3bf9e0 : 0xffff9bfd7d3bf9e0
     0xffffffffc07190a9 : xfs_ilock+0xd9/0x120 [xfs]
     0xffffffffc0719120 : xfs_ilock_data_map_shared+0x30/0x40 [xfs]
     0xffffffffc06ff7ea : __xfs_get_blocks+0x9a/0x6d0 [xfs]
     0xffffffffc0700ff4 : xfs_get_blocks_direct+0x14/0x20 [xfs]
     0xffffffffadc917f6 : do_blockdev_direct_IO+0xf96/0x20a0 [kernel]
     0xffffffffadc92955 : __blockdev_direct_IO+0x55/0x60 [kernel]
     0xffffffffc070c570 : xfs_file_dio_aio_write+0x180/0x390 [xfs]
     0xffffffffc070cb42 : xfs_file_aio_write+0x102/0x1b0 [xfs]
     0xffffffffadca4e13 : do_io_submit+0x3e3/0x8a0 [kernel]
     0xffffffffadca52e0 : sys_io_submit+0x10/0x20 [kernel]
     0xffffffffae18b459 : kretprobe_trampoline+0x0/0x57 [kernel]
    2: 0xffffffffae185d60 : schedule+0x0/0x70 [kernel]
     0xffffffffadc6ab49 : inode_dio_wait+0xd9/0x100 [kernel]
     0xffffffffc070c698 : xfs_file_dio_aio_write+0x2a8/0x390 [xfs]
     0xffffffffc070cb42 : xfs_file_aio_write+0x102/0x1b0 [xfs]
     0xffffffffadca4e13 : do_io_submit+0x3e3/0x8a0 [kernel]
     0xffffffffadca52e0 : sys_io_submit+0x10/0x20 [kernel]
     0xffffffffae18b459 : kretprobe_trampoline+0x0/0x57 [kernel]
     0x7fffffff7da0 : 0x7fffffff7da0
    1: 0xffffffffae185d60 : schedule+0x0/0x70 [kernel]
     0xffffffffadc6ab49 : inode_dio_wait+0xd9/0x100 [kernel]
     0xffffffffc070c710 : xfs_file_dio_aio_write+0x320/0x390 [xfs]
     0xffffffffc070cb42 : xfs_file_aio_write+0x102/0x1b0 [xfs]
     0xffffffffadca4e13 : do_io_submit+0x3e3/0x8a0 [kernel]
     0xffffffffadca52e0 : sys_io_submit+0x10/0x20 [kernel]
     0xffffffffae18b459 : kretprobe_trampoline+0x0/0x57 [kernel]
     0x7fffffff7da0 : 0x7fffffff7da0
    

 

 

https://access.redhat.com/solutions/5906661

728x90

+ Recent posts