[SERVER-18314] Stall during fdatasync phase of checkpoints under WiredTiger and EXT4 Created: 04/May/15  Updated: 09/Nov/23  Resolved: 19/Jun/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.0.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Ramon Fernandez Marina
Resolution: Duplicate Votes: 0
Labels: WTplaybook, wttt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File count=1000000.png     PNG File lz4.png     PNG File snappy.png     PNG File try-13.png     PNG File try-14.png     PNG File try-16.png     PNG File try-21.png     PNG File try-23.png     PNG File try-25.png     PNG File try-26.png     PNG File try-29.png     PNG File try-30.png     PNG File try-33.png     PNG File try-34.png     PNG File try-35.png     PNG File try-36.png    
Issue Links:
Depends
depends on WT-1941 Sync the journal directory, not its p... Closed
Documented
Duplicate
duplicates DOCS-5506 Recommend use of XFS Closed
Related
related to SERVER-18629 WiredTiger journal system syncs wrong... Closed
related to SERVER-18674 Very low throughput during portion of... Closed
related to DOCS-7104 EXT4 and WT incompatibilities Closed
is related to SERVER-18315 Throughput drop during transaction pi... Closed
is related to SERVER-22606 Startup warning if ext4 is used with ... Closed
Operating System: ALL
Participants:
Case:

 Description   
  • YCSB 30M documents, 10 fields, ~1kB/document, total ~30GB
  • 50/50 read/update workload
  • 40 GB cache, 128 GB memory, 32 CPUs
  • slow SSD disk (~80-100 MB/s)
  • no journal (to simplify the situation)
  • per mongostat, cache is at 100% utilization, 80% dirty pretty much throughout the test.

During each checkpoint two calls to fdatasync are made. Because this scenario is i/o constrained the fdatasyncs take a substantial amount of time, and during both fdatasync calls throughput falls to exactly 0 for the duration of the fdatasync. This is seen in A-B, C-D, E-F, G-H, I-J, K-L below.

In many, but not all, such cases WT bumps the "eviction server unable to reach goal" counter.

Similar test with a larger cache (the default 64GB) does not show this issue.

Note: this is the same test as reported in SERVER-18315; opening two separate tickets to track what may be separate issues.



 Comments   
Comment by Daniel Pasette (Inactive) [ 19/Jun/15 ]

Resolving this issue as a duplicate of DOCS-5506. Production notes section has been updated to reflect the recommendation to use xfs.

Comment by Bruce Lucas (Inactive) [ 19/Jun/15 ]

Wrote a standalone C program that does writes on some number of threads while doing an fdatasync, and demonstrated that the cause of the stalls in mongod was that writes to the file can block for the duration of the fdatasync. Also observed that the writes would not necessarily stall immediately after the start of the fdatasync, but might only begin to stall after a few seconds. Using perf to probe context switch events was able to determine that this is the point where the writes block:

fs 30526 945970.213337: cs: 
   ffffffff8172641f __schedule ([kernel.kallsyms])
   ffffffff817269d9 schedule ([kernel.kallsyms])
   ffffffff81287435 wait_transaction_locked ([kernel.kallsyms])
   ffffffff81287750 start_this_handle ([kernel.kallsyms])
   ffffffff81287c73 jbd2__journal_start ([kernel.kallsyms])
   ffffffff8126eb19 __ext4_journal_start_sb ([kernel.kallsyms])
   ffffffff8124735a ext4_dirty_inode ([kernel.kallsyms])
   ffffffff811e7cba __mark_inode_dirty ([kernel.kallsyms])
   ffffffff811d8cc1 update_time ([kernel.kallsyms])
   ffffffff811d8ed0 file_update_time ([kernel.kallsyms])
   ffffffff81150a40 __generic_file_aio_write ([kernel.kallsyms])
   ffffffff81150ce8 generic_file_aio_write ([kernel.kallsyms])
   ffffffff8123bce2 ext4_file_write ([kernel.kallsyms])
   ffffffff811bd7ea do_sync_write ([kernel.kallsyms])
   ffffffff811bdf74 vfs_write ([kernel.kallsyms])
   ffffffff811be9a9 sys_write ([kernel.kallsyms])
   ffffffff817330bd system_call_fastpath ([kernel.kallsyms])
       7fccf4d8235d [unknown] (/lib/x86_64-linux-gnu/libpthread-2.19.so)
           7fccf49cfec5 __libc_start_main (/lib/x86_64-linux-gnu/libc-2.19.so)

So it appears that the stall is related to ext4 journaling - apparently journaling is blocked while the fdatasync is in progress. The delay in stalling mentioned above is explained by the periodic journal commits: the writer thread is able to continue writing during fdatasync until the next journal commit; the default commit interval is 5 seconds, and indeed observed that the writes might be able to continue up to 5 seconds after the start of the fdatasync.

Since the stalls occur in ext4-specific code, and are related to periodic journal commits, we can avoid the stalls, both in the test program and in mongod, by either

  • using XFS
  • mounting the EXT4 filesystem with a large commit interval (-o commit=largenumber) to effectively disable periodic journal commits. Note: this is safe for mongod because it ensures durability by doing fdatasyncs at appropriate times, but it dangerous with regard durability for typical programs that might happen to be using the same filesystem.
Comment by Bruce Lucas (Inactive) [ 13/May/15 ]

michael.cahill, tried master as of yesterday's nightly build.
YCSB 50/50, 10M docs 10 fields, 20 GB cache, 20 threads

  • master indeed seems to fix the issue during the "transaction pinned" phase (SERVER-18315)
  • the fdatasync stall (this ticket) can be fixed by in addition either
    • using ext4 with -o commit=largenumber mount option, or
    • using xfs

3.0.2

MASTER, EXT4

  • fixes SERVER-18315 (C-D)
  • fdatasync stalls (this ticket, B-C, D-E) remain

MASTER, EXT4, BUT DISABLE EXT4 TIMED JOURNAL COMMITS (-o commit=... option)

  • fixes fdatasync stalls

MASTER, XFS

  • also fixes fdatasync stalls
Generated at Thu Feb 08 03:47:16 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.