[SERVER-18674] Very low throughput during portion of checkpoint under WiredTiger Created: 27/May/15  Updated: 14/Apr/16  Resolved: 16/Jul/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.1.3
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Bruce Lucas (Inactive)
Assignee: David Hows
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: try-47.png (PNG), try-49.png (PNG), try-50.png (PNG), wtmonitor.html (HTML)
Issue Links:
Related
related to SERVER-18875 Oplog performance on WT degrades over... Closed
related to SERVER-18677 Throughput drop during transaction pi... Closed
related to SERVER-18829 Cache usage exceeds configured maximu... Closed
related to WT-1907 Speed up transaction-refresh Closed
is related to SERVER-18314 Stall during fdatasync phase of check... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   
  • 132 GB memory, 32 processors, slowish SSDs, 20 GB WT cache
  • YCSB 10 fields/doc, 20M docs, 50/50 workload (zipfian distribution), 20 threads
  • data set size is ~23 GB, but cache usage can grow to ~2x that and storage size can grow to ~3x, believed to be due to the random update workload
  • to avoid SERVER-18314, used either of the following (with the same results)
    • ext4 mounted with -o commit=10000000
    • xfs
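
For reference, the workload parameters above can be written out as YCSB properties. This is only a sketch, not the reporter's actual configuration: it assumes "50/50" means 50% reads and 50% updates, and the helper names ycsb_properties and render_properties are hypothetical. The property keys are standard YCSB core workload and client properties.

```python
def ycsb_properties(record_count=20_000_000, field_count=10, threads=20):
    """Build YCSB properties matching the setup described in this ticket.

    Assumption: the 50/50 workload is a read/update mix.
    """
    return {
        "recordcount": record_count,         # 20M docs
        "fieldcount": field_count,           # 10 fields/doc
        "readproportion": 0.5,               # assumed read share
        "updateproportion": 0.5,             # assumed update share
        "requestdistribution": "zipfian",    # zipfian key distribution
        "threadcount": threads,              # 20 client threads
    }


def render_properties(props):
    """Render the dict in Java .properties style, one key=value per line."""
    return "\n".join(f"{key}={value}" for key, value in props.items())


if __name__ == "__main__":
    print(render_properties(ycsb_properties()))
```

The rendered text could be saved to a workload file and passed to YCSB; the MongoDB connection settings and operation count are deliberately left out because the ticket does not state them.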

3.1.3-pre (f5450e9bc5cf63c3dc4d9cb416713b0f6970e6d4)

3.1.3 (51ad8eff0d50387f0565d23a15e7ba1db0ea962c)

3.1.4-pre (36ac7a5d8a6cc4f6280f90ce743ab05a77a541a8)

Notes:

  • the issue for this ticket is the period from C to D; B-C appears to be a different issue.
  • appears to be a regression in 3.1.3
  • same result with xfs and ext4 -o commit=largenumber
  • superficially similar to SERVER-18314 in that it coincides with an fdatasync but:
    • used settings that were shown to eliminate SERVER-18314 issue
    • writes did not seem to be blocked as in SERVER-18314
    • only affected one of the two periods where fdatasync runs
    • so appears to be WT issue, not platform issue
  • attempted to get stack traces but it appears process is not interruptible during the fdatasync and this issue coincides with that


 Comments   
Comment by David Hows [ 16/Jul/15 ]

Marking this as Done based on my testing, as I was unable to reproduce.

There have been a number of changes to related parts of MongoDB and WT including the linked SERVER-18875, SERVER-18829 and WT-1907. Suspect that some combination of those resolved the underlying issue.

Comment by David Hows [ 15/Jul/15 ]

I tested this along with SERVER-18677.

Ran this on an r3.4xlarge using Bruce's setup and MongoDB master as of 15/7/2015 (hash: 3f301ac62e).

The instance had a generic (slow) SSD storage formatted using XFS.

With the 10M workload there was no drop in throughput associated with checkpoints as seen in Bruce's graphs. There was only one small drop in query throughput and this was associated with an increase in writes (update ops).

With the 20M workload there were no sustained drops in throughput.
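
One way to make "sustained drop" checkable against a series of per-interval throughput samples is sketched below. This is not the method used in the testing above; the function name, the median baseline, the 50% cutoff, and the minimum run length are all illustrative assumptions.

```python
from statistics import median


def find_sustained_drops(samples, threshold_ratio=0.5, min_length=3):
    """Return (start, end) index pairs, end exclusive, for runs of at least
    min_length consecutive samples below threshold_ratio * median(samples).

    samples: throughput per sampling interval (e.g. ops/sec).
    All parameters are assumed defaults, not values from the ticket.
    """
    if not samples:
        return []
    cutoff = median(samples) * threshold_ratio
    drops = []
    start = None
    for i, value in enumerate(samples):
        if value < cutoff:
            if start is None:
                start = i  # a low-throughput run begins here
        else:
            if start is not None and i - start >= min_length:
                drops.append((start, i))
            start = None
    # Close a run that extends to the end of the series.
    if start is not None and len(samples) - start >= min_length:
        drops.append((start, len(samples)))
    return drops


if __name__ == "__main__":
    ops = [100, 100, 100, 10, 10, 10, 10, 100, 100, 20, 100]
    print(find_sustained_drops(ops))
```

With these defaults, a single low sample is ignored, while a run of several consecutive low samples is reported as one interval that can then be compared against checkpoint start and end times.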

Comment by Michael Cahill (Inactive) [ 06/Jul/15 ]

david.hows can you please retest against MongoDB master?

Generated at Thu Feb 08 03:48:25 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.