[SERVER-32139] Oplog truncation creates large amount of dirty data in cache Created: 01/Dec/17  Updated: 01/Mar/18  Resolved: 28/Feb/18

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.6.0-rc8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Luke Chen
Resolution: Duplicate Votes: 0
Labels: RF36
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2017-12-04 at 8.01.26 pm.png     PNG File Screen Shot 2017-12-04 at 8.14.44 pm.png     PNG File comparison.png     PNG File correlations.png     File dd-3410.tar     File dd-rc8.tar     Zip Archive dd.wt3767_68.zip     Zip Archive dd_mdb_master.zip     File dd_wt3805.tgz     PNG File master_vs_wt3767_3768.png     PNG File master_with_wt-3805.png     PNG File optimes.png     File repro.sh     PNG File smalloplog.png    
Issue Links:
Depends
depends on WT-3805 Avoid reading lookaside pages in trun... Closed
Duplicate
is duplicated by SERVER-32547 Mongo stalls while evicting dirty cache Closed
Related
related to SERVER-32154 Unbounded lookaside table growth duri... Closed
Operating System: ALL
Sprint: Storage Non-NYC 2018-03-12
Participants:
Linked BF Score: 0

 Description   

1-node repl set, HDD, 500 GB oplog, 4 GB cache, 5 threads inserting 130 kB docs. Left is 3.6.0-rc8, right is 3.4.10 (chart attached to this ticket).

At point A, where the oplog fills and oplog truncation begins in 3.6.0-rc8, we see large amounts of the oplog being read into cache, large amounts of dirty data in cache, and resulting operation stalls. This does not occur at the corresponding point B when running on 3.4.10.

FTDC data for the above two runs is attached.
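
The attached repro.sh is not inlined here; as a rough orientation only, a minimal sketch of a comparable workload (pymongo against a local single-node replica set; connection string, collection name, and document shape are assumptions, not taken from repro.sh) would be:

    # Sketch of a workload like the one described above: 5 threads inserting
    # ~130 kB documents into a 1-node replica set. All names are illustrative.
    import threading
    from pymongo import MongoClient

    DOC = {"payload": "x" * 130 * 1024}  # ~130 kB, highly compressible

    def insert_loop():
        coll = MongoClient("mongodb://localhost:27017/?replicaSet=rs0").test.c
        while True:
            coll.insert_one(dict(DOC))  # copy so each insert gets a fresh _id

    for _ in range(5):
        threading.Thread(target=insert_loop, daemon=True).start()
    threading.Event().wait()  # run until interrupted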



 Comments   
Comment by Luke Chen [ 28/Feb/18 ]

WT-3805 addressed the issue recorded in this ticket. Closing this ticket as a duplicate.

Comment by Luke Chen [ 28/Feb/18 ]

I re-ran the repro.sh script on an Ubuntu machine (with SSD) with a binary of the latest MongoDB master (git version: acd27aa33043cd1bdf94e9e8017001e0ea46a41a), which includes the changes from WT-3805. This time "ss wt cache dirty fill ratio" remained at the same level after oplog truncation started. The diagnostic.data archive dd_wt3805.tgz is uploaded.
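
For reference, the same ratio can be approximated outside FTDC by polling serverStatus; the WiredTiger cache stat names below are assumptions and may differ between versions:

    # Rough external check of the cache dirty fill ratio via serverStatus.
    import time
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    while True:
        cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
        dirty = cache["tracked dirty bytes in the cache"]
        limit = cache["maximum bytes configured"]
        print("dirty fill ratio: %.3f" % (dirty / float(limit)))
        time.sleep(1)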

Comment by Michael Cahill (Inactive) [ 27/Feb/18 ]

luke.chen, now that WT-3805 has been merged into master, can you please re-run repro.sh, gather the new diagnostic data and confirm that this issue is resolved? (That is, confirm that we see behavior like MongoDB 3.4.x, not 3.6.0).

Comment by Neha Khatri [ 15/Dec/17 ]

Attaching the diagnostic data obtained from running the repro script on MongoDB master (dd_mdb_master.zip) and on master with WT-3767 + WT-3768 (dd.wt3767_68.zip).

Comment by Alexander Gorrod [ 15/Dec/17 ]

Thanks neha.khatri. I have created WT-3805, which outlines an optimization that should address the issue of truncate pulling a lot of content into cache. We should rerun the workload to confirm once that ticket is resolved.

Comment by Neha Khatri [ 14/Dec/17 ]

I executed the repro script on MongoDB master and on master with the WT-3767 and WT-3768 changes. Both runs show a high cache dirty fill ratio after oplog truncation starts.

The problem persists even after applying the above two changes.

Comment by Alexander Gorrod [ 13/Dec/17 ]

neha.khatri It would be great if you could reproduce this with the latest MongoDB master, and then re-run with the changes from WT-3767 and WT-3768 applied to see whether they resolve or help the issue.

Comment by Daniel Gottlieb (Inactive) [ 05/Dec/17 ]

Copying some analysis over:

I can confirm this benchmark can cause bubbles where the insert rate falls sharply for ~2-3 seconds, correlating with a slow call to WT's log_flush. I'm not sure whether that call does some locking in addition to I/O; that would require finer-grained measurement.

Typically the journal flush takes 1-2 milliseconds. The bubble occurs when successive calls take 479, 2183, 1821, and 149 milliseconds. Assuming a 1-1 mapping of optime "incs" to inserts, the typical log flush advances the oplog read timestamp (not the oldest timestamp, but related) by ~1000 documents. The large flush times advanced the oplog read timestamp by ~200,000 documents.
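
A coarse client-side way to observe these bubbles, assuming the stall surfaces as slow journaled writes (this times end-to-end inserts with j:true rather than WT's log_flush itself; the collection name is illustrative):

    # Time successive journaled inserts; multi-second samples here would line
    # up with the slow log_flush calls described above.
    import time
    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    coll = MongoClient("mongodb://localhost:27017").test.get_collection(
        "flush_probe", write_concern=WriteConcern(j=True))
    while True:
        start = time.time()
        coll.insert_one({"probe": 1})
        elapsed_ms = (time.time() - start) * 1000
        if elapsed_ms > 100:
            print("slow journaled insert: %.0f ms" % elapsed_ms)
        time.sleep(0.05)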

Comment by Eric Milkie [ 05/Dec/17 ]

Yes, waiting for the log_flush call to complete. daniel.gottlieb has just confirmed this with his own local testing.

Comment by Bruce Lucas (Inactive) [ 05/Dec/17 ]

The following chart shows the lag in the optime metrics from replSetGetStatus - each "x lag" metric is the difference between that optime and the applied optime.

In particular, it shows that our durable optime lag exactly mirrors the range of pinned timestamps. milkie, does this confirm that the delay in updating the oldest timestamp is due to waiting for durability?
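
For reference, a sketch of how the lag metrics in the chart can be derived from replSetGetStatus; the field paths follow the 3.6-era "optimes" sub-document and should be treated as assumptions for other versions:

    # Compute each "x lag" as (applied optime - x optime) in seconds.
    from pymongo import MongoClient

    optimes = MongoClient("mongodb://localhost:27017").admin.command(
        "replSetGetStatus")["optimes"]
    applied = optimes["appliedOpTime"]["ts"].time
    for name, value in optimes.items():
        if name != "appliedOpTime" and isinstance(value, dict) and "ts" in value:
            print("%s lag: %d s" % (name, applied - value["ts"].time))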

Comment by Eric Milkie [ 05/Dec/17 ]

We need to keep timestamps around long enough for oplog visibility to work, and oplog visibility depends on log_flush. We might be able to do something complicated, but it would only apply to one-node replica sets.

Comment by Alexander Gorrod [ 05/Dec/17 ]

There is work scheduled in WT-3767 and WT-3768 that should reduce the overhead of using the lookaside file seen here.

Ideally MongoDB would update the oldest timestamp more regularly to avoid using the lookaside file at all in this situation - milkie, do you think that will be possible, or is log_flush the smallest unit of work required to update the timestamp?

Comment by Bruce Lucas (Inactive) [ 04/Dec/17 ]

This test is bottlenecked on disk, so yes it's conceivable that an fsync could block for some time, and/or there could be a delay in the WT log implementation. It should be simple enough to add some instrumentation to see if that is the case.

Comment by Eric Milkie [ 04/Dec/17 ]

Along the same lines as my earlier idea, perhaps the thread that updates MongoDB's all_committed time is blocked waiting for its log_flush("sync=on") call to return. Would there be a situation where this call would block for long periods?

Comment by Bruce Lucas (Inactive) [ 04/Dec/17 ]

  • alexander.gorrod, I'm indeed seeing good correlation between the range of timestamps pinned and lookaside activity in my repro.
  • milkie, I can't find evidence of stalled inserts in the server log. For example, at A the timestamp hasn't been updated for about 5 seconds, but we don't see any slow inserts logged ("logged slowest query durations"); a quick way to scan for them is sketched below.

I spotted some other possibly interesting correlations with lookaside activity, significance tbd.
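
A quick way to scan the server log for slow operations, assuming the common 3.x format where slow-op lines end with the duration (e.g. "... 1423ms"); the threshold is arbitrary:

    # Print mongod.log lines whose trailing duration exceeds a threshold.
    import re, sys

    THRESHOLD_MS = 1000
    for line in open(sys.argv[1]):
        m = re.search(r" (\d+)ms\s*$", line)
        if m and int(m.group(1)) >= THRESHOLD_MS:
            print(line.rstrip())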

Comment by Eric Milkie [ 04/Dec/17 ]

One possibility for why the oldest_timestamp lags sometimes is that the all_committed time isn't moving ahead – possibly due to one lagging transaction. But that would mean one of the 5 threads would have stalled.

Comment by Bruce Lucas (Inactive) [ 04/Dec/17 ]

I've attached a self-contained repro script. alexander.gorrod, differences from yours:

  • 4 GB cache
  • a considerably higher rate of inserts, possibly because the workload is I/O bound and my repro uses more compressible documents (see the sketch below)

I suspect the net result is higher cache pressure, which gives a stronger signal for the issue.
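
For comparison, a sketch of the two document shapes in question (sizes and field names are illustrative, not taken from the attached repro.sh):

    import os

    def compressible_doc(size=130 * 1024):
        return {"payload": "x" * size}        # repetitive text compresses well

    def incompressible_doc(size=130 * 1024):
        return {"payload": os.urandom(size)}  # random bytes barely compress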

Comment by Bruce Lucas (Inactive) [ 03/Dec/17 ]

The same effect is seen with a 50 GB oplog and SSD (vs. a 500 GB oplog and HDD for the test above). Starting at A, where the oplog rolls over and truncation begins, we see a large amount of dirty data and operation stalls.
