[SERVER-32139] Oplog truncation creates large amount of dirty data in cache

Created: 01/Dec/17 | Updated: 01/Mar/18 | Resolved: 28/Feb/18

| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.6.0-rc8 |
| Fix Version/s: | None |
| Type: | Bug |
| Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) |
| Assignee: | Luke Chen |
| Resolution: | Duplicate |
| Votes: | 0 |
| Labels: | RF36 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Operating System: | ALL |
| Sprint: | Storage Non-NYC 2018-03-12 |
| Linked BF Score: | 0 |
Description

1-node repl set, HDD, 500 GB oplog, 4 GB cache, 5 threads inserting 130 kB docs. Left is 3.6.0-rc8, right is 3.4.10:

[attached charts comparing the two runs]

At the point A where the oplog fills and oplog truncation begins in 3.6.0-rc8, we see large amounts of oplog being read into cache and large amounts of dirty data in cache, with resulting operation stalls. This does not occur at the corresponding point B when running on 3.4.10. FTDC data for the above two runs is attached.
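As a point of reference, the workload above can be approximated with a short pymongo script. This is a minimal sketch, assuming a local one-node replica set named rs0; the collection name and payload shape are chosen for illustration and are not taken from the attached repro script:

```python
# Minimal workload sketch: 5 threads inserting ~130 kB documents into a
# one-node replica set. Assumes mongod was started with a large oplog
# and a small WiredTiger cache (e.g. --oplogSize 512000
# --wiredTigerCacheSizeGB 4); runs until interrupted.
import threading
from pymongo import MongoClient

NUM_THREADS = 5
PAYLOAD = "x" * 130_000  # ~130 kB per document

def insert_loop():
    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
    coll = client.test.c
    while True:
        # Fresh dict each time: insert_one adds an _id to its argument.
        coll.insert_one({"payload": PAYLOAD})

threads = [threading.Thread(target=insert_loop) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```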
Comments
Comment by Luke Chen [ 28/Feb/18 ]
Comment by Luke Chen [ 28/Feb/18 ]

I re-ran the repro.sh script on an Ubuntu machine (with SSD) with a binary of the latest MongoDB master (git version: acd27aa33043cd1bdf94e9e8017001e0ea46a41a), which includes the changes from
Comment by Michael Cahill (Inactive) [ 27/Feb/18 ]

luke.chen, now that
Comment by Neha Khatri [ 15/Dec/17 ]

Attaching the diagnostic data obtained from the repro script execution on MongoDB master and on master with the WT-3767 and WT-3768 changes applied.
Comment by Alexander Gorrod [ 15/Dec/17 ]

Thanks neha.khatri. I have created
Comment by Neha Khatri [ 14/Dec/17 ]

I executed the repro script on MongoDB master and on MongoDB with the WT-3767 and WT-3768 changes applied. The problem persists even after applying the above two changes.
Comment by Alexander Gorrod [ 13/Dec/17 ]

neha.khatri It would be great if you could reproduce this with the latest version of MongoDB master, and then re-run with the changes from WT-3767 and WT-3768 applied to see whether they resolve or mitigate the issue.
Comment by Daniel Gottlieb (Inactive) [ 05/Dec/17 ]

Copying some analysis over:
Comment by Eric Milkie [ 05/Dec/17 ]

Yes, waiting for the log_flush call to complete. daniel.gottlieb has just confirmed this with his own local testing.
Comment by Bruce Lucas (Inactive) [ 05/Dec/17 ]

The following chart shows the lag in the optime metrics from replSetGetStatus - each "x lag" metric is the difference between that optime and the applied optime.

[attached chart of optime lags]

In particular it shows that indeed our durable optime lag exactly mirrors the range of pinned timestamps. milkie, does this confirm that the delay in updating the oldest timestamp is due to waiting for durability?
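For context, these lags can be pulled straight from the `optimes` sub-document that replSetGetStatus returns in 3.6. A minimal sketch, assuming pymongo and a local node (field names are as I recall them in 3.6 and worth double-checking against the server version):

```python
# Sketch: optime lags relative to the applied optime, from
# replSetGetStatus's "optimes" sub-document (MongoDB 3.6).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
optimes = client.admin.command("replSetGetStatus")["optimes"]

applied = optimes["appliedOpTime"]["ts"].time  # bson Timestamp seconds
for name in ("durableOpTime", "lastCommittedOpTime", "readConcernMajorityOpTime"):
    print(name, "lag:", applied - optimes[name]["ts"].time, "s")
```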
Comment by Eric Milkie [ 05/Dec/17 ]

We need to keep timestamps around long enough for oplog visibility to work, and oplog visibility depends on log_flush. We might be able to do something complicated, but it would only apply to one-node replica sets.
Comment by Alexander Gorrod [ 05/Dec/17 ]

There is work scheduled in

Ideally MongoDB would update the oldest timestamp more regularly to avoid using the lookaside file at all in this situation - milkie, do you think that will be possible, or is the log_flush the smallest unit of work required to update the timestamp?
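For what it's worth, the WiredTiger side of this is a single API call. A sketch using the Python bindings shipped in the WiredTiger source tree; the home directory, cache size, and timestamp value are illustrative:

```python
# Sketch: advancing the oldest timestamp through WiredTiger's API.
# Once oldest_timestamp moves forward, WiredTiger may discard older
# versions instead of keeping them pinned in cache (or spilling them
# to the lookaside file).
from wiredtiger import wiredtiger_open

conn = wiredtiger_open("WT_HOME", "create,cache_size=4GB")
ts = 1234  # illustrative; MongoDB derives this from the all_committed time
conn.set_timestamp("oldest_timestamp=" + format(ts, "x"))  # hex string
conn.close()
```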
Comment by Bruce Lucas (Inactive) [ 04/Dec/17 ]

This test is bottlenecked on disk, so yes, it's conceivable that an fsync could block for some time, and/or that there could be a delay in the WT log implementation. It should be simple enough to add some instrumentation to see if that is the case.
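One cheap client-side stand-in for that instrumentation: time writes issued with j:true, since they do not acknowledge until the journal (WiredTiger log) is flushed. A sketch assuming pymongo; the collection name and one-second threshold are arbitrary:

```python
# Sketch: surface slow journal flushes by timing j:true inserts, which
# block until the WiredTiger log is synced to disk.
import time
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
coll = client.test.get_collection("probe", write_concern=WriteConcern(j=True))

while True:
    start = time.monotonic()
    coll.insert_one({"payload": "x" * 130_000})
    elapsed = time.monotonic() - start
    if elapsed > 1.0:  # arbitrary stall threshold
        print(f"journaled insert blocked for {elapsed:.1f} s")
```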
Comment by Eric Milkie [ 04/Dec/17 ]

Along the same lines as my earlier idea, perhaps the thread that updates MongoDB's all_committed time is blocked waiting for its log_flush("sync=on") call to return. Would there be a situation where this call would block for long periods?
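For reference, that call maps to WT_SESSION::log_flush. A sketch through the WiredTiger Python bindings, assuming logging is enabled; the connection config and home directory are illustrative:

```python
# Sketch: a blocking log flush. With "sync=on" the call returns only
# after the log has been written and fsynced, so on a saturated disk
# it can block for a long time.
from wiredtiger import wiredtiger_open

conn = wiredtiger_open("WT_HOME", "create,log=(enabled)")
session = conn.open_session()
session.log_flush("sync=on")  # durable before returning
conn.close()
```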
Comment by Bruce Lucas (Inactive) [ 04/Dec/17 ]

I spotted some other possibly interesting correlations with lookaside activity; significance TBD.
Comment by Eric Milkie [ 04/Dec/17 ]

One possibility for why the oldest_timestamp sometimes lags is that the all_committed time isn't moving ahead - possibly due to one lagging transaction. But that would mean one of the 5 threads would have stalled.
Comment by Bruce Lucas (Inactive) [ 04/Dec/17 ]

I've attached a self-contained repro script. alexander.gorrod, differences from yours:

I suspect the net result is higher cache pressure, which gives a stronger signal for the issue.
Comment by Bruce Lucas (Inactive) [ 03/Dec/17 ]

The same effect is seen with a 50 GB oplog and SSD (vs 500 GB oplog and HDD for the test above). Starting at A, where the oplog rolls over and truncation begins, we see a large amount of dirty data and operation stalls.

[attached chart]