[SERVER-34129] Don't hold database or collection locks during oplog truncation Created: 26/Mar/18 Updated: 29/Oct/23 Resolved: 05/Jul/18
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage, WiredTiger |
| Affects Version/s: | None |
| Fix Version/s: | 3.6.7, 4.0.1, 4.1.1 |
| Type: | Improvement | Priority: | Critical - P2 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Xiangyu Yao (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | QFB, nyc |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v4.0, v3.6, v3.4 |
| Sprint: | Storage NYC 2018-06-18, Storage NYC 2018-07-02, Storage NYC 2018-07-16 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
Some customers are sensitive to the short stalls that can be caused by oplog truncation, which can be a moderately heavy operation. We could mitigate this issue by providing a parameter to tune the amount of oplog data truncated in a single operation. |
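If the tuning-parameter approach were pursued, it could be exposed as a server parameter. The sketch below is illustrative only: it assumes the pre-IDL `MONGO_EXPORT_SERVER_PARAMETER` macro from `server_parameters.h`, and the parameter name and default are invented for the example rather than anything this ticket shipped.

```cpp
// Hedged sketch: a hypothetical knob bounding how much a single oplog
// truncation pass may remove. Name and default are illustrative only.
#include "mongo/db/server_parameters.h"

namespace mongo {

// 0 would mean "no limit", i.e. the current behaviour of truncating a whole
// oplog stone (~1% of the oplog) in one WiredTiger truncate call.
MONGO_EXPORT_SERVER_PARAMETER(oplogTruncationMaxDocsPerPass, int, 0);

}  // namespace mongo
```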
| Comments |
| Comment by Githook User [ 20/Jul/18 ] |
|
Author: Xiangyu Yao (xy24) &lt;xiangyu.yao@mongodb.com&gt; Message: (cherry picked from commit 9008da291dd91dae2ecf0529e1fa8a31329389c5) |
| Comment by Githook User [ 09/Jul/18 ] |
|
Author: Xiangyu Yao (xy24) &lt;xiangyu.yao@mongodb.com&gt; Message: (cherry picked from commit 9008da291dd91dae2ecf0529e1fa8a31329389c5) |
| Comment by Githook User [ 05/Jul/18 ] |
|
Author: Xiangyu Yao (xy24) &lt;xiangyu.yao@mongodb.com&gt; Message: |
| Comment by Xiangyu Yao (Inactive) [ 22/Jun/18 ] |
|
Potential blocking commands/operations:
- AppendOplogNote takes a global X lock in 3.4, but an IX lock in 3.6 and 4.0.
- DropDatabase takes a global X lock in 3.4, 3.6 and 4.0.
|
| Comment by Xiangyu Yao (Inactive) [ 21/Jun/18 ] |
|
It's hard to remove all locking of the oplog collection and treat it specially. The main obstacle is catalog restart during rollback: we would need to hold a reference to the oplog collection across closeCatalog and restore it afterwards, which requires new logic specific to the oplog collection. The change also cannot easily be backported to v3.4 and v3.6 because their rollback logic differs. If the main issue here is that oplog truncation blocks oplog application on the secondary, we should opt out of the PBWM lock during oplog truncation. Furthermore, if we also don't want the truncation to block listCollections, which takes a database S lock, we could have oplog truncation take a Global IX lock (which still prevents restartCatalog) and skip the database/collection IX locks, roughly as sketched below. |
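A rough sketch of the locking shape proposed above, assuming the 4.0-era lock API and the ShouldNotConflictWithSecondaryBatchApplicationBlock helper (header path approximate); the function name and surrounding structure are illustrative, not the committed change:

```cpp
// Sketch only: opt out of PBWM and take just a Global IX lock for truncation.
#include "mongo/db/concurrency/d_concurrency.h"
#include "mongo/db/concurrency/should_not_conflict_block.h"  // path approximate
#include "mongo/db/operation_context.h"

namespace mongo {

void reclaimOplogSketch(OperationContext* opCtx) {
    // Don't conflict with the parallel batch writer mode (PBWM) lock, so a
    // long truncation cannot stall oplog application on a secondary.
    ShouldNotConflictWithSecondaryBatchApplicationBlock noPBWMConflict(opCtx->lockState());

    // Global IX only: strong enough to exclude restartCatalog (Global X), but
    // no database or collection locks, so e.g. listCollections (database S)
    // is not blocked while the truncate runs.
    Lock::GlobalLock globalLk(opCtx, MODE_IX, Date_t::max(), Lock::InterruptBehavior::kThrow);

    // ... find the next full oplog stone and truncate the range it covers ...
}

}  // namespace mongo
```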
| Comment by Bruce Lucas (Inactive) [ 11/Jun/18 ] |
|
OK, that sounds good. So maybe we should change the headline again to "Don't hold any locks during oplog truncation"? |
| Comment by Eric Milkie [ 11/Jun/18 ] |
|
An intent lock is actually not required here. We've been playing around with the idea that the oplog can be treated specially w.r.t. locking since it already has special properties applied to it (such as it can never be dropped), which allows us to relax the locking model a bit. |
| Comment by Bruce Lucas (Inactive) [ 11/Jun/18 ] |
|
If I understand correctly, that won't help with the original issue this ticket was opened for, which was higher latency on secondaries during oplog truncation on 3.4 and 3.6, will it? At least an intent lock will still be required, which will block the oplog application global lock, which in turn will block all other reads. (That particular problem won't exist in 4.0, though.) |
| Comment by Eric Milkie [ 11/Jun/18 ] |
|
We've decided the best way forward here is to change the code so that no database locks are held while oplog truncation is in progress, so that a long-running truncation cannot affect other operations. |
| Comment by Alexander Gorrod [ 04/Jun/18 ] |
|
michael.cahill The alternative WiredTiger API, limiting how much data a truncate will remove, feels like it would introduce ongoing maintenance issues, so I prefer your suggestion: "Another potential avenue for alleviating the symptom might be to not hold a database lock while doing the truncate?" milkie or geert.bosch, could you pick this one up? |
| Comment by Bruce Lucas (Inactive) [ 30/May/18 ] |
|
I suspect that generally customers who care about the small additional latencies we're talking about here probably don't have very bursty workloads, because that would make latencies unpredictable. |
| Comment by Michael Cahill (Inactive) [ 30/May/18 ] |
True, but we could return a result like WT_RESTART / EAGAIN to indicate that the application should keep calling truncate (with the same key range) until it is successful; a caller-side loop for that is sketched below. It would also be relatively easy for reclaimOplog to extrapolate from the total number of entries in the oplog, estimate how many fall in the 1% being truncated, and then interpolate to truncate in bite-sized pieces. This won't deal with very bursty workloads, but may be good enough for practical purposes. |
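A minimal caller-side sketch of that idea against the WiredTiger C API; note that WT_SESSION::truncate does not currently behave this way, so the WT_RESTART return used here is the hypothetical part:

```cpp
// Sketch only: keep calling truncate on the same key range until WiredTiger
// reports it has removed everything. Assumes (hypothetically) that truncate
// were changed to return WT_RESTART after removing only part of the range.
#include <wiredtiger.h>

int truncateOplogRangeInPieces(WT_SESSION* session, WT_CURSOR* start, WT_CURSOR* stop) {
    int ret;
    do {
        // 'start' and 'stop' are positioned cursors bounding the range to drop.
        ret = session->truncate(session, nullptr, start, stop, nullptr);
    } while (ret == WT_RESTART);  // hypothetical "partial progress, call again"
    return ret;  // 0 on success, otherwise a WiredTiger error code
}
```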
| Comment by Alexander Gorrod [ 09/May/18 ] |
The semantics feel quite strange if we try to push this problem down inside WiredTiger: the truncate API would need to mean "delete this data, or as much as you can within X milliseconds", and the caller wouldn't know what data had actually been deleted after the call completed. milkie Would it be practical to instead track more information inside an individual oplog stone via WiredTigerRecordStore::OplogStones::InsertChange, recording a marker as the stone is populated (e.g. one every 1000 records, or some other number), so that the truncate for a stone could be broken into several parts? A rough sketch follows below. The code in WiredTigerRecordStore::reclaimOplog would need to become more sophisticated about continuing to work when a reclaim operation completes only partially, but I think that's the case no matter which approach is taken. |
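A hedged sketch of what "more information inside an individual oplog stone" could look like; the struct, field, and constant names below are invented for illustration and do not match the actual WiredTigerRecordStore::OplogStones layout:

```cpp
// Illustrative only: a stone that remembers intermediate record ids as it is
// populated, so reclaimOplog could truncate one sub-range at a time.
#include <cstdint>
#include <vector>

#include "mongo/db/record_id.h"

namespace mongo {

struct OplogStoneSketch {
    int64_t records = 0;  // documents covered by this stone
    int64_t bytes = 0;    // bytes covered by this stone
    RecordId lastRecord;  // upper bound of the stone

    // Recorded every kRecordsPerMarker inserts (the value is arbitrary here),
    // giving reclaimOplog natural boundaries for partial truncation.
    static constexpr int64_t kRecordsPerMarker = 1000;
    std::vector<RecordId> intermediateMarkers;
};

}  // namespace mongo
```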
| Comment by Eric Milkie [ 16/Apr/18 ] |
|
I wonder if we could put a ceiling on the number of documents encapsulated between each Stone; this would prevent really large oplogs from being truncated too coarsely. It would possibly increase startup time, though, while the Stones are initialized. |
| Comment by Michael Cahill (Inactive) [ 16/Apr/18 ] |
|
milkie, I don't think truncate is doing a significant amount of I/O relative to the size of the truncation. The issue, as bruce.lucas pointed out in the linked ticket, is that "the truncation is done as a single unyielding operation, so it holds an intent lock... This stalls replication waiting for global W lock which in turn stalls everything else while the truncate runs." One option would be to keep the current OplogStones logic that tracks markers every ~1% of the oplog, but perform a series of truncate operations (yielding in between) to free each range; see the sketch below. We could also consider driving a maximum number of pages or a maximum time down into WT's truncate operation, to keep the MongoDB code more straightforward. |
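A rough sketch of the "series of truncates with yields" option; pickIntermediateBound, truncateRange, and yieldLocks are hypothetical helpers standing in for the real reclaimOplog machinery:

```cpp
// Sketch only: free one ~1% oplog stone as several smaller truncations,
// yielding between pieces so a slow pass cannot hold locks for its full length.
#include "mongo/db/operation_context.h"
#include "mongo/db/record_id.h"
#include "mongo/db/storage/write_unit_of_work.h"

namespace mongo {

// Hypothetical helpers, not real MongoDB functions.
RecordId pickIntermediateBound(const RecordId& start, const RecordId& stoneEnd);
void truncateRange(OperationContext* opCtx, const RecordId& start, const RecordId& end);
void yieldLocks(OperationContext* opCtx);

void reclaimStoneInPieces(OperationContext* opCtx, RecordId start, const RecordId& stoneEnd) {
    while (start < stoneEnd) {
        // Bound each piece (e.g. a few thousand records), capped at the stone's end.
        const RecordId pieceEnd = pickIntermediateBound(start, stoneEnd);

        WriteUnitOfWork wuow(opCtx);
        truncateRange(opCtx, start, pieceEnd);  // remove only [start, pieceEnd]
        wuow.commit();

        start = pieceEnd;
        yieldLocks(opCtx);  // let replication and other readers interleave
    }
}

}  // namespace mongo
```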
| Comment by Eric Milkie [ 13/Apr/18 ] |
|
michael.cahill, might this be a result of truncate using more I/O than expected? Are there changes we could make to improve this in WiredTiger itself? |
| Comment by Bruce Lucas (Inactive) [ 12/Apr/18 ] |
|
This may not require a parameter. What is really needed is to ensure that oplog truncation doesn't take too long while holding locks; truncating 1% of the oplog in a single WT truncate operation may violate this. If we could arrange our truncations in smaller batches so that each completes in a short time, that would be sufficient. |
| Comment by Ian Whalen (Inactive) [ 30/Mar/18 ] |
|
asya to talk with bruce.lucas about what the best solution is for the underlying problem. |