[SERVER-36534] Don't acquire locks on oplog when writing oplog entries Created: 08/Aug/18 Updated: 29/Oct/23 Resolved: 24/Aug/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Storage |
| Affects Version/s: | None |
| Fix Version/s: | 4.0.4, 4.1.3 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Spencer Brody (Inactive) | Assignee: | Eric Milkie |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||||||||||
| Backport Requested: | v4.0, v3.6 |
| Sprint: | Storage NYC 2018-08-27 | ||||||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||||||
| Case: | (copied to CRM) | ||||||||||||||||||||||||||||||||
| Linked BF Score: | 36 | ||||||||||||||||||||||||||||||||
| Description |
|
Since the oplog can never be dropped, there's no need to hold an IX lock on the oplog when writing into it. |
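A minimal sketch of the reasoning above, using the textbook intent-lock compatibility matrix rather than the server's actual lock manager. The names `Mode`, `COMPATIBLE`, and `can_grant` are hypothetical and exist only for this illustration. It shows that an IX lock held by a writer only matters because it blocks S and X requests (such as the X lock a drop would need). If no such request can ever be made against the oplog, the IX lock protects against nothing.

```python
from enum import Enum


class Mode(Enum):
    """Lock modes in a classic multi-granularity locking scheme (toy model)."""
    IS = "IS"  # intent shared
    IX = "IX"  # intent exclusive
    S = "S"    # shared
    X = "X"    # exclusive


# Standard compatibility matrix: COMPATIBLE[requested] is the set of
# already-held modes that the requested mode can coexist with.
COMPATIBLE = {
    Mode.IS: {Mode.IS, Mode.IX, Mode.S},
    Mode.IX: {Mode.IS, Mode.IX},
    Mode.S: {Mode.IS, Mode.S},
    Mode.X: set(),
}


def can_grant(requested, held):
    """Return True if `requested` is compatible with every mode in `held`."""
    return all(h in COMPATIBLE[requested] for h in held)


# Concurrent oplog writers holding IX never block each other.
writers_coexist = can_grant(Mode.IX, [Mode.IX, Mode.IX])

# The only thing a held IX lock stops is an S or X request, e.g. a drop.
drop_blocked_by_writer = not can_grant(Mode.X, [Mode.IX])
```

Under this model, removing the writer's IX lock changes behavior only for operations that request S or X on the same resource. Whether any such operation other than a drop exists for the oplog is not stated in this ticket.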
| Comments |
| Comment by Githook User [ 30/Oct/18 ] | |||||||||||||
|
Author: {'name': 'Eric Milkie', 'email': 'milkie@10gen.com', 'username': 'milkie'} Message: (cherry picked from commit 5c1a3ec728a71bca81629f99be782ac305a6ad4b) | |||||||||||||
| Comment by Githook User [ 24/Aug/18 ] | |||||||||||||
|
Author: {'name': 'Eric Milkie', 'email': 'milkie@10gen.com', 'username': 'milkie'} Message: | |||||||||||||
| Comment by Michael Cahill (Inactive) [ 17/Aug/18 ] | |||||||||||||
|
I spent some time trying to reproduce this and ran into the problem that calling WT's verify method on the oplog always fails with EBUSY. That's because having a session open is preventing the stable timestamp from catching up with current. After closing all sessions and waiting for the logical session reaper to run, the stable timestamp does catch up. The verify still doesn't succeed (it hits a different EBUSY): I'll chase that some more next week. Once we can reproduce the symptom more easily, here is how I'd expect to catch it quickly:
| |||||||||||||
| Comment by Eric Milkie [ 15/Aug/18 ] | |||||||||||||
|
In testing for this, I'm hitting a problem with either cursor caching (in WiredTiger) or in verify. Continuing to pursue the issues. I'm hopeful that once they are resolved, we'll be able to push this code change and resolve the flavor of deadlock described in | |||||||||||||
| Comment by Spencer Brody (Inactive) [ 08/Aug/18 ] | |||||||||||||
|
Per conversation with milkie I am passing this off to the storage team. The core code change is small, but investigating the build failures that fall out from it requires expertise in the storage subsystem. | |||||||||||||
| Comment by Spencer Brody (Inactive) [ 08/Aug/18 ] | |||||||||||||
|
This is another potential fix for the deadlock described in |