[SERVER-48695] setAppliedThrough must set orderedCommit=false on the recovery unit Created: 10/Jun/20 Updated: 29/Oct/23 Resolved: 11/Jun/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Storage |
| Affects Version/s: | None |
| Fix Version/s: | 4.0.20, 4.4.0-rc10, 4.2.9, 4.7.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matthew Russotto | Assignee: | Matthew Russotto |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Backport Requested: |
v4.4, v4.2, v4.0
|
||||||||
| Participants: | |||||||||
| Linked BF Score: | 20 | ||||||||
| Description |
|
When we do a timestamped write that may not be strictly ordered with respect to the oplog (that is, if it has a timestamp less than or equal to the latest optime in the oplog), we must set orderedCommit to false to ensure oplog visibility is updated. ReplicationConsistencyMarkers::setAppliedThrough does not do this. It might make sense to do this in StorageInterfaceImpl::_updateWithQuery to catch any others. |
| Comments |
| Comment by Githook User [ 23/Jun/20 ] |
|
Author: {'name': 'Matthew Russotto', 'email': 'matthew.russotto@10gen.com', 'username': 'mtrussotto'}Message: (cherry picked from commit e52733a000b23cde7dcd2b2fc765a8a1f1012165) |
| Comment by Githook User [ 23/Jun/20 ] |
|
Author: {'name': 'Matthew Russotto', 'email': 'matthew.russotto@10gen.com', 'username': 'mtrussotto'}Message: (cherry picked from commit e52733a000b23cde7dcd2b2fc765a8a1f1012165) |
| Comment by Githook User [ 17/Jun/20 ] |
|
Author: {'name': 'Matthew Russotto', 'email': 'matthew.russotto@10gen.com', 'username': 'mtrussotto'}Message: (cherry picked from commit e52733a000b23cde7dcd2b2fc765a8a1f1012165) |
| Comment by Githook User [ 11/Jun/20 ] |
|
Author: {'name': 'Matthew Russotto', 'email': 'matthew.russotto@10gen.com', 'username': 'mtrussotto'}Message: |
| Comment by Daniel Gottlieb (Inactive) [ 11/Jun/20 ] |
|
Thanks matthew.russotto, that makes sense. Delaying the oplog manager thread might have been necessary. I forgot the fact that the manager will never move backwards, so if the last hole was filled and advanced the oplog read timestamp, it's too late to make the hang happen. It's not sufficient to just hold back the all_durable timestamp. |
| Comment by Matthew Russotto [ 11/Jun/20 ] |
|
I think opening large windows may not have worked because in this case, the transactions don't overlap. We need to get the RSTL in X mode before stepping down, which means the primary writes are done by the time the problem write appears. The sequence must be Primary write with optime X completes. all_durable is now X. Opening the window on the problem write makes this more likely to happen, but the problem write still has to start within ~1ms of the last primary write (less if someone else is checking visibility, e.g. some secondary oplog reader), or it's too late. |
| Comment by Daniel Gottlieb (Inactive) [ 11/Jun/20 ] |
|
I went back to my patch; the invariant I set did capture this usage as potentially problematic. The problem was that there were quite a few replication calls that set a timestamp without setting orderedCommit(false). I was unable to correctly verify whether a timestamp, "ordered" transaction usage was benign or not. That said, I tried another approach that removed my verification and instead opened large windows on anything that was potentially problematic to allow a the oplog manager to see the hole. But that strategy relied on other writes that do trigger oplog visibility to be drained by (this problem write had to be the last thing to commit). My hypothesis is that there must be other asynchronous oplog visibility triggering writers in the system that nullified my attempt to open a window. |
| Comment by Daniel Gottlieb (Inactive) [ 10/Jun/20 ] |
That's what I was suggesting with:
That was actually a goal of Replicate before Journaling, but it was unattained. I'm unsure what the challenge was. |
| Comment by Matthew Russotto [ 10/Jun/20 ] |
|
Could the oplog no-holes point be determined by the oldest oplog timestamp in progress - 1, or last oplog timestamp written if none is in progress? (with complications for rollback)? If so maybe we could do it inline without an extra thread. But that would be a major change so we'd still need this for backport. |
| Comment by Daniel Gottlieb (Inactive) [ 10/Jun/20 ] |
|
Alternatively, we can flip the ordered commit default to false and have things opt-out of triggering visibility (e.g: secondaries writing oplog entries). It's also less clear the performance impact of having ordered commit. With replicate before journal, triggering visibility no longer does a journal flush. |
| Comment by Louis Williams [ 10/Jun/20 ] |
|
I attempted to run a patch build a few months ago that invariants all timestamps must be assigned in increasing order, or orderedCommit must be false. I had to explicitly set orderedCommit=false all over the place and there's still some redness. One of the places I had to change was in setAppliedThrough. We could potentially expand upon that invariant since it caught one bug, |
| Comment by Daniel Gottlieb (Inactive) [ 10/Jun/20 ] |
|
I think a more holistic fix would be going back to having MongoDB manage holes. The problem with WT is that it can't tell whether a transaction setting a timestamp is actually related to an oplog hole. I also don't believe we need the waitForAllWritesToBeVisible check there. IIRC, that was added because the "truncate oplog" call was subject to visibility (implicitly/unintentionally), and cappedTruncateAfter wouldn't delete the documents the that weren't visible to it! I think we changed that behavior (not that I see where/how). I firmly believe calling cappedTruncateAfter should be not subject to visibility requirements. When we did this visibility stuff for 3.6, the usage of timestamps was small and limited to things that actually touched the oplog. That's not the case now. We should rethink having WT do the tracking. But the fix in this ticket is obviously easier and less risky.... |
| Comment by Matthew Russotto [ 10/Jun/20 ] |
|
We'd have to have some way of whitelisting the ones which are allowed to be orderedCommit. daniel.gottlieb tried to write one but it didn't catch this for some reason. |
| Comment by Judah Schvimer [ 10/Jun/20 ] |
|
Is there an invariant we could add to catch this? |