[SERVER-48695] setAppliedThrough must set orderedCommit=false on the recovery unit Created: 10/Jun/20  Updated: 29/Oct/23  Resolved: 11/Jun/20

Status: Closed
Project: Core Server
Component/s: Replication, Storage
Affects Version/s: None
Fix Version/s: 4.0.20, 4.4.0-rc10, 4.2.9, 4.7.0

Type: Bug Priority: Major - P3
Reporter: Matthew Russotto Assignee: Matthew Russotto
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4, v4.2, v4.0
Participants:
Linked BF Score: 20

 Description   

When we do a timestamped write that may not be strictly ordered with respect to the oplog (that is, if it has a timestamp less than or equal to the latest optime in the oplog), we must set orderedCommit to false to ensure oplog visibility is updated. ReplicationConsistencyMarkers::setAppliedThrough does not do this.

It might make sense to do this in StorageInterfaceImpl::_updateWithQuery to catch any others.



 Comments   
Comment by Githook User [ 23/Jun/20 ]

Author:

{'name': 'Matthew Russotto', 'email': 'matthew.russotto@10gen.com', 'username': 'mtrussotto'}

Message: SERVER-48695 setAppliedThrough must set orderedCommit=false on the recovery unit

(cherry picked from commit e52733a000b23cde7dcd2b2fc765a8a1f1012165)
Branch: v4.2
https://github.com/mongodb/mongo/commit/26628142b33deded802f1635990915d55970a48d

Comment by Githook User [ 23/Jun/20 ]

Author:

{'name': 'Matthew Russotto', 'email': 'matthew.russotto@10gen.com', 'username': 'mtrussotto'}

Message: SERVER-48695 setAppliedThrough must set orderedCommit=false on the recovery unit

(cherry picked from commit e52733a000b23cde7dcd2b2fc765a8a1f1012165)
Branch: v4.0
https://github.com/mongodb/mongo/commit/3da338ad27cb52475f48696bd7d88ee55576c4dc

Comment by Githook User [ 17/Jun/20 ]

Author:

{'name': 'Matthew Russotto', 'email': 'matthew.russotto@10gen.com', 'username': 'mtrussotto'}

Message: SERVER-48695 setAppliedThrough must set orderedCommit=false on the recovery unit

(cherry picked from commit e52733a000b23cde7dcd2b2fc765a8a1f1012165)
Branch: v4.4
https://github.com/mongodb/mongo/commit/8f4ebc760ca191898c5405493b2663eaa125c57a

Comment by Githook User [ 11/Jun/20 ]

Author:

{'name': 'Matthew Russotto', 'email': 'matthew.russotto@10gen.com', 'username': 'mtrussotto'}

Message: SERVER-48695 setAppliedThrough must set orderedCommit=false on the recovery unit
Branch: master
https://github.com/mongodb/mongo/commit/e52733a000b23cde7dcd2b2fc765a8a1f1012165

Comment by Daniel Gottlieb (Inactive) [ 11/Jun/20 ]

Thanks matthew.russotto, that makes sense. Delaying the oplog manager thread might have been necessary. I forgot the fact that the manager will never move backwards, so if the last hole was filled and advanced the oplog read timestamp, it's too late to make the hang happen. It's not sufficient to just hold back the all_durable timestamp.

Comment by Matthew Russotto [ 11/Jun/20 ]

I think opening large windows may not have worked because in this case, the transactions don't overlap. We need to get the RSTL in X mode before stepping down, which means the primary writes are done by the time the problem write appears. The sequence must be

Primary write with optime X completes. all_durable is now X.
Visibility CV is triggered.
The visibility thread reaches somewhere here http://morningstar/mongodb_v4.4/source/src/mongo/db/storage/wiredtiger/wiredtiger_oplog_manager.cpp#191-224
The orderedCommit=true write starts with optime X. all_durable is now X-1
The visibility thread reads all_durable http://morningstar/mongodb_v4.4/source/src/mongo/db/storage/wiredtiger/wiredtiger_oplog_manager.cpp#228
The orderedCommit=true write completes. all_durable is now X, but no trigger so we're stuck.

Opening the window on the problem write makes this more likely to happen, but the problem write still has to start within ~1ms of the last primary write (less if someone else is checking visibility, e.g. some secondary oplog reader), or it's too late.

Comment by Daniel Gottlieb (Inactive) [ 11/Jun/20 ]

I went back to my patch; the invariant I set did capture this usage as potentially problematic. The problem was that there were quite a few replication calls that set a timestamp without setting orderedCommit(false). I was unable to correctly verify whether a timestamp, "ordered" transaction usage was benign or not.

That said, I tried another approach that removed my verification and instead opened large windows on anything that was potentially problematic to allow a the oplog manager to see the hole. But that strategy relied on other writes that do trigger oplog visibility to be drained by (this problem write had to be the last thing to commit). My hypothesis is that there must be other asynchronous oplog visibility triggering writers in the system that nullified my attempt to open a window.

Comment by Daniel Gottlieb (Inactive) [ 10/Jun/20 ]

Could the oplog no-holes point be determined by the oldest oplog timestamp in progress - 1, or last oplog timestamp written if none is in progress?

That's what I was suggesting with:

I think a more holistic fix would be going back to having MongoDB manage holes.

If so maybe we could do it inline without an extra thread.

That was actually a goal of Replicate before Journaling, but it was unattained. I'm unsure what the challenge was.

Comment by Matthew Russotto [ 10/Jun/20 ]

Could the oplog no-holes point be determined by the oldest oplog timestamp in progress - 1, or last oplog timestamp written if none is in progress? (with complications for rollback)? If so maybe we could do it inline without an extra thread. But that would be a major change so we'd still need this for backport.

Comment by Daniel Gottlieb (Inactive) [ 10/Jun/20 ]

Alternatively, we can flip the ordered commit default to false and have things opt-out of triggering visibility (e.g: secondaries writing oplog entries).

It's also less clear the performance impact of having ordered commit. With replicate before journal, triggering visibility no longer does a journal flush.

Comment by Louis Williams [ 10/Jun/20 ]

I attempted to run a patch build a few months ago that invariants all timestamps must be assigned in increasing order, or orderedCommit must be false. I had to explicitly set orderedCommit=false all over the place and there's still some redness. One of the places I had to change was in setAppliedThrough. We could potentially expand upon that invariant since it caught one bug, SERVER-48565, and would have caught this had I noticed the problem at the time.

Comment by Daniel Gottlieb (Inactive) [ 10/Jun/20 ]

I think a more holistic fix would be going back to having MongoDB manage holes. The problem with WT is that it can't tell whether a transaction setting a timestamp is actually related to an oplog hole.

I also don't believe we need the waitForAllWritesToBeVisible check there. IIRC, that was added because the "truncate oplog" call was subject to visibility (implicitly/unintentionally), and cappedTruncateAfter wouldn't delete the documents the that weren't visible to it! I think we changed that behavior (not that I see where/how). I firmly believe calling cappedTruncateAfter should be not subject to visibility requirements.

When we did this visibility stuff for 3.6, the usage of timestamps was small and limited to things that actually touched the oplog. That's not the case now. We should rethink having WT do the tracking.

But the fix in this ticket is obviously easier and less risky....

Comment by Matthew Russotto [ 10/Jun/20 ]

We'd have to have some way of whitelisting the ones which are allowed to be orderedCommit. daniel.gottlieb tried to write one but it didn't catch this for some reason.

Comment by Judah Schvimer [ 10/Jun/20 ]

Is there an invariant we could add to catch this?

Generated at Thu Feb 08 05:17:51 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.