[SERVER-50949] waitUntilDurable call spanning rollback can set lastDurable ahead of journaling Created: 15/Sep/20 Updated: 29/Oct/23 Resolved: 15/Oct/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | None |
| Fix Version/s: | 4.9.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Tess Avitabile (Inactive) | Assignee: | Lingzhi Deng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Sprint: | Execution Team 2020-10-19 | ||||||||
| Participants: | |||||||||
| Description |
|
A waitUntilDurable call can span the lifetime of a rollback and cause us to set lastDurable ahead of journaling as follows:
I don't think this can cause us to incorrectly respond to a client that a write is journaled or majority committed. However, it seems possible to violate internal invariants. |
| Comments |
| Comment by Githook User [ 15/Oct/20 ] |
|
Author: {'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}Message: |
| Comment by Tess Avitabile (Inactive) [ 13/Oct/20 ] |
|
That makes sense, and I'm curious to hear what you find! |
| Comment by Lingzhi Deng [ 09/Oct/20 ] |
|
Thanks for the info. Now that I looked more into I think the mistake I had was that I mostly was just reading at the code around recoverToStableTimestamp. While it seems not possible for the journal flusher to span across the entire rollback, but it seems possible for the journal flusher to start after recoverToStableTimestamp and span across the reset of lastApplied/lastDurable. I will verify that. If that's the case, I think we can either reset the lastApplied/lastDurable earlier or restart the journal flusher only after resetting lastApplied/lastDurable. |
| Comment by Lingzhi Deng [ 09/Oct/20 ] |
|
tess.avitabile Can you provide a patch on how to reproduce that? I can probably take a look since I am already in this code. I saw your WIP patch in |
| Comment by Tess Avitabile (Inactive) [ 09/Oct/20 ] |
|
Yes, this makes sense. dianna.hohensee thought perhaps we don't shut down the JournalFlusher for long enough during rollback, but maybe it shouldn't matter if the JournalFlusher doesn't retain any state between stopping and starting, and rollback is holding a global write lock, so no new writes will occur. I guess the next person who works on |
| Comment by Lingzhi Deng [ 06/Oct/20 ] |
|
As Dan mentioned about, as the journal flusher gets stopped before rollback can proceed and only restarted after, I dont think the scenario is possible with the journal flusher. As part of the stopStorageControls, we shut down the journal flusher synchronously. Therefore, it is not possible to have an "asynchronous thread decides to set lastDurable to OpTime(2,2)" lingering around across the rollback boundary in the above example. That said, there are other callers of waitUntilDurable() that bypass the journal flusher. They are flushAllFiles() and waitUntilUnjournaledWritesDurable(). And technically these two could be spanning rollbacks. For flushAllFiles(), its currently called in fsync (errmsgRun and FSyncLockThread::run), rebuildIndexesForNamespace, createOplog() and fsync option for writeConcern. The fsync command, repair and oplog creation all take a global lock at last in IS mode. And I believe the "fsync option for writeConcern" is no longer used? Since the rollback takes global write lock, I don't think any of them can span across rollback. For waitUntilUnjournaledWritesDurable(), it is currently called in initializeReplSetStorage, ReplicationRecoveryImpl::_recoverFromUnstableCheckpoint(as part of recoverFromOplog) and in rollback_internal::syncFixUp (during RollbackViaRefetch). We dont need to worry about initializeReplSetStorage. And recoverFromOplog is called either from start up repl recovery or from RollbackImpl::runRollback. For both recoverFromOplog in RTT and rollback_internal::syncFixUp in RollbackViaRefetch, we call resetLastOpTimesFromOplog at the end of rollback and thus we don't need to worry about the waitUntilUnjournaledWritesDurable() calls during rollback because the lastDurable and lastApplied will be reset before the rollback finishes. Therefore, based on the code auditing above, I don't think it is possible for an "asynchronous thread that decides to set lastDurable" to span across rollback. tess.avitabile dianna.hohensee Does the above make sense? But I could be missing something too. |
| Comment by Dianna Hohensee (Inactive) [ 16/Sep/20 ] |
|
The JournalFlusher doesn't get stopped on stepdown, we just play a dance to reset its behavior to non-primary mode, via a relatively complicated interruption sequence on stepdown. The JournalFlusher is only shutdown and restarted momentarily for StorageEngine::recoverToStableTimestamp. I've moved the JournalFlusher to be conceptually above the storage layer, to control the storage layer but not be a part of it. The argument is that the JournalFlusher exists to help replication manage the storage engine for performance gains, and ensure users don't hurt themselves by never doing j:true writes – I could word better with more thought |
| Comment by Daniel Gottlieb (Inactive) [ 16/Sep/20 ] |
Ah interesting. Should we remove waitUntilDurable from the recovery unit to make sure no new callers get introduced? Second: |
| Comment by Dianna Hohensee (Inactive) [ 16/Sep/20 ] |
|
daniel.gottlieb I've moved all the waitUntilDurable() callers onto the JournalFlusher recently. flushAllFiles() and waitUntilUnjournaledWritesDurable() calls are the only ways to bypass the JournalFlusher now, the latter of which is only called during repl rollback/recovery. We moved callers onto the JournalFlusher, so we could ensure interruption of setting the oplogTruncateAfterPoint on stepdown, because we want to clear the oplogTruncateAfterPoint on stepdown. |
| Comment by Daniel Gottlieb (Inactive) [ 16/Sep/20 ] |
|
dianna.hohensee can you clarify what part of waitUntilDurable synchronizes with whether the JournalFlusher is running? If the onDurable call is what's advancing lastDurable, it seems that turning the journal flusher off would still allow for the lastDurable time to be advanced. |
| Comment by Dianna Hohensee (Inactive) [ 15/Sep/20 ] |
|
We should investigate when during rollback lastApplied is unset/reset, and what opportunities the JournalFlusher has to pick up a lastApplied timestamp value before the reset. The JournalFlusher is turned off and on around calling recoverToAStableTimestamp on the storage engine. A solution might be to turn the JournalFlusher off for a larger duration of rollback, so it can't asynchronously grab anything mischievous. |