[SERVER-50949] waitUntilDurable call spanning rollback can set lastDurable ahead of journaling Created: 15/Sep/20  Updated: 29/Oct/23  Resolved: 15/Oct/20

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Bug Priority: Major - P3
Reporter: Tess Avitabile (Inactive) Assignee: Lingzhi Deng
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-47898 Advancing lastDurable irrespective of... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Execution Team 2020-10-19
Participants:

 Description   

A waitUntilDurable call can span the lifetime of a rollback and cause us to set lastDurable ahead of journaling as follows:

  • lastApplied = OpTime(2,2), lastDurable = OpTime(1,1)
  • Asynchronous thread decides to set lastDurable to OpTime(2,2)
  • Rollback occurs. lastApplied = OpTime(1,1), lastDurable = OpTime(1,1)
  • Apply oplog entry. lastApplied = OpTime(1,2), lastDurable = OpTime(1,1)
  • Apply oplog entry. lastApplied = OpTime(3,3), lastDurable = OpTime(1,1)
  • Asynchronous thread succeeds in setting lastDurable to OpTime(2,2). This is not a time in the oplog. Additionally, this implies that OpTime(1,2) is journaled, which it is not.

I don't think this can cause us to incorrectly respond to a client that a write is journaled or majority committed. However, it seems possible to violate internal invariants.
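
The race described above is a stale value captured before rollback being installed after it. Below is a minimal sketch (simplified stand-in types and locking, not server code) that replays the interleaving sequentially; in the server, the capture and the onDurable-style callback run on a separate thread from rollback and oplog application.

{code:c++}
// Minimal sketch of the interleaving from the description. OpTime is written as
// (term, timestamp) to match the notation above; all names here are illustrative.
#include <cassert>
#include <iostream>
#include <mutex>

struct OpTime {
    long long term;
    long long ts;
    bool operator==(const OpTime& o) const { return term == o.term && ts == o.ts; }
};

struct ReplState {
    std::mutex m;
    OpTime lastApplied{2, 2};
    OpTime lastDurable{1, 1};
} state;

// The async durability work captures lastApplied before the flush completes.
OpTime captureToken() {
    std::lock_guard<std::mutex> lk(state.m);
    return state.lastApplied;  // OpTime(2,2)
}

// Rollback resets both optimes, then new entries are applied.
void rollbackAndReapply() {
    std::lock_guard<std::mutex> lk(state.m);
    state.lastApplied = {1, 1};
    state.lastDurable = {1, 1};
    state.lastApplied = {1, 2};  // entries applied on the post-rollback branch
    state.lastApplied = {3, 3};
}

// Only afterwards does the stale token advance lastDurable.
void onDurable(const OpTime& token) {
    std::lock_guard<std::mutex> lk(state.m);
    state.lastDurable = token;  // (2,2): not in the oplog, and implies (1,2) is journaled
}

int main() {
    OpTime token = captureToken();
    rollbackAndReapply();
    onDurable(token);
    assert((state.lastDurable == OpTime{2, 2}));  // the bad state from the description
    std::cout << "lastDurable advanced to a rolled-back optime\n";
}
{code}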



 Comments   
Comment by Githook User [ 15/Oct/20 ]

Author: Lingzhi Deng <lingzhi.deng@mongodb.com> (ldennis)

Message: SERVER-50949: Return empty JournalListener token during rollback
Branch: master
https://github.com/mongodb/mongo/commit/9f52be8cf511d4fe0b2aa8eb20e18c47a5bf097d
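
For context, a hedged sketch of the idea in that commit message (not the actual patch; the real JournalListener interface and token type differ): while rollback is in progress, getToken() hands back an empty token, and onDurable() treats an empty token as "do not advance lastDurable".

{code:c++}
#include <optional>

struct OpTime { long long term; long long ts; };

// Illustrative stand-in for a JournalListener; rollbackInProgress would be driven
// by replication state in the server.
struct JournalListenerSketch {
    bool rollbackInProgress = false;
    OpTime lastApplied{2, 2};
    OpTime lastDurable{1, 1};

    std::optional<OpTime> getToken() {
        if (rollbackInProgress)
            return std::nullopt;  // empty token: nothing to advance later
        return lastApplied;
    }

    void onDurable(const std::optional<OpTime>& token) {
        if (!token)
            return;               // stale/empty token is ignored
        lastDurable = *token;
    }
};

int main() {
    JournalListenerSketch listener;
    listener.rollbackInProgress = true;
    auto token = listener.getToken();  // empty while rollback is in progress
    listener.onDurable(token);         // no-op: lastDurable stays at (1,1)
}
{code}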

Comment by Tess Avitabile (Inactive) [ 13/Oct/20 ]

That makes sense, and I'm curious to hear what you find!

Comment by Lingzhi Deng [ 09/Oct/20 ]

Thanks for the info. Now that I have looked more into SERVER-47898, I think I might be wrong. Even though the journal flusher is stopped and restarted, it is restarted immediately after recoverToStableTimestamp, which means the journal flusher is restarted before lastApplied is reset. So in the window between the restart and the reset of lastApplied/lastDurable, there could indeed be an async thread that reads the old lastApplied and spans the reset of lastApplied/lastDurable. In that case, it could use an old lastApplied when setting lastDurable after rollback.

I think my mistake was that I was mostly just reading the code around recoverToStableTimestamp. While it does not seem possible for the journal flusher to span the entire rollback, it does seem possible for it to start after recoverToStableTimestamp and span the reset of lastApplied/lastDurable. I will verify that.

If that's the case, I think we can either reset the lastApplied/lastDurable earlier or restart the journal flusher only after resetting lastApplied/lastDurable.
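
A sketch of that second option, with illustrative stand-ins rather than the real call sites: the key point is simply that the reset happens before the flusher can run again.

{code:c++}
#include <iostream>

// Hypothetical stand-ins for the relevant steps of rollback; the real functions
// live in the replication and storage layers.
void stopJournalFlusher()        { std::cout << "journal flusher stopped\n"; }
void recoverToStableTimestamp()  { std::cout << "recovered to stable timestamp\n"; }
void resetLastOpTimesFromOplog() { std::cout << "lastApplied/lastDurable reset\n"; }
void restartJournalFlusher()     { std::cout << "journal flusher restarted\n"; }

int main() {
    stopJournalFlusher();             // no async durability work from here on
    recoverToStableTimestamp();
    resetLastOpTimesFromOplog();      // reset happens before the flusher restarts...
    restartJournalFlusher();          // ...so it cannot capture a pre-rollback lastApplied
}
{code}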

Comment by Lingzhi Deng [ 09/Oct/20 ]

tess.avitabile Can you provide a patch showing how to reproduce that? I can probably take a look since I am already in this code. I saw your WIP patch in SERVER-47898. Do you remember which test to run? Thanks.

Comment by Tess Avitabile (Inactive) [ 09/Oct/20 ]

Yes, this makes sense. dianna.hohensee thought perhaps we don't shut down the JournalFlusher for long enough during rollback, but maybe that shouldn't matter, since the JournalFlusher doesn't retain any state between stopping and starting, and rollback holds a global write lock, so no new writes will occur.

I guess the next person who works on SERVER-47898 will have to look further to explain why advancing lastApplied when we advance lastDurable causes us to violate this invariant. What I observed in my patch was that after a rollback, we would attempt to set lastApplied to an OpTime with a higher term and a lower timestamp. This could occur if there's an asynchronous thread that decides to set lastDurable and spans across rollback, since it could set lastDurable (and lastApplied) to an OpTime ahead of the common point that no longer exists in the oplog after the rollback. Then the next time we set lastApplied as part of secondary oplog application, we would hit the invariant.
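
An illustrative sketch of the kind of check being tripped (not the exact server invariant): a consistent oplog never asks us to move lastApplied to an optime whose term goes forward while its timestamp goes backward, but a stale post-rollback lastApplied makes exactly that happen. The numbers below are made up for illustration.

{code:c++}
#include <iostream>

struct OpTime { long long term; long long ts; };

// Would the proposed advance violate the higher-term/lower-timestamp rule?
bool wouldViolateInvariant(const OpTime& current, const OpTime& proposed) {
    return proposed.term > current.term && proposed.ts < current.ts;
}

int main() {
    OpTime staleLastApplied{2, 10};  // pre-rollback optime re-installed by the async thread
    OpTime nextBatch{3, 5};          // first optime applied on the post-rollback branch
    if (wouldViolateInvariant(staleLastApplied, nextBatch))
        std::cout << "invariant would fire: higher term but lower timestamp\n";
}
{code}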

Comment by Lingzhi Deng [ 06/Oct/20 ]

As Dan mentioned above, since the journal flusher gets stopped before rollback can proceed and is only restarted after, I don't think the scenario is possible with the journal flusher. As part of stopStorageControls, we shut down the journal flusher synchronously. Therefore, it is not possible to have an "asynchronous thread decides to set lastDurable to OpTime(2,2)" lingering across the rollback boundary in the example above.

That said, there are other callers of waitUntilDurable() that bypass the journal flusher: flushAllFiles() and waitUntilUnjournaledWritesDurable(). Technically these two could span rollbacks.

For flushAllFiles(), it's currently called in fsync (errmsgRun and FSyncLockThread::run), rebuildIndexesForNamespace, createOplog(), and the fsync option for writeConcern. The fsync command, repair, and oplog creation all take a global lock, at least in IS mode. And I believe the "fsync option for writeConcern" is no longer used? Since rollback takes a global write lock, I don't think any of them can span rollback.

For waitUntilUnjournaledWritesDurable(), it is currently called in initializeReplSetStorage, ReplicationRecoveryImpl::_recoverFromUnstableCheckpoint (as part of recoverFromOplog), and in rollback_internal::syncFixUp (during RollbackViaRefetch). We don't need to worry about initializeReplSetStorage. recoverFromOplog is called either from startup repl recovery or from RollbackImpl::runRollback. For both recoverFromOplog in RTT and rollback_internal::syncFixUp in RollbackViaRefetch, we call resetLastOpTimesFromOplog at the end of rollback, so we don't need to worry about the waitUntilUnjournaledWritesDurable() calls during rollback because lastDurable and lastApplied will be reset before the rollback finishes.

Therefore, based on the code audit above, I don't think it is possible for an "asynchronous thread that decides to set lastDurable" to span rollback.

tess.avitabile dianna.hohensee Does the above make sense? I could be missing something too.

Comment by Dianna Hohensee (Inactive) [ 16/Sep/20 ]

The JournalFlusher doesn't get stopped on stepdown; we just play a dance to reset its behavior to non-primary mode, via a relatively complicated interruption sequence on stepdown. The JournalFlusher is only shut down and restarted momentarily for StorageEngine::recoverToStableTimestamp.

I've moved the JournalFlusher to be conceptually above the storage layer, to control the storage layer but not be a part of it. The argument is that the JournalFlusher exists to help replication manage the storage engine for performance gains and to ensure users don't hurt themselves by never doing j:true writes (I could word that better with more thought). Therefore, we still need the RecoveryUnit::waitUntilDurable() access point, though it might be nice to discourage its use somehow.

Comment by Daniel Gottlieb (Inactive) [ 16/Sep/20 ]

"I've moved all the waitUntilDurable() callers onto the JournalFlusher recently."

Ah interesting. Should we remove waitUntilDurable from the recovery unit to make sure no new callers get introduced?

Second:
If all the callers are in the journal flusher, and the journal flusher gets stopped before rollback can proceed and only restarted after, how is this scenario possible? Are we restarting the journal flusher before the oplog truncation part of rollback is performed (cappedTruncateAfter I believe)?

Comment by Dianna Hohensee (Inactive) [ 16/Sep/20 ]

daniel.gottlieb I've moved all the waitUntilDurable() callers onto the JournalFlusher recently. flushAllFiles() and waitUntilUnjournaledWritesDurable() calls are the only ways to bypass the JournalFlusher now, and the latter is only called during repl rollback/recovery. We moved callers onto the JournalFlusher so that setting the oplogTruncateAfterPoint can be interrupted on stepdown, because we want to clear the oplogTruncateAfterPoint on stepdown.

Comment by Daniel Gottlieb (Inactive) [ 16/Sep/20 ]

dianna.hohensee can you clarify what part of waitUntilDurable synchronizes with whether the JournalFlusher is running? If the onDurable call is what's advancing lastDurable, it seems that turning the journal flusher off would still allow for the lastDurable time to be advanced.

Comment by Dianna Hohensee (Inactive) [ 15/Sep/20 ]

We should investigate when during rollback lastApplied is unset/reset, and what opportunities the JournalFlusher has to pick up a lastApplied timestamp value before the reset. The JournalFlusher is turned off and on around calling recoverToAStableTimestamp on the storage engine. A solution might be to turn the JournalFlusher off for a larger duration of rollback, so it can't asynchronously grab anything mischievous.
