[SERVER-39424] Test that DDL operations can't succeed after prepared transactions are aborted during shutdown Created: 07/Feb/19 Updated: 08/Jan/24 Resolved: 30/Apr/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.11 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Samyukta Lanka | Assignee: | Dianna Hohensee (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | prepare_testing, txn_storage | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||
| Sprint: | Repl 2019-03-25, Repl 2019-04-08, Storage NYC 2019-05-06 | ||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||
| Description |
|
Once the UninterruptibleLockGuard work is completed and all UninterruptibleLockGuard s that conflict with prepared transactions are removed, we should verify that DDL operations are successfully killed and they cannot succeed after prepared transactions have been aborted during shutdown. |
| Comments |
| Comment by Githook User [ 01/May/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Author: {'email': 'justin.seyster@mongodb.com', 'name': 'Justin Seyster', 'username': 'jseyster'}Message: | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 30/Apr/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Author: {'email': 'dianna.hohensee@10gen.com', 'name': 'Dianna', 'username': 'DiannaHohensee'}Message: | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Dianna Hohensee (Inactive) [ 25/Apr/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
I've verified my reasoning is correct by running the create collection command concurrently with shutdown, such that it's lock request is enqueued behind the RSTL lock. The cmd failed in the lock acquisition code, as anticipated
However, I did so with a combination of fail points and server code sleepmillis calls, which does not have sufficient reliability to be committed. The problem is that we close the transportation layer during shutdown before we do anything with locks or interrupting operations. Also, we do not have the infrastructure to handle server shutdown in a parallel shell: my test did not exit gracefully. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Dianna Hohensee (Inactive) [ 24/Apr/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Yes, seems reasonable, I'll create such a test. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Judah Schvimer [ 24/Apr/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
I'd like to at the very least add a test that the above reasoning holds for some DDL operation. Does that seem reasonable? | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Dianna Hohensee (Inactive) [ 24/Apr/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
From what I can see in our code, I think I'm concluding that no work needs to be done for this ticket. This is the shutdown code of interest to us
Once the RSTL lock is enqueued, further CRUD/DDL ops may queue up behind it. We then mark all ongoing ops as killed by calling setKillAllOperations(), and any future new ops will also immediately be marked as killed. Then we wait to acquire the RSTL lock, ensuring that any ops with locks have finished. Now we only have to worry about lock requests enqueued behind/after the RSTL lock. It is not guaranteed that operations themselves will check for interrupt frequently enough to notice that they have been killed. However, the locking system does guarantee this, for lock requests that are not under UninterruptibleLockGuards.
_notify is a CondVarLockGrantNotification and CondVarLockGrantNotification::wait will check for interrupt before returning a successful LockResult
So any ops with lock requests enqueued after the RSTL will not successfully acquire locks for CRUD/DDL ops. I do not think ReplicationCoordinator::canAcceptWritesFor* is an appropriate invariant to make during shutdown to ensure no further writes occur because 1) that check is only intended to stop external clients 2) it is in a high level replication concept and should not be used to control lower level activity 3) ReplicationCoordinator::stepDown called during shutdown seems too racy to depend upon forcing (waiting forever might actually be forever, though I have not looked in depth). CC: milkie | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Judah Schvimer [ 24/Apr/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
It seems that it can only throw if it times out waiting for the lock or if another stepdown is in progress. Can we wait indefinitely for the lock (since we'll need it eventually to shutdown anyways) and continuously try to step down until it either succeeds or gets a NotMaster error? Alternatively, we could "do something to prevent writes from succeeding" after we release prepared locks. sSepdown was that thing, but if that doesn't work we could try setting canAcceptWritesFor directly, or add a shutdown check where we check canAcceptWritesFor in the shutdown path. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Dianna Hohensee (Inactive) [ 24/Apr/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
judah.schvimer, it appears that the ReplicationCoordinator::stepDown function can still throw errors if force=true is passed into it, so it can still fail. I don't see how we could safely invariant afterwards that canAcceptWritesFor is false, unless ReplicationCoordinator::stepDown can be changed, which is outside of my current expertise – it would take me a long while to figure out the replication code. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Judah Schvimer [ 23/Apr/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Yes, the above comment as you described is what we discussed with milkie would be the work on this ticket. It's not directly testing it, but it gives us confidence with the invariant and the existing test coverage around shutting down secondaries. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Dianna Hohensee (Inactive) [ 23/Apr/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
judah.schvimer, is what you delineate in the last comment the work you want done for this ticket? If I understand correctly, you'd like
This ticket is no longer about testing that DDL ops fail after prepared transactions are interrupted/aborted during shutdown? | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Judah Schvimer [ 05/Apr/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Now that we step down before shutdown, this is now identical to stepdown. The work on this ticket should make the stepdown in shutdownTask call force stepdown on failure here. We will then add an invariant that canAcceptWritesFor is set to false here after the stepdown completes. We will also do something to prevent reads on prepared documents from succeeding after the prepared transactions are aborted in shutdown. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Judah Schvimer [ 20/Mar/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
siyuan.zhou, good point. Using the stepdown procedure wouldn't completely protect us from improper uses of UninterruptibleLockGuards. If a DDL op with an UninterruptibleLockGuard queued behind the RSTL, then the RSTL could still be acquired by the shutdown. We would then yield prepared locks and after the shutdown released the RSTL, the DDL op could acquire a lock that should have conflicted with the prepared transaction and finish its operation erroneously. By this logic, improper UninterruptibleLockGuards would also be a problem during stepdown when we yield locks from prepared transactions after acquiring the RSTL. In that case though, we do change "canAcceptWritesFor" to false, so that will likely protect the DDL op from succeeding. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Siyuan Zhou [ 19/Mar/19 ] | |||||||||||||||||||||||||||||||||||||||||||
Do you mean the uninterruptible DDL are waiting for RSTL while the shutdown is running? We don't hold RSTL for the whole shutdown process though. After shutdown releases the RSTL, can these uninterruptible DDL ops continue?
Speaking of new operations, according to mira.carey@mongodb.com, ops that come in after setKillAllOperations are after we've shutdown the transport layer, so there won't be any new user ops. Internal processes after setKillAllOperations are possible though. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Judah Schvimer [ 19/Mar/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
If the DDL were not interruptible, then we would not be able to acquire the RSTL at all. Are you saying that if they were uninterruptible then "mark any new operations interrupted on their creation" would not suffice for preventing new operations? | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Siyuan Zhou [ 19/Mar/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
I feel I'm missing why RSTL is a barrier. If the DDL is not interruptible, can they continue after RSTL is released by shutdown? | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Judah Schvimer [ 19/Mar/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
I agree we don't need the loop or the reader/writer differentiation.
This is the main change I'm proposing. siyuan.zhou, why would we need to extend the scope of the RSTL? This would be using the RSTL as a barrier since we cannot acquire it while operations hold locks and new operations cannot acquire locks as you pointed out. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Siyuan Zhou [ 18/Mar/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
I don't see why we need to kill ops in a loop or treat readers and writers differently. Killing operations on shutdown will also mark any new operations interrupted on their creation. If we want to separate the killing of unprepared and the yielding of prepared transactions and move the latter to under RSTL, we also need to extend the scope of RSTL on shutdown, but up to where is still a question. Up to where we acquire the global lock? For testing, I think adding one test case for the above scenario is a good idea to verify the mechanism is right. Then the concurrency tests sound a good solution to cover this kind of bugs, given their past coverage on subtle concurrency bugs. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Judah Schvimer [ 15/Mar/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
This may get done anyways by | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Judah Schvimer [ 15/Mar/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Per conversation with milkie, the race seems to come from the fact that we yield prepared transaction locks before acquiring the RSTL. Stepdown, in contrast, aborts unprepared transactions and then acquires the RSTL, and then yields prepared transaction locks. suganthi.mani points out that stepdown is a bit overly-complex for shutdown. It loops killing operations to avoid killing readers, but shutdown wants to kill readers and writers. If we kill all operations on shutdown, then we don't need this looping behavior. I think we can do two things on this ticket. CC siyuan.zhou for any other thoughts. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Eric Milkie [ 15/Mar/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
I'm not sure the test would have value unless we included all possible operations that could be blocked by the prepare, to detect when a programmer might change any of their lock acquisitions to Uninterruptible. Even if we could, the test would not age gracefully, as we could always add new operations in the future and forget to add them to the test. How about this: Instead of trying to test for a situation that we hope never happens in the future, we could set the lock manager into a mode that would not allow any further lock requests to be granted after prepared transactions are aborted. That is the specific situation that we are trying to avoid here. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Judah Schvimer [ 15/Mar/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
milkie, do you think the following test adds value?
If index creation were uninterruptible, then it would succeed here after shutdown aborts prepared transactions. The index build would acquire the RSTL before shutdown and shutdown would be blocked on the index build to complete successfully. Any DDL op would work for this test. One problem is that this somewhat only tests that we kill all operations and the interruptibility of the specific DDL op we choose for the test. All of this will also be tested in SERVER-39994 for all DDL ops, but in a less targeted way. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Eric Milkie [ 14/Mar/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
I am having trouble imagining a test that would pass with today's code but fail if a DDL operation in the future hits an uninterruptible lock and would subsequently succeed at shutdown time, as per the ticket title. In particular, writing a test for detecting that some situation doesn't happen is tricky unless we have good traction on what the code logic would be when it does happen. I don't think we can proceed with this ticket until more details are added to the description. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Samyukta Lanka [ 11/Mar/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
I think tests like this might be useful to make sure that someone doesn't make one of these operations uninterruptible in the future without realizing they could conflict with prepared transactions. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Judah Schvimer [ 11/Mar/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
If they are interruptible and we interrupt them, how would they sneak in? Like what are we looking for? | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Samyukta Lanka [ 11/Mar/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
judah.schvimer we might want to go through all the places that had a conflicting UninterruptibleLockGuard and make sure those cannot sneak in during shutdown, but I'm not sure if you find that necessary. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Eric Milkie [ 11/Mar/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
The first thing shutdown does is go through the stepdown command codepath; therefore, your test proposal could be simplified as "step down the node" instead of "shut down the node". I'm unconvinced that we need to write separate shutdown tests, but if we do, we should write the stepdown test first, and then simply substitute "shutdown" for "stepdown" at the appropriate point in the test. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Samyukta Lanka [ 11/Mar/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
This came out of | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Judah Schvimer [ 08/Mar/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
I think this came out of | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Judah Schvimer [ 08/Mar/19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
samy.lanka, can you please clarify what prepare work this is intended to test since there is no shutdown related linked ticket and how you expect the test to be structured? |