[SERVER-57545] Stepping down while stepping up with a transaction prepared results in a broken node Created: 08/Jun/21  Updated: 13/Jul/21  Resolved: 13/Jul/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matthew Russotto Assignee: Vesselina Ratcheva (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File repro.SERVER-54545    
Issue Links:
Backports
Depends
Duplicate
duplicates SERVER-57756 Race between concurrent stepdowns and... Closed
Related
is related to SERVER-58440 Mark signalDrainComplete as noexcept Closed
Operating System: ALL
Backport Requested:
v5.0, v4.4, v4.2
Sprint: Repl 2021-06-28, Repl 2021-07-12, Repl 2021-07-26
Participants:

 Description   

It is possible for a stepdown to start, due to some other primary stepping up, while we are still holding the RSTL from a step-up attempt. If this happens while we have a transaction prepared, we will uassert when trying to check out a session to restore the prepared transaction's locks.

https://github.com/mongodb/mongo/blob/b9c4dc61d38edd4ae1c4953dbc646fac633d78d0/src/mongo/db/session_catalog_mongod.cpp#L271

The uassert will cause us to exit signalDrainComplete() without actually signalling that the drain is complete. At that point the oplog applier (and thus replication) will be stuck.

In addition to fixing this, we should probably mark signalDrainComplete() as "noexcept" so we crash instead of hanging if anything similar happens.
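The failure mode and the proposed "noexcept" safeguard can be sketched in isolation. This is a minimal illustration and not the server code: ApplierState, restorePreparedTxnLocks, signalDrainCompleteThrowing, signalDrainCompleteNoexcept and drainLeftIncomplete are all hypothetical names, and a std::runtime_error stands in for the uassert.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the applier's drain state; not MongoDB's real type.
struct ApplierState {
    bool drainComplete = false;
};

// Hypothetical stand-in for checking out a session to restore the prepared
// transaction's locks. It throws when a stepdown is already in progress,
// which is the role the uassert plays in the description above.
void restorePreparedTxnLocks(bool stepdownInProgress) {
    if (stepdownInProgress) {
        throw std::runtime_error("cannot check out session during stepdown");
    }
}

// Throwing variant: an exception skips the final assignment, so the drain is
// never signalled and anything waiting on it stays stuck.
void signalDrainCompleteThrowing(ApplierState& state, bool stepdownInProgress) {
    restorePreparedTxnLocks(stepdownInProgress);
    state.drainComplete = true;
}

// noexcept variant: if an exception tries to escape, the runtime calls
// std::terminate, so the process crashes instead of hanging.
void signalDrainCompleteNoexcept(ApplierState& state, bool stepdownInProgress) noexcept {
    restorePreparedTxnLocks(stepdownInProgress);
    state.drainComplete = true;
}

// Returns true when the drain was left unsignalled after the caller
// swallowed the exception.
bool drainLeftIncomplete(bool stepdownInProgress) {
    ApplierState state;
    try {
        signalDrainCompleteThrowing(state, stepdownInProgress);
    } catch (const std::exception&) {
        // The error is handled further up, but nothing signals the drain.
    }
    return !state.drainComplete;
}
```

In this sketch drainLeftIncomplete(true) reports the stuck state, while calling signalDrainCompleteNoexcept with a stepdown in progress would terminate the process rather than return. Marking the function noexcept does not fix the underlying race; it only makes the failure loud.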

