- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Replication, Sharding
- None
- Fully Compatible
- ALL
- Sharding 2019-05-06, Sharding 2019-05-20, Sharding 2019-06-03
- 19
Replication step down requires the ReplicationStateTransitionLock (RSTL) in MODE_X and kills user operations, but it doesn't kill internal operations, such as those run by the collection range deleter. If the range deleter runs and enters a prepare conflict retry loop (which waits without yielding locks), it will hang until the prepared transaction modifying the data it is reading commits or aborts. The RSTL can't be taken in exclusive mode until the range deleter operation finishes, so during this time all step down attempts will time out waiting for the RSTL.
This should also be a problem for step up (and other operations that require the RSTL) and may be triggered by other internal operations that can read prepared data, but I've only seen this so far with step down and the range deleter. The step up case might be worse, because a prepared transaction can't commit or abort and unblock an internal operation if there's no primary.
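The blocking pattern described above can be illustrated with a small, self-contained Python simulation. This is only a toy model, not server code: `SharedExclusiveLock`, `simulate`, and the `txn_resolved` event are hypothetical names introduced here, and the sketch assumes the internal operation holds the RSTL in a non-exclusive mode while it waits on the prepare conflict.

```python
import threading


class SharedExclusiveLock:
    """Toy stand-in for the RSTL: many shared holders or one exclusive holder."""

    def __init__(self):
        self._cond = threading.Condition()
        self._shared = 0
        self._exclusive = False

    def acquire_shared(self):
        with self._cond:
            while self._exclusive:
                self._cond.wait()
            self._shared += 1

    def release_shared(self):
        with self._cond:
            self._shared -= 1
            self._cond.notify_all()

    def acquire_exclusive(self, timeout):
        """Return True if the lock was taken exclusively before the timeout."""
        with self._cond:
            free = self._cond.wait_for(
                lambda: self._shared == 0 and not self._exclusive, timeout
            )
            if free:
                self._exclusive = True
            return free

    def release_exclusive(self):
        with self._cond:
            self._exclusive = False
            self._cond.notify_all()


def simulate(step_down_timeout=0.2):
    """Return (first_attempt_ok, second_attempt_ok) for two step down attempts."""
    rstl = SharedExclusiveLock()
    txn_resolved = threading.Event()    # set when the prepared txn commits/aborts
    reader_holding = threading.Event()  # set once the internal op holds the lock

    def range_deleter():
        # Hypothetical internal operation: not killed by step down.
        rstl.acquire_shared()
        reader_holding.set()
        try:
            # Prepare conflict retry loop: waits WITHOUT yielding the lock.
            txn_resolved.wait()
        finally:
            rstl.release_shared()

    worker = threading.Thread(target=range_deleter)
    worker.start()
    reader_holding.wait()

    # Step down attempt while the internal op is stuck: times out.
    first = rstl.acquire_exclusive(step_down_timeout)

    # The prepared transaction resolves, which unblocks the internal op.
    txn_resolved.set()
    worker.join()

    # A later step down attempt can now take the lock exclusively.
    second = rstl.acquire_exclusive(step_down_timeout)
    if second:
        rstl.release_exclusive()
    return first, second
```

In this model the first exclusive acquisition fails for as long as the simulated prepared transaction is unresolved, and it succeeds only once the transaction resolves. In the step up case there would be no primary to set `txn_resolved` at all, which is why that case may be worse.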
- is related to
  - SERVER-39096 Prepared transactions and DDL operations can deadlock on a secondary, if a reader blocks on a prepared document (Closed)
  - SERVER-40586 step up instead of stepping down in stepdown suites (Closed)
  - SERVER-40641 Ensure TTL delete in prepare conflict retry loop does not block step down (Closed)
- related to
  - SERVER-40700 Deadlock between read prepare conflicts and state transitions (Closed)
  - SERVER-41035 Rollback should kill all user operations before taking RSTL lock in X. (Closed)