Type: Bug
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: 7.0.0, 8.0.0
Component/s: None
Catalog and Routing
CAR Team 2025-05-12
3
TBD
The setAlwaysInterruptAtStepDownOrUp_UNSAFE() function on the operation context is unsafe in the sense that if a stepdown happens while that function is running, the stepdown (and its interruption) is missed.
If a component tries to acquire the DDL lock after the node has become a secondary, it hangs for 5 minutes (the default timeout for DDL lock acquisition), because the acquisition waits for the recovery of the ShardingDDLCoordinator, which will never happen now that the node is a secondary.
Note that this can only happen if the node became a primary, started recovery, and became a secondary again while that recovery was still in progress.
On 8.1+ this is fixed by having the ShardingDDLCoordinator implement a Recoverable interface; that implementation is always right about the actual state of the node (it triggers on the change from Recovering instead of waiting for Recovered). For further information, see SERVER-90371.
This ticket is to fix the issue on 8.0 and 7.0 in a simpler way, with an "optimistic double check lock".
First we wait for DDL coordinator recovery with a relatively small timeout (100ms, but configurable). If that wait times out, we double-check whether we are still the primary by taking the RSTL (through the global lock) and checking our role. If we are no longer the primary, we can interrupt the DDL lock acquisition. If we are still the primary, we can wait for the DDL lock acquisition for a longer time.
Before waiting for the recovery, we have to make sure the context is marked as interruptible on stepdown. After the primary check, we can be sure that we won't miss any stepdown interrupts.
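The control flow described above could be sketched roughly as follows. This is a standalone illustration, not server code: NodeState, AcquireResult, and waitForRecoveryThenAcquire are hypothetical names, isPrimary() merely stands in for "take the RSTL through the global lock and check our role", and marking the operation context as interruptible on stepdown is not modeled at all.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the node's replication and recovery state.
class NodeState {
public:
    void setPrimary(bool primary) {
        std::lock_guard<std::mutex> lk(_mutex);
        _primary = primary;
    }

    void markRecovered() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _recovered = true;
        }
        _cv.notify_all();
    }

    // Waits up to `timeout` for recovery; returns true if recovery completed.
    bool waitForRecovery(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(_mutex);
        return _cv.wait_for(lk, timeout, [this] { return _recovered; });
    }

    // Assumption: stands in for taking the RSTL and checking the node's role.
    bool isPrimary() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _primary;
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _primary = false;
    bool _recovered = false;
};

enum class AcquireResult { kAcquired, kInterruptedNotPrimary, kTimedOut };

AcquireResult waitForRecoveryThenAcquire(NodeState& node,
                                         std::chrono::milliseconds shortTimeout,
                                         std::chrono::milliseconds longTimeout) {
    // Step 1: optimistic wait with a small timeout.
    if (node.waitForRecovery(shortTimeout)) {
        return AcquireResult::kAcquired;
    }
    // Step 2: the short wait timed out, so double-check the role.
    if (!node.isPrimary()) {
        // Recovery will never finish on a secondary; fail fast instead of hanging.
        return AcquireResult::kInterruptedNotPrimary;
    }
    // Step 3: still the primary, so a longer wait is justified.
    return node.waitForRecovery(longTimeout) ? AcquireResult::kAcquired
                                             : AcquireResult::kTimedOut;
}
```

In this sketch a node that stepped down during recovery returns kInterruptedNotPrimary after roughly the short timeout, rather than blocking for the full long timeout.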