-
Type: Bug
-
Resolution: Won't Do
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: None
-
Replication
-
ALL
-
Repl 2025-03-17
When a node begins a stepup or stepdown, the replication coordinator enqueues a request for the RSTL (replication state transition lock) in X mode and fasserts if it cannot acquire the lock within 30 seconds. We attempt to kill operations, but some internal threads are marked unkillable (example, example).
When one of these threads attempts to acquire the global lock, it calls into the _takeGlobalAndRSTLLocks function. Notably, the lock and ticket ordering is such that a locker takes the RSTL first, then acquires an execution control ticket, and then acquires the global lock. The deadline for acquiring the locks is set by the caller, but the default is Date_t::max(). The deadline for acquiring a ticket inherits this value, so by default a thread will wait for a ticket indefinitely.
If the execution control queue is overloaded, threads can stall while waiting to acquire a ticket. User operations are normally interruptible at this point, so the kill op thread can kill them. Internal unkillable threads, however, can hang here for a long time while still holding the RSTL (see the lock ordering above). This can cause the stepdown's RSTL acquisition to time out, crashing the node.
This was surfaced via customer AFs. I don't think we have any passthrough suites or integration tests that test stalled ticket queues. I'm also not certain why we acquire a ticket after we take the RSTL. I wonder if we can swap that ordering.
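To make the ordering problem concrete, here is a minimal toy model of the scenario described above. It is only a sketch and not server code: `ToyNode`, `unkillableWorker`, and `stepdownAcquiresRstl` are hypothetical names, the RSTL and ticket pool are modeled with one mutex and a condition variable, a short timeout stands in for the 30 second limit, and a failed acquisition returns false where the real node would fassert. The "ticket first" branch models the ordering swap that is only being wondered about here, not an existing behavior.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Toy model only: none of these names are real server APIs.
struct ToyNode {
    std::mutex m;
    std::condition_variable cv;
    int rstlSharedHolders = 0;  // threads holding the toy RSTL in a non-exclusive mode
    int freeTickets = 0;        // 0 models an exhausted execution control ticket pool
    bool workerParked = false;  // set once the worker reaches its ticket wait
};

// Models an unkillable internal thread: its ticket wait has no deadline
// and nothing can interrupt it.
void unkillableWorker(ToyNode& n, bool rstlFirst) {
    std::unique_lock<std::mutex> lk(n.m);
    if (rstlFirst) ++n.rstlSharedHolders;   // ordering described in this ticket
    n.workerParked = true;
    n.cv.notify_all();
    n.cv.wait(lk, [&] { return n.freeTickets > 0; });  // unbounded ticket wait
    --n.freeTickets;
    if (!rstlFirst) ++n.rstlSharedHolders;  // hypothetical swapped ordering
    // ... the global lock and the actual work would go here ...
    --n.rstlSharedHolders;
    ++n.freeTickets;
    n.cv.notify_all();
}

// Models the stepdown side: wait up to `timeout` for the toy RSTL to have no
// holders. For simplicity the exclusive lock is not actually held afterwards.
bool stepdownAcquiresRstl(bool rstlFirst, std::chrono::milliseconds timeout) {
    ToyNode n;
    std::thread worker(unkillableWorker, std::ref(n), rstlFirst);
    bool acquired = false;
    {
        std::unique_lock<std::mutex> lk(n.m);
        n.cv.wait(lk, [&] { return n.workerParked; });
        acquired = n.cv.wait_for(lk, timeout,
                                 [&] { return n.rstlSharedHolders == 0; });
        n.freeTickets = 1;  // only now does a ticket free up, so the worker can exit
    }
    n.cv.notify_all();
    worker.join();
    return acquired;
}
```

In this model, `stepdownAcquiresRstl(true, ...)` times out because the worker is parked in the ticket wait while holding the RSTL, whereas `stepdownAcquiresRstl(false, ...)` succeeds because the parked worker holds nothing the stepdown needs. The model ignores the global lock and any other reason the real ordering may exist, so it says nothing about whether the swap is safe in the server.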
- related to: SERVER-70127 Default system operations to be killable by stepdown (Closed)