Type: Bug
Resolution: Won't Do
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: repl-shortlist
Assigned Teams: Replication
Operating System: ALL
Sprint: Repl 2025-03-17
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

When a node begins a stepup or stepdown, the replication coordinator enqueues a request to acquire the RSTL in X mode, and fasserts if it cannot acquire the lock within 30 seconds. We attempt to kill operations, but some internal threads are marked unkillable (example, example).

When one of these threads attempts to acquire the global lock, it calls into the _takeGlobalAndRSTLLocks function. Notably, the lock and ticket ordering is such that a locker takes the RSTL first, then acquires an execution control ticket, and then acquires the global lock. The deadline for acquiring the locks is set by the caller, but the default is Date_t::max(). Note that the ticket acquisition inherits this deadline, so by default a thread can wait for a ticket effectively indefinitely.
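The ordering and deadline inheritance described above can be sketched as a small model. This is an illustration only, not the server code: `take_global_and_rstl_locks` and `NO_DEADLINE` are hypothetical names, and `datetime.max` merely stands in for `Date_t::max()`.

```python
from datetime import datetime

# Assumption: stand-in for Date_t::max(), i.e. "no deadline".
NO_DEADLINE = datetime.max


def take_global_and_rstl_locks(deadline=NO_DEADLINE):
    """Hypothetical model of the acquisition order described in the ticket.

    Returns the list of (resource, deadline) steps in the order they are
    attempted. Nothing is actually locked here.
    """
    steps = []
    steps.append(("RSTL", deadline))         # 1. RSTL is taken first
    steps.append(("ticket", deadline))       # 2. ticket wait inherits the same deadline
    steps.append(("global lock", deadline))  # 3. global lock comes last
    return steps


if __name__ == "__main__":
    for resource, deadline in take_global_and_rstl_locks():
        print(f"{resource}: deadline={deadline}")
```

With the default argument, every step, including the ticket wait, carries the "no deadline" value, which is the condition the ticket calls out.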

If the execution control queue is overloaded, threads can stall while waiting to acquire a ticket. Normally, user operations are interruptible at this point, so the kill op thread kills them. Internal unkillable threads, however, can hang here for significant lengths of time while holding the RSTL (see the lock ordering above). This can cause the stepdown's RSTL acquisition to time out, crashing the node.
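The failure mode can be illustrated with a deliberately simplified timing model. It assumes a single thread that already holds the RSTL and is queued for a ticket when the stepdown begins; `stepdown_outcome` and `kill_latency_secs` are hypothetical names, and only the 30-second limit comes from the description above.

```python
# From the ticket: stepdown fasserts if the X-mode RSTL is not acquired in 30 seconds.
RSTL_TIMEOUT_SECS = 30


def stepdown_outcome(ticket_wait_secs, killable, kill_latency_secs=0.0):
    """Return "ok" or "fassert" for a stepdown racing one ticket-queued thread.

    Simplification: the thread releases the RSTL as soon as its ticket wait
    ends or it is killed; time spent doing work afterwards is ignored.
    """
    if killable:
        # An interruptible operation is killed while queued and lets go of the RSTL.
        rstl_held_for = min(ticket_wait_secs, kill_latency_secs)
    else:
        # An unkillable thread keeps the RSTL for the whole ticket wait.
        rstl_held_for = ticket_wait_secs
    return "ok" if rstl_held_for < RSTL_TIMEOUT_SECS else "fassert"


if __name__ == "__main__":
    print(stepdown_outcome(120, killable=True))   # user operation stuck in the queue
    print(stepdown_outcome(120, killable=False))  # unkillable internal thread
```

In this model a two-minute ticket stall is harmless for a killable operation but is fatal when the stalled thread is unkillable, because the RSTL stays held past the 30-second limit.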

This was surfaced via customer AFs. I don't think we have any passthrough suites or integration tests that exercise stalled ticket queues. I'm also not certain why we acquire a ticket after we take the RSTL. I wonder if we can swap that ordering.

related to

SERVER-70127 Default system operations to be killable by stepdown

Closed

Assignee: Sean Zimmerman
Reporter: Ali Mir
Participants: Ali Mir, Sean Zimmerman
Votes: 0
Watchers: 16

Created: Feb 28 2025 06:31:48 PM UTC
Updated: Apr 11 2025 08:38:59 PM UTC
Resolved: Mar 10 2025 06:33:02 PM UTC
