Type: Task
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: None
Replication
Fully Compatible
Repl 2025-03-17, Repl 2025-03-31, Repl 2025-04-14
With the libunwind fix, we've been able to continue looking at stack traces for AFs caused by an RSTL acquisition timeout. In a few cases, we've noticed that the crash did not indicate an actual deadlock. Instead, very slow nodes were still making progress but could not complete the lock acquisition within 30 seconds. In one case, the killOp thread was making steady progress killing user operations, but there were too many operations for it to finish in time. In another, the stepdown thread was waiting on the noop writer, which held the RSTL. Although the node fasserted, the stack trace output showed that the noop writer completed its task some time after the fassert, pointing to slowness rather than deadlock.
We should revisit our RSTL acquisition timeout strategy. One option is to determine whether 30 seconds is too aggressive; if so, we could increase the timeout to one or five minutes. Another option is to investigate whether this timeout is actually surfacing deadlocks, and whether the resulting node crashes are worth keeping in customer clusters. We've had 1100 manifestations of the RSTL timeout on v8.0 since November, and I suspect a number of them may not actually be deadlocks. If we find that the number of deadlocks caught is low, we could explore an alternative that preserves the debuggability but avoids crashing customer nodes (unless we hit a longer timeout).
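To make that last option concrete, here is a minimal sketch of a two-stage timeout, written against a generic C++ timed mutex rather than the actual RSTL code: the names (acquireRstlOrCrash, dumpLockDiagnostics) and the 30-second/5-minute values are purely illustrative assumptions, not the server implementation. The idea is to keep a short deadline for debuggability (dump diagnostics when it expires) and only crash the node if a much longer deadline also expires.

```cpp
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <mutex>

using namespace std::chrono_literals;

// Hypothetical stand-in for the RSTL; the real lock manager is far more involved.
std::timed_mutex rstl;

// Stand-in for dumping stack traces / lock diagnostics so we keep debuggability
// even when we choose not to crash immediately.
void dumpLockDiagnostics(const char* phase) {
    std::cerr << "RSTL acquisition still pending after " << phase
              << " timeout; dumping diagnostics\n";
}

// Two-stage acquisition: a short, non-fatal deadline followed by a longer
// deadline that still aborts the node if we appear to be truly stuck.
std::unique_lock<std::timed_mutex> acquireRstlOrCrash() {
    std::unique_lock<std::timed_mutex> lk(rstl, std::defer_lock);

    // Stage 1: the current-style 30-second deadline, but it only logs.
    if (lk.try_lock_for(30s)) {
        return lk;
    }
    dumpLockDiagnostics("30s");

    // Stage 2: a longer deadline (5 minutes here, illustrative). Only if this
    // also expires do we treat the situation as a likely deadlock and abort,
    // analogous to fassert-ing the node.
    if (lk.try_lock_for(5min)) {
        return lk;
    }
    std::cerr << "RSTL acquisition exceeded the long timeout; aborting\n";
    std::abort();
}

int main() {
    // Nothing else holds the stand-in lock, so this acquires immediately.
    auto lk = acquireRstlOrCrash();
    std::cout << "acquired RSTL stand-in\n";
}
```

In this shape, a slow-but-progressing node (like the killOp or noop-writer cases above) would surface diagnostics at 30 seconds but keep running, while a genuine deadlock would still be caught by the longer deadline.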