Type: Task
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: None
Replication
Fully Compatible
Repl 2025-03-17, Repl 2025-03-31, Repl 2025-04-14
With the libunwind fix, we've been able to continue looking at stack traces for AFs caused by an RSTL acquisition timeout. In a few cases, we've noticed that the crash did not indicate an actual deadlock. Instead, very slow nodes were still making progress but could not complete the lock acquisition within 30 seconds. In one case, the killOp thread was making steady progress killing user operations, but there were too many operations for it to finish in time. In another, the stepdown thread was waiting on the noop writer, which held the RSTL. Although the node fasserted, the stack trace output showed that the noop writer completed its task some time after the fassert, pointing to slowness rather than deadlock.
We should revisit our RSTL acquisition timeout strategy. One option is to determine whether 30 seconds is too aggressive; if so, we could increase the timeout to one or five minutes. Another option is to investigate whether this timeout is actually surfacing deadlocks, and whether the resulting node crashes are worth keeping in customer clusters. We've had 1100 manifestations of the RSTL timeout on v8.0 since November, and I suspect a number of them may not actually be deadlocks. If we find that the number of deadlocks caught is low, we could explore an alternative that preserves the debuggability but avoids crashing customer nodes (unless we hit a longer timeout).
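To make that last option concrete, here is a minimal sketch of a two-stage timeout, written against a generic C++ timed mutex rather than the actual RSTL code: the names (acquireRstlOrCrash, dumpLockDiagnostics) and the 30-second/5-minute values are purely illustrative assumptions, not the server implementation. The idea is to keep a short deadline for debuggability (dump diagnostics when it expires) and only crash the node if a much longer deadline also expires.

```cpp
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <mutex>

using namespace std::chrono_literals;

// Hypothetical stand-in for the RSTL; the real lock manager is far more involved.
std::timed_mutex rstl;

// Stand-in for dumping stack traces / lock diagnostics so we keep debuggability
// even when we choose not to crash immediately.
void dumpLockDiagnostics(const char* phase) {
    std::cerr << "RSTL acquisition still pending after " << phase
              << " timeout; dumping diagnostics\n";
}

// Two-stage acquisition: a short, non-fatal deadline followed by a longer
// deadline that still aborts the node if we appear to be truly stuck.
std::unique_lock<std::timed_mutex> acquireRstlOrCrash() {
    std::unique_lock<std::timed_mutex> lk(rstl, std::defer_lock);

    // Stage 1: the current-style 30-second deadline, but it only logs.
    if (lk.try_lock_for(30s)) {
        return lk;
    }
    dumpLockDiagnostics("30s");

    // Stage 2: a longer deadline (5 minutes here, illustrative). Only if this
    // also expires do we treat the situation as a likely deadlock and abort,
    // analogous to fassert-ing the node.
    if (lk.try_lock_for(5min)) {
        return lk;
    }
    std::cerr << "RSTL acquisition exceeded the long timeout; aborting\n";
    std::abort();
}

int main() {
    // Nothing else holds the stand-in lock, so this acquires immediately.
    auto lk = acquireRstlOrCrash();
    std::cout << "acquired RSTL stand-in\n";
}
```

In this shape, a slow-but-progressing node (like the killOp or noop-writer cases above) would surface diagnostics at 30 seconds but keep running, while a genuine deadlock would still be caught by the longer deadline.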