Background

Participant hangs can arise from several hard-to-observe conditions:

Interference from stale coordinator commands
A hidden bug in which a participant state transition is not yet durable when the coordinator believes it is, so promise emplacements are lost across failovers
A participant that is unavailable, or that repeatedly fails commands, in a way that makes resharding appear stuck rather than failed

Without observability that exposes these conditions, it is difficult to determine the cause of a hang quickly.

Log	Why it helps
Entry and exit logs for coordinator commands (with resharding UUID, lsid, txnNum and participant phase).	Reconstructs the command flow to determine the cause of a hang
Coordinator command retry attempts, with the reason for each retry.	Makes retries visible and helps identify transient failures
A log each time a participant starts waiting for a promise emplaced by the coordinator, and another when that promise is resolved.	Quickly identifies which unresolved promise is blocking the participant
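
The three log points in the table above could be sketched roughly as follows. This is illustrative Python only, not the server's logging API: the names `logged_command`, `LoggedPromise`, `run_with_retries`, the logger name, and the message wording are all hypothetical, and the field values are samples.

```python
import logging
import threading
import time
from contextlib import contextmanager

# Hypothetical logger name, chosen for this sketch only.
log = logging.getLogger("resharding.participant")


@contextmanager
def logged_command(command, resharding_uuid, lsid, txn_num, phase):
    """Log entry and exit of one coordinator command with its identifying fields."""
    fields = {
        "command": command,
        "reshardingUUID": resharding_uuid,
        "lsid": lsid,
        "txnNum": txn_num,
        "participantPhase": phase,
    }
    log.info("coordinator command entry %s", fields)
    try:
        yield fields
    except Exception as exc:
        log.info("coordinator command exit %s outcome=failed reason=%r", fields, exc)
        raise
    else:
        log.info("coordinator command exit %s outcome=ok", fields)


def run_with_retries(command, send, max_attempts=3):
    """Call send(); log each retry with its reason, re-raise on the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except Exception as exc:
            if attempt == max_attempts:
                raise
            log.warning(
                "coordinator command retry command=%s attempt=%d reason=%r",
                command, attempt, exc,
            )


class LoggedPromise:
    """A one-shot promise that logs when a waiter starts and stops waiting."""

    def __init__(self, name):
        self.name = name
        self._event = threading.Event()
        self._value = None

    def emplace(self, value):
        self._value = value
        self._event.set()

    def wait(self, timeout=None):
        log.info("waiting for promise %s", self.name)
        start = time.monotonic()
        resolved = self._event.wait(timeout)
        elapsed = time.monotonic() - start
        state = "resolved" if resolved else "still unresolved"
        log.info("promise %s %s after %.3fs", self.name, state, elapsed)
        return self._value if resolved else None
```

With logs of this shape, a hang shows up as a "waiting for promise" line with no matching "resolved" line, and the surrounding entry, exit, and retry lines show which coordinator command was expected to emplace it.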

Metric	Why it helps
Counter for coordinator retries, broken down by command type.	Makes retries visible and shows which repeated transient failure is causing a hang
Time spent waiting for promises to be emplaced by coordinator commands, broken down by command type.	Shows in FTDC where participants spend time blocked waiting for a signal from the coordinator
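
As a rough sketch of the two metrics in the table above, the following Python keeps a retry counter and a cumulative wait time, each keyed by command type. The class name, method names, snapshot field names, and the command-type labels in the usage note are hypothetical, and the export path to FTDC is not modeled here.

```python
import time
from collections import Counter, defaultdict
from contextlib import contextmanager


class ReshardingParticipantMetrics:
    """Hypothetical in-memory holder for the two per-command-type metrics."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so tests can use a fake clock
        self.coordinator_retries = Counter()
        self.promise_wait_seconds = defaultdict(float)

    def record_retry(self, command_type):
        """Count one coordinator retry for the given command type."""
        self.coordinator_retries[command_type] += 1

    @contextmanager
    def time_promise_wait(self, command_type):
        """Add the time spent inside the block to that command type's wait total."""
        start = self._clock()
        try:
            yield
        finally:
            self.promise_wait_seconds[command_type] += self._clock() - start

    def snapshot(self):
        """Return plain dicts suitable for periodic collection."""
        return {
            "coordinatorRetries": dict(self.coordinator_retries),
            "promiseWaitSeconds": dict(self.promise_wait_seconds),
        }
```

A participant would call `record_retry("commit")` whenever it observes a retried command and wrap each blocking wait in `with metrics.time_promise_wait("commit"):`. A command type whose wait total keeps growing between snapshots while its retry count also climbs points at a repeated transient failure rather than a lost promise.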