Type: Task
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: None
Cluster Scalability
ClusterScalability 22Jun-6Jul
Background
Participant hangs can arise from several hard-to-observe conditions:
- Interference from stale coordinator commands
- A hidden bug in which a participant state transition is not durable when the coordinator believes it is, so promise emplacements are lost across failovers
- A participant that is unavailable, or that repeatedly fails commands, in a way that makes resharding appear stuck rather than failed
Without observability that exposes these conditions, it is difficult to determine the cause of a hang quickly.
Mitigation
Logs
| Log | Why it helps |
| Entry and exit logs for coordinator commands (with resharding UUID, lsid, txnNum, and participant phase). | Reconstructs the command flow to determine the cause of a hang. |
| Coordinator command retry attempts, with the reason for each retry. | Makes retries visible and helps identify transient failures. |
| A log each time a participant starts waiting on a promise emplaced by the coordinator, and another when that promise is resolved. | Quickly identifies which unresolved promise is blocking the participant. |
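As a rough illustration of the log shapes proposed above, the following Python sketch emits one structured line per event. It is not the server implementation: the logger name, the event names (`commandEntry`, `commandExit`, `commandRetry`, `promiseWaitStart`, `promiseResolved`), and all function names are assumptions made for illustration. Only the logged fields (resharding UUID, lsid, txnNum, participant phase, retry reason) come from the table.

```python
import json
import logging
from contextlib import contextmanager

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("resharding.participant")


def _emit(event, **fields):
    # One JSON object per line, so a hang can be reconstructed by
    # filtering on reshardingUUID and ordering by timestamp.
    logger.info(json.dumps({"event": event, **fields}, sort_keys=True))


@contextmanager
def coordinator_command(name, resharding_uuid, lsid, txn_num, phase):
    """Log entry and exit of a coordinator command handled by a participant."""
    ctx = dict(command=name, reshardingUUID=resharding_uuid,
               lsid=lsid, txnNum=txn_num, participantPhase=phase)
    _emit("commandEntry", **ctx)
    try:
        yield
    except Exception as exc:
        # Record the failure on exit, then let the caller see the error.
        _emit("commandExit", status="error", reason=str(exc), **ctx)
        raise
    else:
        _emit("commandExit", status="ok", **ctx)


def log_retry(name, resharding_uuid, attempt, reason):
    """Log a coordinator retry attempt together with its reason."""
    _emit("commandRetry", command=name, reshardingUUID=resharding_uuid,
          attempt=attempt, reason=reason)


def log_promise_wait(promise, resharding_uuid, resolved):
    """Log the start of a promise wait, or its resolution."""
    event = "promiseResolved" if resolved else "promiseWaitStart"
    _emit(event, promise=promise, reshardingUUID=resharding_uuid)
```

With this shape, a `promiseWaitStart` line that has no matching `promiseResolved` line for the same resharding UUID points directly at the promise blocking the participant.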
FTDC Metrics
| Metric | Why it helps |
| Counter for coordinator retries, broken down by command type. | Makes retries visible and shows which repeated transient failure is causing a hang. |
| Time spent waiting on promises to be emplaced by coordinator commands, broken down by command type. | Shows in FTDC where participants spend time blocked waiting for a signal from the coordinator. |
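The two metrics above could be modeled as a per-command-type counter and a per-command-type accumulated wait time. The sketch below is a minimal Python model under that assumption. The class name, method names, and snapshot field names (`coordinatorRetries`, `promiseWaitMillis`) are hypothetical, and how the values would actually be surfaced to FTDC is outside its scope.

```python
import time
from collections import Counter, defaultdict
from contextlib import contextmanager


class ParticipantMetrics:
    """Hypothetical in-memory holder for the two proposed metrics."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the timing logic can be tested.
        self._clock = clock
        self.retries = Counter()
        self.promise_wait_millis = defaultdict(int)

    def record_retry(self, command_type):
        # Counter for coordinator retries, broken down by command type.
        self.retries[command_type] += 1

    @contextmanager
    def waiting_on_promise(self, command_type):
        # Accumulate time spent blocked, even if the wait ends in an error.
        start = self._clock()
        try:
            yield
        finally:
            elapsed_ms = int((self._clock() - start) * 1000)
            self.promise_wait_millis[command_type] += elapsed_ms

    def snapshot(self):
        # Plain dicts, suitable for periodic sampling by a metrics collector.
        return {
            "coordinatorRetries": dict(self.retries),
            "promiseWaitMillis": dict(self.promise_wait_millis),
        }
```

Keeping both metrics keyed by command type means a steadily growing retry count or wait time for a single command identifies which coordinator interaction the participant is stuck on.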