-
Type:
Task
-
Resolution: Unresolved
-
Priority:
Major - P3
-
None
-
Affects Version/s: None
-
Component/s: None
-
None
-
None
-
None
-
None
-
None
-
None
-
None
Summary
TLA+ specification at src/mongo/tla_plus/Sharding/ReshardingCoordinator/ models the resharding coordinator state machine and the SERVER-115139 race. TLC produces a counterexample showing that an _shardsvrAbortReshardCollection arriving in the (Stepup, InitCancelSource) window is silently dropped under the buggy implementation.
Files
- ReshardingCoordinator.tla (511 lines) — actors: 1 Coordinator, 2 Donors, 2 Recipients. State per actor: durable phase, primary, hasCancelSource, pendingAbort.
- {{MCReshardingCoordinator.
{tla,cfg}
}} — green cfg (BugMode = FALSE). All invariants pass.
- MCReshardingCoordinatorAbortBeforeInit.cfg — bug cfg (BugMode = TRUE). Encodes the
SERVER-115139race. - OWNERS.yml — 10gen/server-sharding.
Invariants checked
- AbortAlwaysHandled — if abortRequested, the abort must be observed somewhere or queued in pendingAbort.
- NoCommitAfterAbort — coordinator does not commit through an observed abort.
- EveryStepupInitializesCancelSource — no abort sits in pendingAbort on an actor that already has its cancellation source.
- StepdownDestroysCancelSource, CoordMonotoneOnSuccess, ParticipantsRespectCoordinator, TypeOK.
TLC verdicts
| cfg | states / distinct | depth | result |
|---|---|---|---|
| green (BugMode = FALSE) | 295,339 / 62,683 | 25 | all invariants hold |
| bug (BugMode = TRUE) | 6,803 / 2,073 | 8 | AbortAlwaysHandled violated |
Counterexample (bug cfg, 6 states)
- Init: all actors primary, all hasCancelSource = TRUE.
- CoordAdvance: coordState = coordInitializing.
- Stepdown(c): primary[c] = FALSE, hasCancelSource[c] = FALSE.
- AbortRequest(c): abortRequested = TRUE; BugMode drops it — abortObserved stays FALSE, pendingAbort[c] stays FALSE.
- Stepup(c): primary[c] = TRUE, hasCancelSource[c] still FALSE (init gap).
- InitCancelSource(c): hasCancelSource[c] = TRUE, no queued abort to drain.
Final state: abortRequested = TRUE, abortObserved = FALSE, all hasCancelSource = TRUE, no pendingAbort. AbortAlwaysHandled is violated — the abort was accepted by the router but is invisible to every cancellation source.
Reproducing
cd src/mongo/tla_plus ./download-tlc.sh ./model-check.sh Sharding/ReshardingCoordinator # green: pass cp Sharding/ReshardingCoordinator/MCReshardingCoordinatorAbortBeforeInit.cfg \ Sharding/ReshardingCoordinator/MCReshardingCoordinator.cfg ./model-check.sh Sharding/ReshardingCoordinator # bug: AbortAlwaysHandled violated
Related
SERVER-115139— ReshardingDonorService cannot reliably handle the case where _shardsvrAbortReshardCollection command comes in before the cancellation source is initialized upon stepup recovery
- is related to
-
SERVER-115139 Resharding{Donor,Recipient}Service cannot reliably handle the case where _shardsvrAbortReshardCollection command comes in before the cancellation source is initialized upon stepup recovery
-
- Closed
-