Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 9.0.0-rc0
Affects Version/s: None
Component/s: None
Labels: None
Assigned Teams: Cluster Scalability
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: ClusterScalability 27Apr-11May
Linked BF Score: 200
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

The resharding coordinator is written, for the most part, as one large future chain whose continuations are scheduled sequentially. Although different continuations may run on different executor threads, they never run concurrently, because each continuation is scheduled only after the previous one completes. As a result, the main future chain can read the coordinator's state without holding locks: that same chain is the only writer, so there is no race.
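The sequential-scheduling property can be illustrated with a small Python sketch. This is not the coordinator's code: `run_chain`, `make_step`, and the step names are hypothetical, and a `ThreadPoolExecutor` stands in for the server's executor. The sketch only shows that when each step is submitted after the previous one completes, steps may land on different threads but never overlap, so the shared state needs no lock.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_chain(executor, steps, state):
    """Run `steps` as a chain: step N+1 is submitted only after step N completes.

    Hypothetical helper for illustration; errors raised by a step are ignored
    here to keep the sketch short.
    """
    remaining = list(steps)
    finished = threading.Event()

    def schedule_next(_previous=None):
        if not remaining:
            finished.set()
            return
        step = remaining.pop(0)
        # The done-callback fires only once this step has completed, so at
        # most one step is ever in flight.
        executor.submit(step, state).add_done_callback(schedule_next)

    schedule_next()
    finished.wait()
    return state


def make_step(name):
    def step(state):
        # Unlocked reads and writes: safe only because steps never overlap.
        state["active"] += 1
        state["max_active"] = max(state["max_active"], state["active"])
        state["log"].append(name)
        state["active"] -= 1
    return step


state = {"log": [], "active": 0, "max_active": 0}
with ThreadPoolExecutor(max_workers=4) as pool:
    run_chain(pool, [make_step(n) for n in ("prepare", "clone", "commit")], state)
```

Even with four worker threads available, `state["max_active"]` never exceeds 1, which is the guarantee the lock-free reads rely on.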

However, this property is violated when _onAbortCoordinatorAndParticipants calls _awaitAllParticipantShardsDone, because the resulting future is awaited using the WithCancellation helper. When WithCancellation returns a CallbackCancelled error, the main future chain proceeds past the wait even though the awaited work has not finished. Because this happens only when the stepdown token is cancelled, the main executor will likely refuse further work, so the chain actually continues on the cleanup executor.

Meanwhile, the original _awaitAllParticipantShardsDone may still be doing work on a separate executor thread, unaware that this node is stepping down because it has not yet reached a point where interrupts are checked. That abandoned work may therefore be removing the state document and updating the in-memory state at the same time as the main chain's cleanup logic reads that state to perform final logging.
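The hazard described above can be sketched in Python. This is an illustration only, not MongoDB's implementation: `with_cancellation` is a hypothetical stand-in for the WithCancellation helper, a plain `Future` stands in for the stepdown token, and `release` simulates a stretch of work that contains no interrupt check. The point is that cancelling the wait does not stop the work being waited on.

```python
import threading
from concurrent.futures import FIRST_COMPLETED, Future, ThreadPoolExecutor, wait


class CallbackCancelled(Exception):
    """Raised when the wait is abandoned (hypothetical stand-in)."""


def with_cancellation(work, cancel_token):
    """Wait for `work`, but give up as soon as `cancel_token` completes.

    Giving up only abandons the wait; `work` itself keeps running.
    """
    wait([work, cancel_token], return_when=FIRST_COMPLETED)
    if work.done():
        return work.result()
    raise CallbackCancelled("wait abandoned; the work was not stopped")


state = {"phase": "aborting"}
release = threading.Event()


def await_participants_done():
    release.wait()           # no interrupt check until this returns
    state["phase"] = "done"  # write that can race with cleanup reads


with ThreadPoolExecutor(max_workers=1) as pool:
    work = pool.submit(await_participants_done)
    stepdown_token = Future()
    stepdown_token.set_result(None)  # simulate stepdown

    try:
        with_cancellation(work, stepdown_token)
        outcome = "completed"
    except CallbackCancelled:
        outcome = "cancelled"

    # The chain has moved on to cleanup, yet the work is still in flight.
    still_running = not work.done()
    seen_by_cleanup = state["phase"]

    release.set()  # the abandoned work now finishes and mutates state
    work.result()

seen_after = state["phase"]
```

In this sketch the timing is forced with an event so the result is repeatable; in a real system the write could land at any moment during cleanup, which is what makes the unlocked read unsafe.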

This leads to the issue seen in BF-43147.

Assignee: Brett Nawrocki
Reporter: Brett Nawrocki
Participants: Brett Nawrocki, Githook User, TPM Jira Automations Bot
Votes: 0
Watchers: 5

Created: May 07 2026 07:21:50 PM UTC
Updated: May 08 2026 10:11:03 PM UTC
Resolved: May 08 2026 10:11:03 PM UTC
