- Type: Bug
- Resolution: Works as Designed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Cluster Scalability
The reshardingCriticalSectionTimeoutMillis parameter limits how long the critical section is held during resharding by scheduling a callback that triggers an error if the timeout is exceeded. SERVER-84709 made changes to ensure that the callback gets re-scheduled in the case of a stepdown. However, the callback only gets re-scheduled if the coordinator is in the blocking-write phase.
Without the timeout callback being re-scheduled in the committing phase, resharding could block writes for longer than the configured critical section timeout (default 5 seconds), defeating the purpose of the timeout parameter. However, the fix is not as simple as re-scheduling the callback in that phase as well, because aborting during the committing phase may not be safe once commit messages have been sent to participants.
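The gap described above can be illustrated with a minimal Python sketch. It is not the server's C++ implementation: the `Phase` enum, `remaining_ms`, and `should_rearm_timeout` are hypothetical names, and the sketch assumes the coordinator has access to the time at which the critical section is due to expire.

```python
from enum import Enum, auto


class Phase(Enum):
    # Hypothetical subset of coordinator phases, named after this ticket's wording.
    BLOCKING_WRITES = auto()
    COMMITTING = auto()


def remaining_ms(expires_at_ms: int, now_ms: int) -> int:
    """Time left before the critical section timeout elapses, floored at zero."""
    return max(0, expires_at_ms - now_ms)


def should_rearm_timeout(phase: Phase) -> bool:
    """Model of the behavior reported here: after step-up, the timeout
    callback is re-scheduled only in the blocking-writes phase."""
    return phase is Phase.BLOCKING_WRITES
```

Under this model, a coordinator that steps up in `Phase.COMMITTING` never re-arms the callback, regardless of how much of the timeout remains.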
One scenario that could happen:
- A coordinator transitioned to kCommitting.
- Failover occurs before the participants are told to commit.
- Upon step-up, the coordinator resumes the commit.
- Any of the work left in the commit protocol takes longer than the time remaining in the critical section, for example because of heavy replication lag while waiting for majority.
- The contract of the reshardingCriticalSectionTimeoutMillis parameter is broken.
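A rough timeline model of this scenario, as a Python sketch. The function name and the millisecond figures are made up for illustration, failover time itself is ignored, and capping the duration when the timeout is re-armed assumes an abort would be safe, which this ticket notes may not hold once commit messages have been sent.

```python
def blocked_duration_ms(
    before_failover_ms: int,
    commit_work_ms: int,
    timeout_ms: int,
    timeout_rearmed: bool,
) -> int:
    """Estimate how long writes stay blocked in the scenario above.

    before_failover_ms: time the critical section was held before failover.
    commit_work_ms: time needed to finish the commit after step-up.
    """
    total = before_failover_ms + commit_work_ms
    if timeout_rearmed:
        # A re-armed callback would fire once the timeout elapses.
        return min(total, timeout_ms)
    # No callback: nothing bounds the remaining commit work.
    return total


timeout_ms = 5000  # default reshardingCriticalSectionTimeoutMillis per this ticket
blocked = blocked_duration_ms(2000, 8000, timeout_ms, timeout_rearmed=False)
print(blocked > timeout_ms)  # True: writes are blocked past the timeout
```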
Is this the intended behavior? Investigate how to properly handle the timeout during the committing phase after a failover on the coordinator.
- is related to SERVER-84709 Resharding critical section timeout is not honored on stepdown (Closed)