[SERVER-53920] Periodically obtain remainingOperationTimeEstimatedMillis estimates from recipients for use by the ReshardingCoordinator Created: 20/Jan/21  Updated: 29/Oct/23  Resolved: 20/Apr/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.0.0-rc0

Type: New Feature Priority: Major - P3
Reporter: Lamont Nelson Assignee: Amirsaman Memaripour
Resolution: Fixed Votes: 0
Labels: PM-234-M3, PM-234-T-autocommits
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-55684 Remove resharding's RecipientStateEnu... Closed
is depended on by SERVER-56660 Remove need for reshardingCoordinator... Closed
Gantt Dependency
has to be done before SERVER-55683 Remove waiting for minimum duration f... Closed
Problem/Incident
causes SERVER-56215 Ensure current client is set in Coord... Closed
Related
is related to SERVER-53919 Add a remainingReshardingOperationTim... Closed
is related to SERVER-53921 Engage critical section once all reci... Closed
Backwards Compatibility: Fully Compatible
Sprint: Sharding 2021-03-08, Sharding 2021-03-22, Sharding 2021-04-05, Sharding 2021-04-19, Sharding 2021-05-03
Participants:
Linked BF Score: 134
Story Points: 2

 Description   

Add a remainingReshardingOperationTimeMillisThreshold server parameter to control when the coordinator should engage the critical section (default value of 2s)
Should contact each recipient to gather statistics.
Should continuously monitor while a ReshardingCoordinator service instance exists.
Should stop monitoring when the coordinator instance exits or we reach the critical section of the coordinator.
Should provide ability to determine if all recipients report they can finish within the remainingReshardingOperationTimeMillisThreshold.
Engage critical section once all recipients report they can finish within remainingReshardingOperationTimeMillisThreshold.
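
The "all recipients report they can finish within the threshold" check described above could be sketched roughly as follows. This is an illustrative Python sketch, not the server's C++ implementation: the function name, the recipient-id-to-estimate mapping, and the handling of missing estimates are all assumptions. The description says "within" without stating whether the boundary is inclusive, so the strict comparison here is also an assumption.

```python
from typing import Mapping, Optional

# Default taken from the ticket description (2s), expressed in milliseconds.
DEFAULT_THRESHOLD_MILLIS = 2_000


def all_recipients_within_threshold(
    estimates: Mapping[str, Optional[int]],
    threshold_millis: int = DEFAULT_THRESHOLD_MILLIS,
) -> bool:
    """Return True only if every recipient's estimate is below the threshold.

    `estimates` maps a (hypothetical) recipient shard id to its reported
    remainingOperationTimeEstimatedMillis, or None if no estimate is available.
    """
    if not estimates:
        # Assumption: with no recipients reporting, do not engage the critical section.
        return False
    return all(
        est is not None and est < threshold_millis
        for est in estimates.values()
    )
```

For example, `all_recipients_within_threshold({"shard0": 300, "shard1": 1200})` is true with the default threshold, while a single recipient reporting `None` or a value of 2000 or more keeps the result false.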



 Comments   
Comment by Githook User [ 20/Apr/21 ]

Author: Amirsaman Memaripour <amirsaman.memaripour@mongodb.com> (username: samanca)

Message: SERVER-53920 Periodically obtain remainingOperationTimeEstimatedMillis estimates from recipients for use by the ReshardingCoordinator
Branch: master
https://github.com/mongodb/mongo/commit/8c83c443fc03fbb4cbe8323062d011738a284107

Comment by Max Hirschhorn [ 24/Feb/21 ]
  • Wait for all recipients to enter the steady state. I believe that corresponds to the continuation that comes after this one.

Yes, once all recipients have reached state RecipientStateEnum::kSteadyState, the resharding operation becomes eligible to be committed. (There'd be no benefit to blocking writes on donor shards while the recipients are still doing their initial collection clone.)

To be slightly more precise, it corresponds to _reshardingCoordinatorObserver->awaitAllRecipientsFinishedApplying() becoming ready.

  • Start a new observer service (this ticket), and keep it running so long as the resharding operation is not cancelled and the coordinator has not entered a critical section. This would postpone the execution of any continuation after this point.
  • Once the maximum is less than the threshold (i.e., remainingReshardingOperationTimeMillisThreshold), notify the coordinator so that it enters the critical section. I'm not sure what state corresponds to the critical section for the coordinator and donors.

The action of the resharding coordinator that this ticket should postpone is specifically the transition to CoordinatorStateEnum::kMirroring (to be renamed to CoordinatorStateEnum::kBlockingWrites or similar as part of SERVER-54512). The goal of this ticket is to defer donor shards from starting to block writes until it appears that the recipient shards are mostly caught up.

Comment by Amirsaman Memaripour [ 23/Feb/21 ]

max.hirschhorn and lamont.nelson, here is my understanding of what this ticket should do, along with a few questions:

  • Wait for all recipients to enter the steady state. I believe that corresponds to the continuation that comes after this one.
  • Start a new observer service (this ticket), and keep it running so long as the resharding operation is not cancelled and the coordinator has not entered a critical section. This would postpone the execution of any continuation after this point.
  • Gather the currentOp output for each recipient, and calculate the maximum of the collected remainingOperationTimeEstimatedMillis for the recipients.
  • Once the maximum is less than the threshold (i.e., remainingReshardingOperationTimeMillisThreshold), notify the coordinator so that it enters the critical section. I'm not sure what state corresponds to the critical section for the coordinator and donors.
  • Once the coordinator enters the critical section and persists the state change, donors will get notified through the coordinator document.
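
The polling steps listed above (gather each recipient's estimate, take the maximum, compare it against the threshold, stop on cancellation) might look roughly like the sketch below. It is written in Python for brevity and is not the actual C++ change in the linked commit. `monitor_recipients`, `fetch_estimate_millis`, `stop_event`, and `poll_interval_secs` are hypothetical names; in particular, `fetch_estimate_millis` stands in for whatever mechanism obtains remainingOperationTimeEstimatedMillis from a recipient.

```python
import threading
from typing import Callable, Iterable, Optional


def monitor_recipients(
    recipients: Iterable[str],
    fetch_estimate_millis: Callable[[str], Optional[int]],
    threshold_millis: int,
    stop_event: threading.Event,
    poll_interval_secs: float = 1.0,
) -> bool:
    """Poll until the slowest recipient's estimate drops below the threshold.

    Returns True when the coordinator may enter the critical section, or
    False if `stop_event` was set first (e.g. the operation was cancelled).
    """
    recipients = list(recipients)
    while not stop_event.is_set():
        estimates = [fetch_estimate_millis(r) for r in recipients]
        # A missing estimate means we cannot conclude that recipient is caught up.
        if estimates and all(e is not None for e in estimates):
            if max(estimates) < threshold_millis:
                return True
        # Event.wait doubles as an interruptible sleep between polls.
        stop_event.wait(poll_interval_secs)
    return False
```

Using the maximum means the slowest recipient gates the transition, which matches the stated goal of deferring write blocking on donors until the recipients appear mostly caught up.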
Generated at Thu Feb 08 05:32:11 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.