[SERVER-70115] Resharding Coordinator and Recipient Persist Invalid Start/End Times to State Document Created: 29/Sep/22 Updated: 29/Oct/23 Resolved: 17/Jan/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 6.3.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Brett Nawrocki | Assignee: | Adrian Gonzalez Montemayor |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sharding-nyc-subteam1 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Sprint: | Sharding 2022-10-17, Sharding 2022-12-12, Sharding NYC 2022-12-26, Sharding NYC 2023-01-09, Sharding NYC 2023-01-23 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 20 | ||||||||
| Story Points: | 3 | ||||||||
| Description |
|
The ReshardingCoordinatorService persists the start and end times for phase transitions as part of its state document. However, during phase transitions, the coordinator persists these times before setting them. The ReshardingRecipientService has a similar issue when transitioning to applying and to strict consistency. Note that the recipient service does set the copying start time before writing the value to disk, but this can leave an invalid start time in memory if persisting to disk fails. This issue affects the following functions involving state transitions:
For all of these functions, the timestamp for the start/end times should be chosen, written to disk, and then used to update the metrics in-memory. |
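The ordering described above (choose the timestamp, write it to disk, then update the in-memory metrics) can be sketched as follows. This is a minimal illustration, not the server's implementation: `PhaseMetrics`, `StateDocument`, `transitionToApplying`, and the `persist` callback are hypothetical names, and `std::optional` with `std::chrono` stands in for `boost::optional<Date_t>`.

```cpp
#include <chrono>
#include <functional>
#include <optional>
#include <stdexcept>

// Hypothetical stand-ins for illustration; none of these names come from the server code.
using TimePoint = std::chrono::system_clock::time_point;

struct PhaseMetrics {
    std::optional<TimePoint> applyingStart;  // in-memory value reported by metrics
};

struct StateDocument {
    std::optional<TimePoint> applyingStart;  // value stored in the on-disk state document
};

// Takes an already-chosen timestamp, persists it, and only then updates the in-memory state.
// `persist` may throw; in that case neither `doc` nor `metrics` is modified,
// so there is nothing to undo.
inline void transitionToApplying(StateDocument& doc,
                                 PhaseMetrics& metrics,
                                 TimePoint now,
                                 const std::function<void(const StateDocument&)>& persist) {
    StateDocument updated = doc;
    updated.applyingStart = now;  // 1. the timestamp is chosen once by the caller
    persist(updated);             // 2. write to disk (may throw)
    doc = updated;                // 3. adopt the persisted document
    metrics.applyingStart = now;  // 4. update metrics with the same timestamp
}
```

Because the metrics are touched only after `persist` returns, a failed write leaves no start time in memory, and the in-memory and on-disk values always come from the same timestamp.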
| Comments |
| Comment by Githook User [ 17/Jan/23 ] |
|
Author: {'name': 'Adrian Gonzalez', 'email': 'adriangonzalezmontemayor@gmail.com', 'username': 'adriangzz'}Message: |
| Comment by Max Hirschhorn [ 03/Oct/22 ] |
|
The idea would be to add a struct containing a boost::optional<Date_t> for each of the start and end times, and to use that struct in RecipientStateMachine::_transitionState(). For ReshardingCoordinator, the relevant start and end times would be set to non-boost::none values on the updatedCoordinatorDoc, and writeToCoordinatorStateNss() would append them to the setBuilder to update the on-disk state. |
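A rough sketch of the suggestion above, under stated assumptions: `PhaseTimes`, `SetBuilder`, `appendPhaseTimes`, and the field paths are hypothetical, `std::optional` stands in for `boost::optional<Date_t>`, and a plain map stands in for the BSON `$set` builder. The point is only that fields left unset are not written.

```cpp
#include <chrono>
#include <map>
#include <optional>
#include <string>

using TimePoint = std::chrono::system_clock::time_point;

// Hypothetical struct holding an optional start and end time for one phase.
struct PhaseTimes {
    std::optional<TimePoint> start;
    std::optional<TimePoint> end;
};

// Hypothetical stand-in for a $set builder: field path -> milliseconds since epoch.
using SetBuilder = std::map<std::string, long long>;

inline long long toMillis(TimePoint tp) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(tp.time_since_epoch()).count();
}

// Appends only the times that are set, so an update never overwrites
// a previously persisted value with an empty one.
inline void appendPhaseTimes(SetBuilder& setBuilder,
                             const std::string& fieldPrefix,
                             const PhaseTimes& times) {
    if (times.start) {
        setBuilder[fieldPrefix + ".start"] = toMillis(*times.start);
    }
    if (times.end) {
        setBuilder[fieldPrefix + ".end"] = toMillis(*times.end);
    }
}
```

A transition that only begins a phase would fill in `start` and leave `end` empty, so the resulting update contains a single field.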
| Comment by Brett Nawrocki [ 29/Sep/22 ] |
|
max.hirschhorn@mongodb.com Looking at this again, it actually seems like only in ReshardingRecipientService::RecipientStateMachine::_transitionToCloning do we do it in this way. I assumed it was being done in a consistent manner across state transitions. I think probably what we should really be doing is getting a timestamp, writing that down, and then updating the metrics with that timestamp for the reasons you mentioned. I'll rewrite the ticket. |
| Comment by Max Hirschhorn [ 29/Sep/22 ] |
The general rule we've followed in resharding and in other server components is to only update the in-memory state after the on-disk state has been updated. This way we don't need to undo the update to the in-memory state if the write to the on-disk state fails. What was the motivation for doing the reverse with these resharding metrics? |