[SERVER-70115] Resharding Coordinator and Recipient Persist Invalid Start/End Times to State Document Created: 29/Sep/22  Updated: 29/Oct/23  Resolved: 17/Jan/23

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 6.3.0-rc0

Type: Bug Priority: Major - P3
Reporter: Brett Nawrocki Assignee: Adrian Gonzalez Montemayor
Resolution: Fixed Votes: 0
Labels: sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-70111 ShardingDataTransform Metrics Interfa... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2022-10-17, Sharding 2022-12-12, Sharding NYC 2022-12-26, Sharding NYC 2023-01-09, Sharding NYC 2023-01-23
Participants:
Linked BF Score: 20
Story Points: 3

 Description   

The ReshardingCoordinatorService persists the start and end times for phase transitions as part of its state document. However, during phase transitions, the coordinator persists these times before they have been set, so the values written to the state document are invalid.

The ReshardingRecipientService has a similar issue when transitioning to applying and strict consistency.

It's worth noting that the recipient service will set the copying start time before writing this value to disk. However, this has the problem of leaving a potentially invalid start time in memory if persisting to disk fails.

This issue affects the following functions involving state transitions:

For all of these functions, the timestamp for the start/end times should be chosen, written to disk, and then used to update the metrics in-memory.
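The ordering described above (choose the timestamp, write it to disk, then update the in-memory metrics) can be sketched roughly as follows. This is only an illustration under stated assumptions: StateDocument, InMemoryMetrics, transitionToCopying and the persist callback are hypothetical stand-ins, not names from the server code, and persist is assumed to throw when the write fails.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <optional>
#include <stdexcept>

// Hypothetical stand-ins; none of these names come from the MongoDB codebase.
using TimePoint = std::chrono::system_clock::time_point;

struct StateDocument {    // what would be persisted to disk
    std::optional<TimePoint> copyStart;
};

struct InMemoryMetrics {  // what the in-memory metrics would hold
    std::optional<TimePoint> copyStart;
};

// Chooses a timestamp, persists it, and only then updates the in-memory state.
// Assumption: `persist` throws if the write to disk fails.
inline void transitionToCopying(StateDocument& doc,
                                InMemoryMetrics& metrics,
                                const std::function<void(const StateDocument&)>& persist,
                                TimePoint now = std::chrono::system_clock::now()) {
    assert(persist && "a persist callback is required");

    StateDocument updated = doc;  // 1. choose the timestamp on a copy
    updated.copyStart = now;

    persist(updated);             // 2. write to disk; may throw

    doc = updated;                // 3. reached only if the write succeeded
    metrics.copyStart = now;
}
```

If persist throws, neither doc nor metrics has been modified, so there is nothing to undo in memory.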



 Comments   
Comment by Githook User [ 17/Jan/23 ]

Author:

{'name': 'Adrian Gonzalez', 'email': 'adriangonzalezmontemayor@gmail.com', 'username': 'adriangzz'}

Message: SERVER-70115 Resharding Coordinator and Recipient Persist Invalid Start/End Times to State Document
Branch: master
https://github.com/mongodb/mongo/commit/2731779814f9de99958826c3aeca1ee93ece4743

Comment by Max Hirschhorn [ 03/Oct/22 ]

The idea would be to add a struct containing a boost::optional<Date_t> for each of the start and end times. That struct would be used in RecipientStateMachine::_transitionState() and in ReshardingCoordinator's writeToCoordinatorStateNss(): the individual start and end times would be set to non-boost::none values on the updatedCoordinatorDoc, and writeToCoordinatorStateNss() would append them to the setBuilder to update the on-disk state.
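
A rough sketch of that shape, for illustration only. It uses std::optional in place of boost::optional, a plain integer in place of Date_t, and a std::map in place of the BSON setBuilder so that it compiles on its own; PhaseTimes, appendPhaseTimes and the "copying.start" style field names are hypothetical and do not reflect the actual state document schema.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for Date_t: milliseconds since the Unix epoch.
using DateMillis = std::int64_t;

// Hypothetical struct holding an optional start and end time for one phase.
// An empty optional plays the role of boost::none ("do not update this field").
struct PhaseTimes {
    std::optional<DateMillis> start;
    std::optional<DateMillis> end;
};

// Hypothetical stand-in for the builder of the "$set" portion of the update.
using SetBuilder = std::map<std::string, DateMillis>;

// Appends only the times that are set, so fields left unset are not
// overwritten in the on-disk document.
inline void appendPhaseTimes(SetBuilder& setBuilder,
                             const std::string& prefix,
                             const PhaseTimes& times) {
    if (times.start) {
        setBuilder[prefix + ".start"] = *times.start;
    }
    if (times.end) {
        setBuilder[prefix + ".end"] = *times.end;
    }
}
```

Because the same struct value is what gets appended to the update, the caller can pass it on to the in-memory metrics after the write succeeds, keeping both views on the same timestamp.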

Comment by Brett Nawrocki [ 29/Sep/22 ]

max.hirschhorn@mongodb.com Looking at this again, it actually seems like only in ReshardingRecipientService::RecipientStateMachine::_transitionToCloning do we do it in this way. I assumed it was being done in a consistent manner across state transitions.

I think probably what we should really be doing is getting a timestamp, writing that down, and then updating the metrics with that timestamp for the reasons you mentioned. I'll rewrite the ticket.

Comment by Max Hirschhorn [ 29/Sep/22 ]

These values should be set before they are written down, as is already being done on the recipient.

The general rule we've followed in resharding and in other server components is to only update the in-memory state after the on-disk state has been updated. This way we don't need to undo the update to the in-memory state if the write to the on-disk state fails. What was the motivation for doing the reverse with these resharding metrics?

Generated at Thu Feb 08 06:15:18 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.