[SERVER-67650] Resharding recipient can return remainingOperationTimeEstimatedSecs=0 when the oplog applier hasn't caught up with the oplog fetcher Created: 29/Jun/22  Updated: 29/Oct/23  Resolved: 11/Aug/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 5.3.0, 5.0.0, 6.0.0-rc12
Fix Version/s: 5.0.13, 6.0.2

Type: Bug Priority: Major - P3
Reporter: Cheahuychou Mao Assignee: Andrew Witten (Inactive)
Resolution: Fixed Votes: 0
Labels: sharding-nyc-subteam2
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-67653 Resharding coordinator can incorrectl... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0
Participants:
Story Points: 3

 Description   

If a recipient receives a _shardsvrReshardingOperationTime command right after it has transitioned to the "applying" state (i.e. oplogEntriesApplied = 0), 'remainingOperationTimeEstimatedSecs' would be calculated using 'bytesCopied' and 'bytesToCopy', and the elapsed time of the "cloning" state.
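To make the calculation above concrete, here is a minimal sketch of a linear extrapolation from the cloning progress. It is not the server's actual implementation: the function name `estimate_remaining_secs`, its parameters, and the exact formula are assumptions chosen for illustration. It shows how the estimate depends directly on the elapsed time of the "cloning" state.

```python
from typing import Optional


def estimate_remaining_secs(
    bytes_copied: int, bytes_to_copy: int, elapsed_cloning_secs: float
) -> Optional[float]:
    """Hypothetical estimate: extrapolate linearly from cloning progress.

    Returns None when there is no progress to extrapolate from.
    """
    if bytes_copied <= 0 or bytes_to_copy <= 0:
        return None
    # Fraction of work still outstanding, relative to the work already done.
    remaining_fraction = max(0.0, bytes_to_copy / bytes_copied - 1.0)
    # Any estimate of this shape collapses to 0 when the elapsed time is 0.
    return elapsed_cloning_secs * remaining_fraction


# Half of the data copied in 10 seconds: about 10 more seconds expected.
halfway = estimate_remaining_secs(50, 100, 10.0)
# Same progress, but the elapsed time is reported as 0.
zero_elapsed = estimate_remaining_secs(50, 100, 0.0)
```

Under these assumptions, `zero_elapsed` is 0 even though half of the data is still uncopied, so an unrecorded or zero elapsed time makes the operation look finished.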

It turns out that the start time of the "cloning" state only gets initialized when a recipient transitions from the "create-collection" state to the "cloning" state. If the ReshardingRecipientService instance is created while the recipient is already in the "cloning" state (i.e. on restart or stepup), we would skip the state transition, so the start time would be uninitialized. Consequently, 'remainingOperationTimeEstimatedSecs' would be 0 since the elapsed time for the cloning state would be 0. The issue here is also that there isn't a mechanism for persisting the start time and recovering it on stepup. Returning remainingOperationTimeEstimatedSecs=0 would cause the coordinator to think that it can start the critical section, and the resharding operation would then fail with ReshardingCriticalSectionTimeout if the recipient doesn't manage to enter the "strict-consistency" state within the timeout.

The same bug exists for the start time for the "applying" state. 
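The missing mechanism described above, persisting each state's start time and recovering it on stepup, could look roughly like the following sketch. All names here (`RecipientMetrics`, the `metrics` sub-document, the `<state>StartTime` keys) are hypothetical and are not taken from the server code. Timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RecipientMetrics:
    """Hypothetical in-memory metrics for one recipient."""

    start_times: Dict[str, float] = field(default_factory=dict)

    def on_state_transition(self, new_state: str, now: float, state_doc: dict) -> None:
        # Record the start time in memory and in the persisted state document.
        self.start_times[new_state] = now
        state_doc.setdefault("metrics", {})[new_state + "StartTime"] = now

    @classmethod
    def recover(cls, state_doc: dict) -> "RecipientMetrics":
        # On restart or stepup, rebuild start times from the persisted document
        # instead of relying on a state transition that will not happen again.
        persisted = state_doc.get("metrics", {})
        suffix = "StartTime"
        return cls(
            start_times={
                key[: -len(suffix)]: value
                for key, value in persisted.items()
                if key.endswith(suffix)
            }
        )

    def elapsed_secs(self, state: str, now: float) -> float:
        start = self.start_times.get(state)
        # Without a recorded start time the elapsed time is reported as 0.
        return 0.0 if start is None else max(0.0, now - start)


state_doc: dict = {}
RecipientMetrics().on_state_transition("cloning", 100.0, state_doc)

# New instance after stepup: without recovery the elapsed time is 0.
without_recovery = RecipientMetrics().elapsed_secs("cloning", 160.0)
# With recovery the elapsed time reflects the original start.
with_recovery = RecipientMetrics.recover(state_doc).elapsed_secs("cloning", 160.0)
```

The same approach would apply to the "applying" state, since each state's start time is stored under its own key.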



 Comments   
Comment by Githook User [ 12/Sep/22 ]

Author:

{'name': 'Andrew Witten', 'email': 'andrew.witten@mongodb.com', 'username': 'awitten1'}

Message: SERVER-67650 persist additional recipient resharding metrics

(cherry picked from commit bbc0bb5e9878e48b6bc3b666affbdf102b379450)

fix compilation errors after cherry-pick
Branch: v5.0
https://github.com/mongodb/mongo/commit/5a750e26658c06386d16291e4cf55e05e3379dbe

Comment by Andrew Witten (Inactive) [ 12/Sep/22 ]

Code review:

BACKPORT-13391

https://github.com/10gen/mongo/pull/7283

Base branch: v5.0.

Comment by Max Hirschhorn [ 01/Sep/22 ]

Author:

{'name': 'Andrew Witten', 'email': 'andrew.witten@mongodb.com', 'username': 'awitten1'}

Message: SERVER-67650 persist additional recipient resharding metrics
Branch: v6.0
https://github.com/mongodb/mongo/commit/bbc0bb5e9878e48b6bc3b666affbdf102b379450
