Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 5.0.13, 6.0.2
Affects Version/s: 5.3.0, 5.0.0, 6.0.0-rc12
Component/s: None
Labels: sharding-nyc-subteam2
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v5.0
Story Points: 3
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

If a recipient receives a _shardsvrReshardingOperationTime command right after it has transitioned to the "applying" state (i.e. oplogEntriesApplied = 0), 'remainingOperationTimeEstimatedSecs' is calculated from 'bytesCopied', 'bytesToCopy', and the elapsed time of the "cloning" state.
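For illustration, here is a minimal sketch of such an estimate. The exact formula the server uses is not stated in this ticket; the version below assumes a linear extrapolation of the cloning time and further assumes that applying takes about as long as cloning did. The function name and parameters are hypothetical.

```python
def estimate_remaining_secs(bytes_copied, bytes_to_copy, cloning_elapsed_secs):
    """Rough remaining-time estimate for a recipient with oplogEntriesApplied = 0.

    ASSUMPTIONS (not taken from the server code):
      * cloning time scales linearly with bytes copied;
      * applying will take about as long as the whole cloning phase.
    """
    if bytes_copied <= 0 or bytes_to_copy <= 0:
        return None  # no basis for an estimate yet

    # Projected duration of the full cloning phase.
    projected_cloning_secs = cloning_elapsed_secs * bytes_to_copy / bytes_copied
    # Remaining cloning time plus an equally long applying phase.
    return 2 * projected_cloning_secs - cloning_elapsed_secs


# Cloning finished after 600s: roughly 600s of applying is still expected.
print(estimate_remaining_secs(100, 100, 600.0))
# Elapsed cloning time of 0: the estimate collapses to 0 regardless of the byte counts.
print(estimate_remaining_secs(100, 100, 0.0))
```

Under any formula of this shape, the estimate is proportional to the elapsed time of the "cloning" state, so an elapsed time of 0 produces an estimate of 0.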

It turns out that the start time of the "cloning" state only gets initialized when a recipient transitions from the "create-collection" state to the "cloning" state. If the ReshardingRecipientService instance is created while the recipient is already in the "cloning" state (i.e. on restart or stepup), we skip the state transition, so the start time is left uninitialized. Consequently, 'remainingOperationTimeEstimatedSecs' would be 0, since the elapsed time for the "cloning" state would be 0. The underlying issue is that there is no mechanism for persisting the start time and recovering it on stepup. Returning remainingOperationTimeEstimatedSecs=0 would cause the coordinator to think that it can start the critical section, and the resharding operation would then fail with ReshardingCriticalSectionTimeout if the recipient doesn't manage to enter the "strict-consistency" state within the timeout.
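To show why a spurious 0 is harmful, here is a hedged sketch of a coordinator-side check. The ticket does not describe how the coordinator combines recipient estimates; this sketch assumes it waits until the largest estimate across recipients falls under a threshold. The function name and `threshold_secs` parameter are hypothetical.

```python
def can_start_critical_section(recipient_estimates_secs, threshold_secs):
    """Decide whether to start the critical section.

    ASSUMPTION: the coordinator looks at the slowest recipient and proceeds
    once that recipient's estimate is at or below a threshold.
    """
    if not recipient_estimates_secs:
        return False
    if any(estimate is None for estimate in recipient_estimates_secs):
        return False  # at least one recipient has no estimate yet
    return max(recipient_estimates_secs) <= threshold_secs


# A recipient that still needs ~600s blocks the critical section...
print(can_start_critical_section([0.2, 600.0], threshold_secs=0.5))
# ...but if it wrongly reports 0 after a stepup, the coordinator proceeds.
print(can_start_critical_section([0.2, 0.0], threshold_secs=0.5))
```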

The same bug exists for the start time for the "applying" state.
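The gap affects both start times in the same way, which the following sketch tries to make concrete. It is not the server's implementation: the class, its methods, and the idea of passing in previously persisted start times are all hypothetical, and only illustrate the difference between start times that exist solely in memory and start times that can be recovered on restart or stepup.

```python
import time


class RecipientMetrics:
    """Hypothetical per-recipient metrics holding per-state start times."""

    def __init__(self, state, persisted_start_times=None):
        self.state = state
        # ASSUMPTION: start times recovered from a persisted state document.
        # Without them, an instance created mid-state has nothing to restore.
        self.start_times = dict(persisted_start_times or {})

    def on_transition(self, new_state, now=None):
        # The start time is only recorded when a transition actually happens.
        self.state = new_state
        self.start_times[new_state] = time.time() if now is None else now

    def elapsed_secs(self, state, now):
        start = self.start_times.get(state)
        return 0.0 if start is None else now - start


# Normal path: the transition into "cloning" records the start time.
normal = RecipientMetrics("create-collection")
normal.on_transition("cloning", now=1000.0)
print(normal.elapsed_secs("cloning", now=1600.0))      # 600.0

# Stepup while already "cloning": no transition, so no start time.
restarted = RecipientMetrics("cloning")
print(restarted.elapsed_secs("cloning", now=1600.0))   # 0.0

# With a persisted start time, the elapsed time survives the stepup.
recovered = RecipientMetrics("cloning", {"cloning": 1000.0})
print(recovered.elapsed_secs("cloning", now=1600.0))   # 600.0
```

The same reasoning applies with "applying" in place of "cloning".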

related to
- SERVER-67653 Resharding coordinator can incorrectly conclude that it can start the critical section although on one recipient the oplog applier hasn't caught up with the oplog fetcher (Closed)
- SERVER-92978 Make sure resharding recipients can restore 'approxDocumentsToCopy' and 'approxBytesToCopy' metrics after failover (Closed)

Assignee: Andrew Witten (Inactive)
Reporter: Cheahuychou Mao
Participants: Andrew Witten, Cheahuychou Mao, Githook User, Max Hirschhorn
Votes: 0
Watchers: 5

Created: Jun 29 2022 03:44:16 PM UTC
Updated: Aug 01 2024 02:43:54 PM UTC
Resolved: Aug 11 2022 02:39:25 PM UTC
Confidence Status Last Update: 26/Jul/22 5:09 PM
