Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 5.0.13, 6.0.2, 6.1.0-rc3, 6.2.0-rc0
Affects Version/s: None
Component/s: None
Labels:
- sharding-nyc-subteam2

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Backport Requested:
v6.1, v6.0, v5.0
Story Points:
3
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

The resharding coordinator asks every recipient shard for an estimate of the time remaining in the active resharding operation. It does so in CoordinatorCommitMonitor::queryRemainingOperationTimeForRecipients, using the _shardsvrReshardingOperationTime command.

The recipient shards handle that command here. The function ReshardingMetrics::getOperationRemainingTime can return boost::none. In that case, the "recipientMillis" field is omitted from the recipient's response to the coordinator.

If all participants were to omit this field, none of these if statements would be entered. When the participant then reads the max remaining time (here), remainingTimes.max would still be 0 and remainingTimes.min would still be Milliseconds::max, their initial values. In particular, the invariant mentioned here would fail.
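The aggregation described above can be sketched in standalone C++. This is a minimal illustration, not the server's code: RemainingTimes, aggregate and minNotAboveMax are hypothetical names, std::optional stands in for boost::none, and the invariant is assumed to have the form min <= max, which the ticket does not spell out.

```cpp
#include <cassert>
#include <chrono>
#include <optional>
#include <vector>

using Milliseconds = std::chrono::milliseconds;

// Hypothetical stand-in for the min/max accumulator described in the ticket.
struct RemainingTimes {
    Milliseconds min = Milliseconds::max();  // initial value, as in the ticket
    Milliseconds max = Milliseconds{0};      // initial value, as in the ticket
};

// Each entry models one recipient reply; std::nullopt means the field was omitted.
RemainingTimes aggregate(const std::vector<std::optional<Milliseconds>>& replies) {
    RemainingTimes times;
    for (const auto& reply : replies) {
        if (!reply) {
            continue;  // omitted field: neither bound is updated
        }
        assert(*reply >= Milliseconds{0});  // a remaining time is never negative
        if (*reply < times.min) {
            times.min = *reply;
        }
        if (*reply > times.max) {
            times.max = *reply;
        }
    }
    return times;
}

// Assumed shape of the invariant: the smallest estimate cannot exceed the largest.
bool minNotAboveMax(const RemainingTimes& times) {
    return times.min <= times.max;
}
```

When every reply is std::nullopt, aggregate returns the untouched initial values, so minNotAboveMax is false. With at least one populated reply, both bounds are set and the check holds.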

The effect is that a recipient omitting the "remainingMillis" field is equivalent to it returning "remainingMillis: 0". This is a bug in at least one case: when _shardsvrReshardingOperationTime is run against a recipient shard before that shard has restored its metrics (during a step-up).

As a result, the coordinator would believe that recipientMillis was under the threshold for all recipients and begin the critical section prematurely. The resharding operation would then fail with ReshardingCriticalSectionTimeout if the recipient above did not manage to enter the "strict-consistency" state within the timeout.
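The premature decision can be illustrated with a small sketch, assuming the coordinator compares the largest reported remaining time against a threshold. The function name shouldEnterCriticalSection and the threshold values used with it are hypothetical, and std::optional again stands in for an omitted field.

```cpp
#include <cassert>
#include <chrono>
#include <optional>
#include <vector>

using Milliseconds = std::chrono::milliseconds;

// Hypothetical decision rule: enter the critical section once the largest
// remaining time reported by any recipient is at or below the threshold.
bool shouldEnterCriticalSection(const std::vector<std::optional<Milliseconds>>& replies,
                                Milliseconds threshold) {
    assert(threshold >= Milliseconds{0});
    Milliseconds maxRemaining{0};  // an omitted reply leaves this at 0
    for (const auto& reply : replies) {
        if (reply && *reply > maxRemaining) {
            maxRemaining = *reply;
        }
    }
    return maxRemaining <= threshold;
}
```

Under these assumptions, a recipient that has not yet restored its metrics contributes nothing to maxRemaining, so the rule can return true even though that recipient may still have a long way to go.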

is related to

SERVER-67650 Resharding recipient can return remainingOperationTimeEstimatedSecs=0 when the oplog applier hasn't caught up with the oplog fetcher

Closed

SERVER-68783 Recipient shard may incorrectly return 0 milliseconds remaining in resharding

Closed

Assignee: Andrew Witten (Inactive)
Reporter: Cheahuychou Mao
Participants: Andrew Witten, Cheahuychou Mao, Githook User, Max Hirschhorn
Votes: 0
Watchers: 4

Created: Jun 29 2022 06:01:11 PM UTC
Updated: Oct 29 2023 09:36:13 PM UTC
Resolved: Aug 24 2022 03:54:08 PM UTC
Confidence Status Last Update: 11/Aug/22 3:42 PM
