Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 8.1.0-rc0, 8.0.4, 7.0.16, 6.0.20
Affects Version/s: 5.0.0, 6.0.0, 7.0.0, 7.2.0
Component/s: Sharding
Labels:

Assigned Teams: Cluster Scalability
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v8.0, v7.0, v6.0, v5.0
Sprint: Cluster Scalability 2024-5-13, Cluster Scalability 2024-5-27, Cluster Scalability 2024-6-10, Cluster Scalability 06/24/24, Cluster Scalability 2024-09-02, Cluster Scalability 2024-10-14, Cluster Scalability 2024-10-28, Cluster Scalability 2024-11-11
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

The resharding commit monitor uses the following algorithm to estimate the remaining time for a resharding operation to complete:

Milliseconds remainingTime(Milliseconds elapsedTime, double elapsedWork, double totalWork) {
    elapsedWork = std::min(elapsedWork, totalWork);
    double remainingMsec = 1.0 * elapsedTime.count() * (totalWork / elapsedWork - 1);
    return Milliseconds(Milliseconds::rep(remainingMsec));
}

If elapsedTime is on the order of a few milliseconds, remainingMsec can be reported incorrectly. For example, in HELP-54235, ~300k fetched oplog entries (totalWork), 1000 applied oplog entries (elapsedWork), and an elapsedTime of 6 ms produce an estimate low enough to engage the critical section (CS):

remainingMsec = 1.0 * 6 * (300000 / 1000 - 1) = 6 * 299 = 1794 ms ≈ 1.8 seconds < 2 seconds.
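
The arithmetic above can be reproduced with a small standalone sketch. It assumes `std::chrono::milliseconds` in place of the server's own `Milliseconds` type; `helpCaseEstimate` and `belowThreshold` are hypothetical helpers added only for illustration, and the 2-second threshold is taken from the comparison above.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Assumption: std::chrono stands in for the server's Milliseconds duration type.
using Milliseconds = std::chrono::milliseconds;

// Same linear extrapolation as the function quoted in this ticket.
Milliseconds remainingTime(Milliseconds elapsedTime, double elapsedWork, double totalWork) {
    elapsedWork = std::min(elapsedWork, totalWork);
    double remainingMsec = 1.0 * elapsedTime.count() * (totalWork / elapsedWork - 1);
    return Milliseconds(Milliseconds::rep(remainingMsec));
}

// Hypothetical helper: the figures from the example
// (6 ms elapsed, 1000 of ~300k oplog entries applied).
Milliseconds helpCaseEstimate() {
    return remainingTime(Milliseconds(6), 1000.0, 300000.0);
}

// Hypothetical helper: true when an estimate falls under the 2-second threshold.
bool belowThreshold(Milliseconds estimate) {
    return estimate < std::chrono::seconds(2);
}
```

With these inputs, `helpCaseEstimate()` yields 1794 ms and `belowThreshold` returns true, even though only 1000 of roughly 300k entries have been applied. A 6 ms sample is simply too short to extrapolate from.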

This algorithm needs to change to handle this edge case.

Note: The reshardingDelayBeforeRemainingOperationTimeQueryMillis parameter was introduced and backported to the releases attached to this ticket.
In 8.0.4, it defaults to 30 seconds.
In all other branches, it defaults to 0 seconds. Its default value will be changed to 30 seconds in the backports of SERVER-95311.
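
As a rough illustration of why a delay helps, the sketch below withholds the estimate until a minimum amount of time has elapsed. This is not the server's implementation: `remainingTimeAfterDelay` and its `minDelay` argument are hypothetical names, `std::chrono::milliseconds` stands in for the server's duration type, and how the real parameter is wired into the commit monitor is not shown in this ticket.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <optional>

// Assumption: std::chrono stands in for the server's Milliseconds duration type.
using Milliseconds = std::chrono::milliseconds;

// Hypothetical helper, not the server's code: return no estimate until at least
// `minDelay` has elapsed, so a few milliseconds of progress are never extrapolated.
std::optional<Milliseconds> remainingTimeAfterDelay(Milliseconds elapsedTime,
                                                    double elapsedWork,
                                                    double totalWork,
                                                    Milliseconds minDelay) {
    if (elapsedTime < minDelay || elapsedWork <= 0.0) {
        // Sample too short, or no work done yet: extrapolating would be unreliable.
        return std::nullopt;
    }
    elapsedWork = std::min(elapsedWork, totalWork);
    double remainingMsec = 1.0 * elapsedTime.count() * (totalWork / elapsedWork - 1);
    return Milliseconds(Milliseconds::rep(remainingMsec));
}
```

With a 30-second `minDelay`, the 6 ms sample from the example produces no estimate at all, while a 60-second sample with half of the work done extrapolates to another 60 seconds.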

is related to

SERVER-94141 Sporadic ReshardingCriticalSectionTimeout error

Closed

related to

SERVER-95019 getElapsed in getRecipientHighEstimateRemainingTimeMillis can incorrectly cast < 1s elapsed durations to 0.

Closed

SERVER-92933 Consider making resharding recipients use exponential moving average oplog application rate to estimate 'remainingTimeMillis'

Closed

There are no Sub-Tasks for this issue.

Assignee: Ben Gawel (Inactive)
Reporter: Abdul Qadeer
Participants: Abdul Qadeer, Ben Gawel, Githook User
Votes: 0
Watchers: 14

Created: Jan 11 2024 05:38:51 PM UTC
Updated: Feb 28 2025 04:52:46 PM UTC
Resolved: Nov 01 2024 05:19:25 PM UTC
