Core Server / SERVER-60774

Resharding may apply through reshardFinalOp without transitioning to strict consistency, stalling write operations on the collection being resharded until the critical section times out

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 5.2.0, 5.0.4, 5.1.0-rc2
    • Affects Version/s: 5.1.0-rc0
    • Component/s: Sharding
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Backport Requested: v5.1, v5.0
    • Sprint: Sharding 2021-11-01
    • 1

      Typically the ReshardingOplogFetcher stops fetching new donor oplog entries after it has retrieved the reshardFinalOp entry. However, upon resuming after a primary failover, the ReshardingOplogFetcher won't realize it has already fetched the reshardFinalOp entry, so it continues to retrieve empty batches. (The donor shard won't write any new oplog entries destined for the recipient shard after the reshardFinalOp.) This leads the ReshardingOplogFetcher to insert no-op reshardProgressMark entries after the reshardFinalOp into the recipient shard's local oplog buffer collection.

      The ReshardingDonorOplogIterator assumes the reshardFinalOp will be the last entry in a batch. The presence of these extra no-op reshardProgressMark entries after the reshardFinalOp prevents it and the ReshardingOplogApplier from realizing the reshardFinalOp has been applied. The recipient shard therefore never reaches the "strict-consistency" state, and the overall resharding operation fails with a ReshardingCriticalSectionTimeout error response after writes to the collection being resharded have been blocked for reshardingCriticalSectionTimeoutMillis (5 seconds by default).

      if (!batch.empty()) {
          const auto& lastEntryInBatch = batch.back();
          _resumeToken = getId(lastEntryInBatch);

          if (isFinalOplog(lastEntryInBatch)) {
              _hasSeenFinalOplogEntry = true;
              // Skip returning the final oplog entry because it is known to be a no-op.
              batch.pop_back();
          }
      }

            Assignee: Max Hirschhorn (max.hirschhorn@mongodb.com)
            Reporter: Max Hirschhorn (max.hirschhorn@mongodb.com)