Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 5.0.4, 5.1.0-rc0
Affects Version/s: None
Component/s: Sharding
Labels:
- PM-234-M3
- PM-234-T-oplog-fetch

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Backport Requested:

v5.0
Sprint:
Sharding 2021-09-20
Story Points:
1
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

The ReshardingOplogFetcher currently specifies {afterClusterTime: <_id.ts of highest entry>, level: "majority"} when it should instead specify {afterClusterTime: <_id.clusterTime of highest entry>, level: "majority"}. Consider the following sequence:

The donor shard prepares and commits a cross-shard transaction which spans multiple oplog entries. Let's say they have timestamps {ts: 10} and {ts: 12}.
The ReshardingOplogFetcher inserts the first oplog entry from the cross-shard transaction. It uses {_id: {clusterTime: 12, ts: 10}} for the first oplog entry.
The ReshardingOplogFetcher is restarted. It specifies {afterClusterTime: 10, level: "majority"}.
Suppose that the ReshardingOplogFetcher targets a stale secondary within the donor replica set shard. The afterClusterTime will only wait for the stale secondary to have fetched the prepare oplog entry.
The ReshardingOplogFetcher receives an empty batch. (This is because resharding's oplog fetching pipeline only returns oplog entries from multi-statement transaction when they commit or abort.) The postBatchResumeToken value with be {ts: 10} because this is the highest entry in the secondary's oplog.
The changes from ~~SERVER-49897~~ lead the recipient to insert a synthetic no-op oplog entry into the oplog buffer collection with {_id: {clusterTime: 10, ts: 10}}. This violates the assumption that the oplog buffer collection is always written to in increasing order of _id.
If the ReshardingOplogFetcher is restarted again, it'll specify {_id: {$gt: {_id: {clusterTime: 10, ts: 10}}}} and won't skip over the first oplog entry with {_id: {clusterTime: 12, ts: 10}}. It will attempt attempt to insert that oplog entry again and trigger a DuplicateKey error.

The ReshardingOplogFetcher will be perpetually stuck hitting DuplicateKey errors and will prevent the recipient shard from ever reaching the "strict-consistency" state. One solution is to have the ReshardingOplogFetcher specify {afterClusterTime: 12, level: "majority"} so that it could only ever see a postBatchResumeToken value of {ts: 12}.

if (_useReadConcern) {
    auto readConcernArgs = repl::ReadConcernArgs(
        boost::optional<LogicalTime>(_startAt.getTs()),
        boost::optional<repl::ReadConcernLevel>(repl::ReadConcernLevel::kMajorityReadConcern));
    aggRequest.setReadConcern(readConcernArgs.toBSONInner());
}

https://github.com/mongodb/mongo/blob/43479818bd01f27ee25b6e992045529d2ac0185a/src/mongo/db/s/resharding/resharding_oplog_fetcher.cpp#L259

[js_test:resharding_fuzzer-120e1-1630680862386-9] d20024| 2021-09-03T15:10:16.607+00:00 I  COMMAND  21581   [conn1] "Received replSetStepUp request"
[js_test:resharding_fuzzer-120e1-1630680862386-9] d20024| 2021-09-03T15:10:17.888+00:00 W  RESHARD  5127200 [OplogFetcher-985f9d35-1d7c-4735-9252-e1a426999574-shard0] "Error while fetching, retrying.","attr":{"error":{"code":11000,"codeName":"DuplicateKey","errmsg":"E11000 duplicate key error collection: config.localReshardingOplogBuffer.5163afeb-5991-43f6-a2e2-851827f35529.shard0 index: _id_ dup key: { _id: { clusterTime: Timestamp(1630681816, 10), ts: Timestamp(1630681816, 7) } }","keyPattern":{"_id":1},"keyValue":{"_id":{"clusterTime":{"$timestamp":{"t":1630681816,"i":10}},"ts":{"$timestamp":{"t":1630681816,"i":7}}}}}}
...
[js_test:resharding_fuzzer-120e1-1630680862386-9] d20024| 2021-09-03T15:20:39.030+00:00 W  RESHARD  5127200 [OplogFetcher-985f9d35-1d7c-4735-9252-e1a426999574-shard0] "Error while fetching, retrying.","attr":{"error":{"code":11000,"codeName":"DuplicateKey","errmsg":"E11000 duplicate key error collection: config.localReshardingOplogBuffer.5163afeb-5991-43f6-a2e2-851827f35529.shard0 index: _id_ dup key: { _id: { clusterTime: Timestamp(1630681816, 10), ts: Timestamp(1630681816, 7) } }","keyPattern":{"_id":1},"keyValue":{"_id":{"clusterTime":{"$timestamp":{"t":1630681816,"i":10}},"ts":{"$timestamp":{"t":1630681816,"i":7}}}}}}
[js_test:resharding_fuzzer-120e1-1630680862386-9] assert.soon failed, msg : Timed out waiting for failpoint to be hit. Failpoint: reshardingPauseCoordinatorBeforeDecisionPersisted The hang analyzer is automatically called in assert.soon functions. If you are *expecting* assert.soon to possibly fail, call assert.soon with {runHangAnalyzer: false} as the fifth argument (you can fill unused arguments with `undefined`). Running hang analyzer from assert.soon.

https://evergreen.mongodb.com/lobster/build/525eaf3cc1abdad83ac549b6da098c9d/test/61323aa854f2483d51452425#bookmarks=0%2C36209%2C37732%2C94544%2C95519&f~=000~%5C%5BResharding.%2AService&f~=100~d20024%5C%7C&f~=110~%28REPL_HB%7CELECTION%29&f~=100~DuplicateKey

is caused by

SERVER-49897 Insert no-op entries into oplog buffer collections for resharding so resuming is less wasteful

Closed

is depended on by

SERVER-53351 Add resharding fuzzer task with step-ups enabled for shards

Closed

Assignee:: Max Hirschhorn
Reporter:: Max Hirschhorn
Participants:: Githook User, Max Hirschhorn, Vivian Ge
Votes:: 0 Vote for this issue
Watchers:: 2 Start watching this issue

Created:: Sep 07 2021 01:30:17 PM UTC
Updated:: Oct 29 2023 09:48:50 PM UTC
Resolved:: Sep 08 2021 12:53:08 PM UTC
Confidence Status Last Update:: 07/Sep/21 2:30 PM

Details

Description

Attachments

Issue Links

Forms

Activity

People

Dates