[SERVER-59811] ReshardingOplogFetcher can see the postBatchResumeToken move backwards in time when switching sync sources, stalling the resharding operation Created: 07/Sep/21  Updated: 29/Oct/23  Resolved: 08/Sep/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 5.0.4, 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Max Hirschhorn
Resolution: Fixed Votes: 0
Labels: PM-234-M3, PM-234-T-oplog-fetch
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-53351 Add resharding fuzzer task with step-... Closed
Problem/Incident
is caused by SERVER-49897 Insert no-op entries into oplog buffe... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0
Sprint: Sharding 2021-09-20
Participants:
Story Points: 1

 Description   

The ReshardingOplogFetcher currently specifies {afterClusterTime: <_id.ts of highest entry>, level: "majority"} when it should instead specify {afterClusterTime: <_id.clusterTime of highest entry>, level: "majority"}. Consider the following sequence:

  1. The donor shard prepares and commits a cross-shard transaction which spans multiple oplog entries. Let's say they have timestamps {ts: 10} and {ts: 12}.
  2. The ReshardingOplogFetcher inserts the first oplog entry from the cross-shard transaction. It uses {_id: {clusterTime: 12, ts: 10}} for the first oplog entry.
  3. The ReshardingOplogFetcher is restarted. It specifies {afterClusterTime: 10, level: "majority"}.
  4. Suppose that the ReshardingOplogFetcher targets a stale secondary within the donor replica set shard. The afterClusterTime will only wait for the stale secondary to have fetched the prepare oplog entry.
  5. The ReshardingOplogFetcher receives an empty batch. (This is because resharding's oplog fetching pipeline only returns oplog entries from multi-statement transaction when they commit or abort.) The postBatchResumeToken value with be {ts: 10} because this is the highest entry in the secondary's oplog.
  6. The changes from SERVER-49897 lead the recipient to insert a synthetic no-op oplog entry into the oplog buffer collection with {_id: {clusterTime: 10, ts: 10}}. This violates the assumption that the oplog buffer collection is always written to in increasing order of _id.
  7. If the ReshardingOplogFetcher is restarted again, it'll specify {_id: {$gt: {_id: {clusterTime: 10, ts: 10}}}} and won't skip over the first oplog entry with {_id: {clusterTime: 12, ts: 10}}. It will attempt attempt to insert that oplog entry again and trigger a DuplicateKey error.

The ReshardingOplogFetcher will be perpetually stuck hitting DuplicateKey errors and will prevent the recipient shard from ever reaching the "strict-consistency" state. One solution is to have the ReshardingOplogFetcher specify {afterClusterTime: 12, level: "majority"} so that it could only ever see a postBatchResumeToken value of {ts: 12}.

257
if (_useReadConcern) {
258
    auto readConcernArgs = repl::ReadConcernArgs(
259
        boost::optional<LogicalTime>(_startAt.getTs()),
260
        boost::optional<repl::ReadConcernLevel>(repl::ReadConcernLevel::kMajorityReadConcern));
261
    aggRequest.setReadConcern(readConcernArgs.toBSONInner());
262
}

https://github.com/mongodb/mongo/blob/43479818bd01f27ee25b6e992045529d2ac0185a/src/mongo/db/s/resharding/resharding_oplog_fetcher.cpp#L259


[js_test:resharding_fuzzer-120e1-1630680862386-9] d20024| 2021-09-03T15:10:16.607+00:00 I  COMMAND  21581   [conn1] "Received replSetStepUp request"
[js_test:resharding_fuzzer-120e1-1630680862386-9] d20024| 2021-09-03T15:10:17.888+00:00 W  RESHARD  5127200 [OplogFetcher-985f9d35-1d7c-4735-9252-e1a426999574-shard0] "Error while fetching, retrying.","attr":{"error":{"code":11000,"codeName":"DuplicateKey","errmsg":"E11000 duplicate key error collection: config.localReshardingOplogBuffer.5163afeb-5991-43f6-a2e2-851827f35529.shard0 index: _id_ dup key: { _id: { clusterTime: Timestamp(1630681816, 10), ts: Timestamp(1630681816, 7) } }","keyPattern":{"_id":1},"keyValue":{"_id":{"clusterTime":{"$timestamp":{"t":1630681816,"i":10}},"ts":{"$timestamp":{"t":1630681816,"i":7}}}}}}
...
[js_test:resharding_fuzzer-120e1-1630680862386-9] d20024| 2021-09-03T15:20:39.030+00:00 W  RESHARD  5127200 [OplogFetcher-985f9d35-1d7c-4735-9252-e1a426999574-shard0] "Error while fetching, retrying.","attr":{"error":{"code":11000,"codeName":"DuplicateKey","errmsg":"E11000 duplicate key error collection: config.localReshardingOplogBuffer.5163afeb-5991-43f6-a2e2-851827f35529.shard0 index: _id_ dup key: { _id: { clusterTime: Timestamp(1630681816, 10), ts: Timestamp(1630681816, 7) } }","keyPattern":{"_id":1},"keyValue":{"_id":{"clusterTime":{"$timestamp":{"t":1630681816,"i":10}},"ts":{"$timestamp":{"t":1630681816,"i":7}}}}}}
[js_test:resharding_fuzzer-120e1-1630680862386-9] assert.soon failed, msg : Timed out waiting for failpoint to be hit. Failpoint: reshardingPauseCoordinatorBeforeDecisionPersisted The hang analyzer is automatically called in assert.soon functions. If you are *expecting* assert.soon to possibly fail, call assert.soon with {runHangAnalyzer: false} as the fifth argument (you can fill unused arguments with `undefined`). Running hang analyzer from assert.soon.

https://evergreen.mongodb.com/lobster/build/525eaf3cc1abdad83ac549b6da098c9d/test/61323aa854f2483d51452425#bookmarks=0%2C36209%2C37732%2C94544%2C95519&f~=000~%5C%5BResharding.%2AService&f~=100~d20024%5C%7C&f~=110~%28REPL_HB%7CELECTION%29&f~=100~DuplicateKey



 Comments   
Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 21/Sep/21 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-59811 Fix afterClusterTime in ReshardingOplogFetcher.

(cherry picked from commit c37cabcb8913e6ac5353763f80bf5ac7d724de65)
Branch: v5.0
https://github.com/mongodb/mongo/commit/54a7f116fdf8114246a735910ab96e24182def29

Comment by Githook User [ 07/Sep/21 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-59811 Fix afterClusterTime in ReshardingOplogFetcher.
Branch: master
https://github.com/mongodb/mongo/commit/c37cabcb8913e6ac5353763f80bf5ac7d724de65

Generated at Thu Feb 08 05:48:11 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.