Core Server / SERVER-59811

ReshardingOplogFetcher can see the postBatchResumeToken move backwards in time when switching sync sources, stalling the resharding operation

    • Fully Compatible
    • ALL
    • v5.0
    • Sharding 2021-09-20
    • 1

      The ReshardingOplogFetcher currently specifies {afterClusterTime: <_id.ts of highest entry>, level: "majority"} when it should instead specify {afterClusterTime: <_id.clusterTime of highest entry>, level: "majority"}. Consider the following sequence:

      1. The donor shard prepares and commits a cross-shard transaction which spans multiple oplog entries. Let's say they have timestamps {ts: 10} and {ts: 12}.
      2. The ReshardingOplogFetcher inserts the first oplog entry from the cross-shard transaction. It uses {_id: {clusterTime: 12, ts: 10}} for the first oplog entry.
      3. The ReshardingOplogFetcher is restarted. It specifies {afterClusterTime: 10, level: "majority"}.
      4. Suppose that the ReshardingOplogFetcher targets a stale secondary within the donor replica set shard. The afterClusterTime will only wait for the stale secondary to have fetched the prepare oplog entry.
      5. The ReshardingOplogFetcher receives an empty batch. (This is because resharding's oplog fetching pipeline only returns oplog entries from multi-statement transactions when they commit or abort.) The postBatchResumeToken value will be {ts: 10} because this is the highest entry in the stale secondary's oplog.
      6. The changes from SERVER-49897 lead the recipient to insert a synthetic no-op oplog entry into the oplog buffer collection with {_id: {clusterTime: 10, ts: 10}}. This violates the assumption that the oplog buffer collection is always written to in increasing order of _id.
      7. If the ReshardingOplogFetcher is restarted again, it'll specify {_id: {$gt: {clusterTime: 10, ts: 10}}} and won't skip over the first oplog entry with {_id: {clusterTime: 12, ts: 10}}. It will attempt to insert that oplog entry again and trigger a DuplicateKey error (see the sketch after this list).
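
      Concretely, the two _id values involved in steps 6 and 7 compare as sketched below. This is a hypothetical unit-test-style snippet using the server's BSON helpers, with the illustrative timestamps 10 and 12 from above rather than real data:

      #include "mongo/bson/bsonmisc.h"
      #include "mongo/bson/bsonobjbuilder.h"
      #include "mongo/bson/timestamp.h"
      #include "mongo/unittest/unittest.h"

      namespace mongo {

      // Hypothetical test name; not an existing test in the repository.
      TEST(ReshardingOplogFetcherIdOrder, SyntheticNoopSortsBeforeInsertedEntry) {
          // _id written for the first oplog entry of the transaction (step 2).
          BSONObj alreadyInserted =
              BSON("clusterTime" << Timestamp(0, 12) << "ts" << Timestamp(0, 10));
          // _id of the synthetic no-op derived from the stale postBatchResumeToken (step 6).
          BSONObj syntheticNoop =
              BSON("clusterTime" << Timestamp(0, 10) << "ts" << Timestamp(0, 10));

          // The no-op sorts before the entry that was already inserted, violating the
          // oplog buffer collection's increasing-_id invariant ...
          ASSERT_LT(syntheticNoop.woCompare(alreadyInserted), 0);
          // ... so resuming with {_id: {$gt: <syntheticNoop>}} still matches
          // alreadyInserted, and re-inserting it raises the DuplicateKey error (step 7).
      }

      }  // namespace mongo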

      The ReshardingOplogFetcher will be perpetually stuck hitting DuplicateKey errors and will prevent the recipient shard from ever reaching the "strict-consistency" state. One solution is to have the ReshardingOplogFetcher specify {afterClusterTime: 12, level: "majority"} so that it would only ever see a postBatchResumeToken value of {ts: 12}.

      if (_useReadConcern) {
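          // Per this ticket, _startAt.getTs() below is the _id.ts of the highest fetched
          // entry; the proposal is to wait on the _id.clusterTime instead.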
          auto readConcernArgs = repl::ReadConcernArgs(
              boost::optional<LogicalTime>(_startAt.getTs()),
              boost::optional<repl::ReadConcernLevel>(repl::ReadConcernLevel::kMajorityReadConcern));
          aggRequest.setReadConcern(readConcernArgs.toBSONInner());
      }
      

      https://github.com/mongodb/mongo/blob/43479818bd01f27ee25b6e992045529d2ac0185a/src/mongo/db/s/resharding/resharding_oplog_fetcher.cpp#L259
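
      A minimal sketch of that solution, assuming the _startAt value (a ReshardingDonorOplogId) exposes a getClusterTime() accessor alongside getTs(); the accessor name is an assumption, not taken from an actual patch:

      if (_useReadConcern) {
          // Wait on the _id.clusterTime of the highest fetched entry instead of its
          // _id.ts, so a lagged secondary cannot report a postBatchResumeToken that
          // sorts before an entry already copied into the oplog buffer collection.
          auto readConcernArgs = repl::ReadConcernArgs(
              boost::optional<LogicalTime>(_startAt.getClusterTime()),  // assumed accessor
              boost::optional<repl::ReadConcernLevel>(repl::ReadConcernLevel::kMajorityReadConcern));
          aggRequest.setReadConcern(readConcernArgs.toBSONInner());
      }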


      [js_test:resharding_fuzzer-120e1-1630680862386-9] d20024| 2021-09-03T15:10:16.607+00:00 I  COMMAND  21581   [conn1] "Received replSetStepUp request"
      [js_test:resharding_fuzzer-120e1-1630680862386-9] d20024| 2021-09-03T15:10:17.888+00:00 W  RESHARD  5127200 [OplogFetcher-985f9d35-1d7c-4735-9252-e1a426999574-shard0] "Error while fetching, retrying.","attr":{"error":{"code":11000,"codeName":"DuplicateKey","errmsg":"E11000 duplicate key error collection: config.localReshardingOplogBuffer.5163afeb-5991-43f6-a2e2-851827f35529.shard0 index: _id_ dup key: { _id: { clusterTime: Timestamp(1630681816, 10), ts: Timestamp(1630681816, 7) } }","keyPattern":{"_id":1},"keyValue":{"_id":{"clusterTime":{"$timestamp":{"t":1630681816,"i":10}},"ts":{"$timestamp":{"t":1630681816,"i":7}}}}}}
      ...
      [js_test:resharding_fuzzer-120e1-1630680862386-9] d20024| 2021-09-03T15:20:39.030+00:00 W  RESHARD  5127200 [OplogFetcher-985f9d35-1d7c-4735-9252-e1a426999574-shard0] "Error while fetching, retrying.","attr":{"error":{"code":11000,"codeName":"DuplicateKey","errmsg":"E11000 duplicate key error collection: config.localReshardingOplogBuffer.5163afeb-5991-43f6-a2e2-851827f35529.shard0 index: _id_ dup key: { _id: { clusterTime: Timestamp(1630681816, 10), ts: Timestamp(1630681816, 7) } }","keyPattern":{"_id":1},"keyValue":{"_id":{"clusterTime":{"$timestamp":{"t":1630681816,"i":10}},"ts":{"$timestamp":{"t":1630681816,"i":7}}}}}}
      [js_test:resharding_fuzzer-120e1-1630680862386-9] assert.soon failed, msg : Timed out waiting for failpoint to be hit. Failpoint: reshardingPauseCoordinatorBeforeDecisionPersisted The hang analyzer is automatically called in assert.soon functions. If you are *expecting* assert.soon to possibly fail, call assert.soon with {runHangAnalyzer: false} as the fifth argument (you can fill unused arguments with `undefined`). Running hang analyzer from assert.soon.
      

      https://evergreen.mongodb.com/lobster/build/525eaf3cc1abdad83ac549b6da098c9d/test/61323aa854f2483d51452425#bookmarks=0%2C36209%2C37732%2C94544%2C95519&f~=000~%5C%5BResharding.%2AService&f~=100~d20024%5C%7C&f~=110~%28REPL_HB%7CELECTION%29&f~=100~DuplicateKey

            Assignee: Max Hirschhorn (max.hirschhorn@mongodb.com)
            Reporter: Max Hirschhorn (max.hirschhorn@mongodb.com)