  Core Server / SERVER-59811

ReshardingOplogFetcher can see the postBatchResumeToken move backwards in time when switching sync sources, stalling the resharding operation


Details

    • Fully Compatible
    • ALL
    • v5.0
    • Sharding 2021-09-20
    • 1

    Description

      The ReshardingOplogFetcher currently specifies {afterClusterTime: <_id.ts of highest entry>, level: "majority"} when it should instead specify {afterClusterTime: <_id.clusterTime of highest entry>, level: "majority"}. Consider the following sequence:

      1. The donor shard prepares and commits a cross-shard transaction which spans multiple oplog entries. Let's say they have timestamps {ts: 10} and {ts: 12}.
      2. The ReshardingOplogFetcher inserts the first oplog entry from the cross-shard transaction. It uses {_id: {clusterTime: 12, ts: 10}} for the first oplog entry.
      3. The ReshardingOplogFetcher is restarted. It specifies {afterClusterTime: 10, level: "majority"}.
      4. Suppose that the ReshardingOplogFetcher targets a stale secondary within the donor replica set shard. The afterClusterTime will only wait for the stale secondary to have fetched the prepare oplog entry.
      5. The ReshardingOplogFetcher receives an empty batch. (This is because resharding's oplog fetching pipeline only returns oplog entries from multi-statement transactions when they commit or abort.) The postBatchResumeToken value will be {ts: 10} because this is the highest entry in the stale secondary's oplog.
      6. The changes from SERVER-49897 lead the recipient to insert a synthetic no-op oplog entry into the oplog buffer collection with {_id: {clusterTime: 10, ts: 10}}. This violates the assumption that the oplog buffer collection is always written to in increasing order of _id.
      7. If the ReshardingOplogFetcher is restarted again, it'll specify {_id: {$gt: {clusterTime: 10, ts: 10}}} and won't skip over the first oplog entry with {_id: {clusterTime: 12, ts: 10}}. It will attempt to insert that oplog entry again and trigger a DuplicateKey error (see the sketch after this list).
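
      To make the ordering violation in steps 6 and 7 concrete, here is a small, self-contained C++ sketch (not MongoDB code) where plain integers stand in for the Timestamp values 10 and 12 above and a std::set stands in for the oplog buffer collection's unique _id index:

          #include <iostream>
          #include <set>
          #include <utility>

          int main() {
              // _id documents compare field by field: clusterTime first, then ts.
              using OplogBufferId = std::pair<int, int>;  // {clusterTime, ts}
              std::set<OplogBufferId> bufferIds;

              // Step 2: the real oplog entry from the cross-shard transaction.
              bufferIds.insert({12, 10});

              // Step 6: the synthetic no-op derived from the stale postBatchResumeToken.
              // It sorts *before* the entry already in the buffer, violating the
              // assumption that the buffer is written in increasing order of _id.
              OplogBufferId noopId{10, 10};
              bufferIds.insert(noopId);
              std::cout << std::boolalpha
                        << "no-op sorts before existing entry: "
                        << (noopId < OplogBufferId{12, 10}) << "\n";  // true

              // Step 7: resuming with {_id: {$gt: {clusterTime: 10, ts: 10}}} re-matches
              // the donor's {clusterTime: 12, ts: 10} entry, so the fetcher inserts it a
              // second time and the unique index rejects it: a DuplicateKey error.
              bool accepted = bufferIds.insert({12, 10}).second;
              std::cout << "second insert accepted: " << accepted << "\n";  // false
              return 0;
          }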

      The ReshardingOplogFetcher will be perpetually stuck hitting DuplicateKey errors and will prevent the recipient shard from ever reaching the "strict-consistency" state. One solution is to have the ReshardingOplogFetcher specify {afterClusterTime: 12, level: "majority"} so that it could only ever see a postBatchResumeToken value of {ts: 12}.
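
      A rough way to see why waiting on the commit entry's clusterTime closes the gap, as a toy model rather than MongoDB code: a sync source can only serve an {afterClusterTime: T, level: "majority"} read once it has replicated through T, and the postBatchResumeToken it reports for an empty batch reflects how far its oplog has advanced.

          #include <algorithm>
          #include <iostream>

          // Toy model of a sync source; majorityCommittedTs is the highest
          // majority-committed oplog timestamp it currently has.
          struct SyncSource {
              int majorityCommittedTs;

              // Serving an afterClusterTime read forces the node to wait until it has
              // replicated through afterClusterTime, so the postBatchResumeToken an
              // empty batch reports can never be smaller than that value.
              int postBatchResumeTokenFor(int afterClusterTime) const {
                  return std::max(majorityCommittedTs, afterClusterTime);
              }
          };

          int main() {
              SyncSource staleSecondary{10};
              // Current behaviour: afterClusterTime is the _id.ts (10) of the last
              // fetched entry, so the stale node answers right away and the resume
              // token regresses to 10.
              std::cout << staleSecondary.postBatchResumeTokenFor(10) << "\n";  // 10
              // Proposed: afterClusterTime is the _id.clusterTime (12), so the node
              // cannot answer until it holds {ts: 12} and the token is at least 12.
              std::cout << staleSecondary.postBatchResumeTokenFor(12) << "\n";  // 12
              return 0;
          }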

      257    if (_useReadConcern) {
      258        auto readConcernArgs = repl::ReadConcernArgs(
      259            boost::optional<LogicalTime>(_startAt.getTs()),
      260            boost::optional<repl::ReadConcernLevel>(repl::ReadConcernLevel::kMajorityReadConcern));
      261        aggRequest.setReadConcern(readConcernArgs.toBSONInner());
      262    }
      

      https://github.com/mongodb/mongo/blob/43479818bd01f27ee25b6e992045529d2ac0185a/src/mongo/db/s/resharding/resharding_oplog_fetcher.cpp#L259
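
      A minimal sketch of the suggested change at the call site above, assuming the ReshardingDonorOplogId held in _startAt exposes a getClusterTime() accessor for the _id.clusterTime field alongside the getTs() used today (the accessor name is an assumption, not something confirmed by this ticket):

          if (_useReadConcern) {
              // Assumed fix sketch: wait on the _id.clusterTime of the highest fetched
              // entry rather than its _id.ts, so a lagged sync source cannot return a
              // postBatchResumeToken that sorts below entries already written to the
              // oplog buffer collection.
              auto readConcernArgs = repl::ReadConcernArgs(
                  boost::optional<LogicalTime>(_startAt.getClusterTime()),
                  boost::optional<repl::ReadConcernLevel>(repl::ReadConcernLevel::kMajorityReadConcern));
              aggRequest.setReadConcern(readConcernArgs.toBSONInner());
          }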


      [js_test:resharding_fuzzer-120e1-1630680862386-9] d20024| 2021-09-03T15:10:16.607+00:00 I  COMMAND  21581   [conn1] "Received replSetStepUp request"
      [js_test:resharding_fuzzer-120e1-1630680862386-9] d20024| 2021-09-03T15:10:17.888+00:00 W  RESHARD  5127200 [OplogFetcher-985f9d35-1d7c-4735-9252-e1a426999574-shard0] "Error while fetching, retrying.","attr":{"error":{"code":11000,"codeName":"DuplicateKey","errmsg":"E11000 duplicate key error collection: config.localReshardingOplogBuffer.5163afeb-5991-43f6-a2e2-851827f35529.shard0 index: _id_ dup key: { _id: { clusterTime: Timestamp(1630681816, 10), ts: Timestamp(1630681816, 7) } }","keyPattern":{"_id":1},"keyValue":{"_id":{"clusterTime":{"$timestamp":{"t":1630681816,"i":10}},"ts":{"$timestamp":{"t":1630681816,"i":7}}}}}}
      ...
      [js_test:resharding_fuzzer-120e1-1630680862386-9] d20024| 2021-09-03T15:20:39.030+00:00 W  RESHARD  5127200 [OplogFetcher-985f9d35-1d7c-4735-9252-e1a426999574-shard0] "Error while fetching, retrying.","attr":{"error":{"code":11000,"codeName":"DuplicateKey","errmsg":"E11000 duplicate key error collection: config.localReshardingOplogBuffer.5163afeb-5991-43f6-a2e2-851827f35529.shard0 index: _id_ dup key: { _id: { clusterTime: Timestamp(1630681816, 10), ts: Timestamp(1630681816, 7) } }","keyPattern":{"_id":1},"keyValue":{"_id":{"clusterTime":{"$timestamp":{"t":1630681816,"i":10}},"ts":{"$timestamp":{"t":1630681816,"i":7}}}}}}
      [js_test:resharding_fuzzer-120e1-1630680862386-9] assert.soon failed, msg : Timed out waiting for failpoint to be hit. Failpoint: reshardingPauseCoordinatorBeforeDecisionPersisted The hang analyzer is automatically called in assert.soon functions. If you are *expecting* assert.soon to possibly fail, call assert.soon with {runHangAnalyzer: false} as the fifth argument (you can fill unused arguments with `undefined`). Running hang analyzer from assert.soon.
      

      https://evergreen.mongodb.com/lobster/build/525eaf3cc1abdad83ac549b6da098c9d/test/61323aa854f2483d51452425#bookmarks=0%2C36209%2C37732%2C94544%2C95519&f~=000~%5C%5BResharding.%2AService&f~=100~d20024%5C%7C&f~=110~%28REPL_HB%7CELECTION%29&f~=100~DuplicateKey


          People

            Assignee: max.hirschhorn@mongodb.com Max Hirschhorn
            Reporter: max.hirschhorn@mongodb.com Max Hirschhorn
