Core Server / SERVER-59927

Resharding's RecipientStateMachine::_restoreMetrics() doesn't retry on transient errors, leading to fassert() on stepdown

    • Fully Compatible
    • ALL
    • v5.1, v5.0
    • Sharding 2021-09-20, Sharding 2021-10-04, Sharding 2021-10-18
    • 1

      The changes from fee0349 as part of SERVER-53912 introduced a RecipientStateMachine::_restoreMetrics() method that, when the state machine starts running again, calculates the number of documents it cloned, oplog entries it fetched, and oplog entries it applied. These read operations may be interrupted if the primary steps down shortly after having stepped up. The call to RecipientStateMachine::_restoreMetrics() should be placed in a resharding::WithAutomaticRetry() block so that any transient errors are automatically retried and the retrying is synchronized with cancellation of the stepdown token.
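The retry behavior described above can be illustrated with a small standalone sketch. It does not use the real resharding::WithAutomaticRetry() API. The names TransientError, CancelToken, RetryOutcome, and retryOnTransientError are hypothetical stand-ins, assuming only that a transient error (such as InterruptedDueToReplStateChange) should be retried until the stepdown token is canceled, at which point the loop exits cleanly instead of hitting a fatal assertion.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for a transient failure such as
// InterruptedDueToReplStateChange. Not a MongoDB type.
struct TransientError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for the stepdown cancellation token.
struct CancelToken {
    std::atomic<bool> canceled{false};
    bool isCanceled() const { return canceled.load(); }
};

enum class RetryOutcome { kSucceeded, kCanceled };

// Runs `body` until it succeeds or `token` is canceled.
// TransientError is swallowed and retried; any other exception propagates.
// If `attemptsOut` is non-null, it receives the number of times `body` ran.
inline RetryOutcome retryOnTransientError(const CancelToken& token,
                                          const std::function<void()>& body,
                                          int* attemptsOut = nullptr) {
    assert(body && "body must be callable");
    int attempts = 0;
    RetryOutcome outcome = RetryOutcome::kCanceled;
    while (!token.isCanceled()) {
        ++attempts;
        try {
            body();
            outcome = RetryOutcome::kSucceeded;
            break;
        } catch (const TransientError&) {
            // Fall through: the loop condition re-checks the token, so a
            // stepdown that caused the error also ends the retry loop.
        }
    }
    if (attemptsOut) {
        *attemptsOut = attempts;
    }
    return outcome;
}
```

In this sketch the metrics-restoring reads would be the `body`. A stepdown that interrupts them also cancels the token, so the caller sees kCanceled rather than an unhandled error.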

      As a bonus on this ticket, we should see if it is possible to have the resharding code invariant that all usages of CancelableOperationContextFactory occur within a resharding::WithAutomaticRetry() block.
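One possible shape for such a check is sketched below. It is purely illustrative: RetryScope, insideRetryScope, OpCtx, and GuardedFactory are hypothetical names, not MongoDB types, and the sketch assumes the retry block and the factory call run on the same thread, which may not hold for work scheduled on an executor.

```cpp
#include <cassert>
#include <stdexcept>

namespace sketch {

// Per-thread nesting depth of active retry scopes (hypothetical).
inline int& retryScopeDepth() {
    thread_local int depth = 0;
    return depth;
}

// RAII marker that a retry block would create around its body.
class RetryScope {
public:
    RetryScope() { ++retryScopeDepth(); }
    ~RetryScope() {
        assert(retryScopeDepth() > 0);
        --retryScopeDepth();
    }
    RetryScope(const RetryScope&) = delete;
    RetryScope& operator=(const RetryScope&) = delete;
};

inline bool insideRetryScope() {
    return retryScopeDepth() > 0;
}

// Placeholder for an operation context.
struct OpCtx {};

// Factory that refuses to hand out a context outside a retry scope.
class GuardedFactory {
public:
    OpCtx make() const {
        if (!insideRetryScope()) {
            throw std::logic_error("context requested outside a retry scope");
        }
        return OpCtx{};
    }
};

}  // namespace sketch
```

A real implementation would more likely use the server's own invariant() facility than an exception; the exception here only keeps the sketch self-contained.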


      [2021/09/02 13:54:14.598] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.598+00:00 I  COMMAND  21581   [conn1] "Received replSetStepUp request"
      [2021/09/02 13:54:14.603] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.602+00:00 I  REPL     21358   [ReplCoord-7] "Replica set state transition","attr":{"newState":"PRIMARY","oldState":"SECONDARY"}
      [2021/09/02 13:54:14.615] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.615+00:00 I  REPL     21331   [OplogApplier-0] "Transition to primary complete; database writes are now permitted"
      [2021/09/02 13:54:14.823] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.823+00:00 I  REPL     21402   [conn4] "Stepping down from primary, because a new term has begun","attr":{"term":6}
      ...
      [2021/09/02 13:54:14.825] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.825+00:00 F  RESHARD  5551101 [ReshardingRecipientService-5] "Unrecoverable error occurred past the point recipient was prepared to complete the resharding operation","attr":{"error":"InterruptedDueToReplStateChange: operation was interrupted"}
      [2021/09/02 13:54:14.825] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.825+00:00 F  ASSERT   23089   [ReshardingRecipientService-5] "Fatal assertion","attr":{"msgid":5551101,"file":"src/mongo/db/s/resharding/resharding_recipient_service.cpp","line":404}
      

      https://evergreen.mongodb.com/lobster/evergreen/test/mongodb_mongo_master_enterprise_rhel_80_64_bit_resharding_fuzzer_stepup_1_enterprise_rhel_80_64_bit_patch_23f9d2a53917d63fc3d3b8c8646f40f2bc4caa2f_6130c5d561837d6514713be6_21_09_02_12_44_55/0/6130e1232fd552933c3a0c9a#bookmarks=0%2C26712%2C26783%2C34982%2C35005%2C35160%2C35363%2C35372%2C35373%2C35629%2C153752%2C153798&f~=000~d20021%5C%7C&f~=100~%5C%5BResharding.%2AService&f~=010~%28REPL_HB%7CELECTION%29&f~=011~REPL_HB&l=1

            Assignee:
            brett.nawrocki@mongodb.com Brett Nawrocki
            Reporter:
            max.hirschhorn@mongodb.com Max Hirschhorn