Core Server / SERVER-59812

ReshardingMetrics::onStepDown() is called while data replication components are still running, leading to an invariant failure

    • Fully Compatible
    • ALL
    • v5.0
    • Sharding 2021-09-20
    • 1

      RecipientStateMachine::_runMandatoryCleanup() must wait for _dataReplicationQuiesced to become ready before calling ReshardingMetrics::onStepDown(). Otherwise, a data replication component (e.g. ReshardingCollectionCloner) may still be running and may attempt to tick a ReshardingMetrics counter after the metrics have been shut down, which trips the invariant shown in the log below.

      ExecutorFuture<void> ReshardingRecipientService::RecipientStateMachine::_runMandatoryCleanup(
          Status status, const CancellationToken& stepdownToken) {
          if (stepdownToken.isCanceled()) {
              // Interrupt occurred, ensure the metrics get shut down.
              _metrics()->onStepDown(ReshardingMetrics::Role::kRecipient);
          }
      
          return _dataReplicationQuiesced.thenRunOn(_recipientService->getInstanceCleanupExecutor())
              .onCompletion([this, self = shared_from_this(), outerStatus = status](
                                Status dataReplicationHaltStatus) {
                  // Wait for all of the data replication components to halt. We ignore any data
                  // replication errors because resharding is known to have failed already.
                  stdx::lock_guard<Latch> lk(_mutex);
                  ensureFulfilledPromise(lk, _completionPromise, outerStatus);
      
                  return outerStatus;
              });
      }
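
      The ordering requested above can be illustrated with a small standalone sketch. It is not the actual patch and does not use MongoDB's future or executor types: MetricsSketch, runCloner, and cleanupAfterQuiesce are hypothetical stand-ins, and a blocking std::future takes the place of _dataReplicationQuiesced. The only point is that onStepDown() runs after the data replication component has halted, so no tick can arrive once the metrics are shut down.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <future>
#include <thread>
#include <utility>

// Hypothetical stand-in for ReshardingMetrics. Ticking after onStepDown()
// is the situation in which the real code trips the "_currentOp" invariant.
class MetricsSketch {
public:
    void onStart() { _active.store(true); }
    void onStepDown() { _active.store(false); }

    // Returns false where the real code would hit the invariant.
    bool tick() {
        if (!_active.load()) {
            return false;
        }
        _ticks.fetch_add(1);
        return true;
    }

    int ticks() const { return _ticks.load(); }

private:
    std::atomic<bool> _active{false};
    std::atomic<int> _ticks{0};
};

// Hypothetical stand-in for a data replication component such as the
// collection cloner: ticks once per batch and counts rejected ticks.
int runCloner(MetricsSketch& metrics, int batches) {
    int violations = 0;
    for (int i = 0; i < batches; ++i) {
        std::this_thread::yield();  // give other threads a chance to interleave
        if (!metrics.tick()) {
            ++violations;
        }
    }
    return violations;
}

// Cleanup in the requested order: first wait for data replication to
// quiesce, and only then shut the metrics down.
int cleanupAfterQuiesce(MetricsSketch& metrics, std::future<int> dataReplicationQuiesced) {
    int violations = dataReplicationQuiesced.get();  // blocks until the cloner has halted
    metrics.onStepDown();
    return violations;
}
```

      A caller would start the cloner with std::async(std::launch::async, runCloner, std::ref(metrics), n) and pass the resulting future to cleanupAfterQuiesce(). In the real state machine the equivalent change would presumably be to issue the onStepDown() call from inside the continuation chained on _dataReplicationQuiesced rather than before it; that reading is an assumption based on the description above.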
      

      [js_test:resharding_fuzzer-120e1-1630670493876-3] d20026| 2021-09-03T12:07:52.729+00:00 F  ASSERT   23081   [ReshardingRecipientService-2] "Invariant failure","attr":{"expr":"_currentOp","msg":"No operation is in progress","file":"src/mongo/db/s/resharding/resharding_metrics.cpp","line":596}
      ...
      [js_test:resharding_fuzzer-120e1-1630670493876-3] d20026| 2021-09-03T12:07:52.988+00:00 I  CONTROL  31445   [ReshardingRecipientService-2] "Frame","attr":{"frame":{"a":"55D0EDE6704C","b":"55D0D9F99000","o":"13ECE04C","s":"_ZN5mongo22invariantFailedWithMsgEPKcRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES1_j","s+":"10C"}}
      [js_test:resharding_fuzzer-120e1-1630670493876-3] d20026| 2021-09-03T12:07:52.988+00:00 I  CONTROL  31445   [ReshardingRecipientService-2] "Frame","attr":{"frame":{"a":"55D0EA6D26FB","b":"55D0D9F99000","o":"107396FB","s":"_ZN5mongo17ReshardingMetrics30onCollClonerFillBatchForInsertENS_8DurationISt5ratioILl1ELl1000EEEE","s+":"1AB"}}
      [js_test:resharding_fuzzer-120e1-1630670493876-3] d20026| 2021-09-03T12:07:52.988+00:00 I  CONTROL  31445   [ReshardingRecipientService-2] "Frame","attr":{"frame":{"a":"55D0EA577026","b":"55D0D9F99000","o":"105DE026","s":"_ZN5mongo26ReshardingCollectionCloner10doOneBatchEPNS_16OperationContextERNS_8PipelineE","s+":"E6"}}
      [js_test:resharding_fuzzer-120e1-1630670493876-3] d20026| 2021-09-03T12:07:52.988+00:00 I  CONTROL  31445   [ReshardingRecipientService-2] "Frame","attr":{"frame":{"a":"55D0EA57B0BF","b":"55D0D9F99000","o":"105E20BF","s":"_ZN5mongo19makeReadyFutureWithIRZNS_26ReshardingCollectionCloner3runESt10shared_ptrINS_8executor12TaskExecutorEES5_NS_17CancellationTokenENS_33CancelableOperationContextFactoryEE3$_4Li0EEENS_6FutureINS_14future_details17UnwrappedTypeImplINSt13invoke_resultIOT_JEE4typeEE4typeEEESF_","s+":"DF"}}
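
      The "s" field of each frame above holds an Itanium-ABI mangled symbol. As a reading aid, the sketch below demangles such a symbol with abi::__cxa_demangle from <cxxabi.h> (available with GCC and Clang toolchains); the helper name demangle is an assumption made for illustration. Applied to the second frame, the result should begin with mongo::ReshardingMetrics::onCollClonerFillBatchForInsert, which confirms that the cloner was ticking a metrics counter when the invariant fired.

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <string>

// Hypothetical helper: returns the demangled form of an Itanium-ABI symbol,
// or the input unchanged if demangling fails.
std::string demangle(const std::string& mangled) {
    int status = 0;
    // With a null output buffer, __cxa_demangle allocates the result with malloc.
    char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);  // the caller owns the malloc'ed buffer
    return result;
}
```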
      

      https://evergreen.mongodb.com/lobster/build/adbede2ae05d3e03fc66c712aebde8c8/test/61320fed54f2483d513d41a8#bookmarks=0%2C25708%2C25722%2C25801%2C121140%2C122300&f~=000~%5C%5BResharding.%2AService&f~=100~d20026%5C%7C&l=1

      https://github.com/mongodb/mongo/blob/43479818bd01f27ee25b6e992045529d2ac0185a/src/mongo/db/s/resharding/resharding_metrics.cpp#L596

            Assignee:
            luis.osta@mongodb.com Luis Osta (Inactive)
            Reporter:
            max.hirschhorn@mongodb.com Max Hirschhorn