Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-73955

ReshardingMetrics Can Outlive ReshardingCoordinatorService

    • Sharding NYC
    • Fully Compatible
    • ALL
    • v7.0, v6.3
    • Hide

      Build with the below git patch and then run the provided db_s_config_server_test invocation.

      build/install/bin/db_s_config_server_test --filter StepDownStepUpEachTransition
      
      Unable to find source-code formatter for language: diff. Available languages are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yaml
      diff --git a/src/mongo/db/s/resharding/resharding_coordinator_service.cpp b/src/mongo/db/s/resharding/resharding_coordinator_service.cpp
      index e57ec94edae..be223c6ca3c 100644
      --- a/src/mongo/db/s/resharding/resharding_coordinator_service.cpp
      +++ b/src/mongo/db/s/resharding/resharding_coordinator_service.cpp
      @@ -710,6 +710,21 @@ ReshardingMetrics::CoordinatorState toMetricsState(CoordinatorStateEnum state) {
           return ReshardingMetrics::CoordinatorState(state);
       }
       
      +struct Sentinel {
      +    Sentinel() : original{true} {}
      +    Sentinel(const Sentinel&) = delete;
      +    Sentinel(Sentinel&& other) : original{false} {
      +        std::swap(original, other.original);
      +    }
      +    ~Sentinel() {
      +        if (original) {
      +            sleepmillis(10 * 1000);
      +        }
      +    }
      +
      +    bool original;
      +};
      +
       }  // namespace
       
       namespace resharding {
      @@ -1631,7 +1646,7 @@ ExecutorFuture<void> ReshardingCoordinator::_runReshardingOp(
                       .thenRunOn(_coordinatorService->getInstanceCleanupExecutor())
                       .onCompletion([outerStatus](Status) { return outerStatus; });
               })
      -        .onCompletion([this, self = shared_from_this()](Status status) {
      +        .onCompletion([this, self = shared_from_this(), sentinel = Sentinel{}](Status status) {
                   _metrics->onStateTransition(toMetricsState(_coordinatorDoc.getState()), boost::none);
       
                   // Destroy metrics early so it's lifetime will not be tied to the lifetime of this
      

       

      Show
      Build with the below git patch and then run the provided db_s_config_server_test invocation. build/install/bin/db_s_config_server_test --filter StepDownStepUpEachTransition Unable to find source-code formatter for language: diff. Available languages are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yaml diff --git a/src/mongo/db/s/resharding/resharding_coordinator_service.cpp b/src/mongo/db/s/resharding/resharding_coordinator_service.cpp index e57ec94edae..be223c6ca3c 100644 --- a/src/mongo/db/s/resharding/resharding_coordinator_service.cpp +++ b/src/mongo/db/s/resharding/resharding_coordinator_service.cpp @@ -710,6 +710,21 @@ ReshardingMetrics::CoordinatorState toMetricsState(CoordinatorStateEnum state) { return ReshardingMetrics::CoordinatorState(state); } +struct Sentinel { + Sentinel() : original{ true } {} + Sentinel( const Sentinel&) = delete; + Sentinel(Sentinel&& other) : original{ false } { + std::swap(original, other.original); + } + ~Sentinel() { + if (original) { + sleepmillis(10 * 1000); + } + } + + bool original; +}; + } // namespace namespace resharding { @@ -1631,7 +1646,7 @@ ExecutorFuture<void> ReshardingCoordinator::_runReshardingOp( .thenRunOn(_coordinatorService->getInstanceCleanupExecutor()) .onCompletion([outerStatus](Status) { return outerStatus; }); }) - .onCompletion([ this , self = shared_from_this()](Status status) { + .onCompletion([ this , self = shared_from_this(), sentinel = Sentinel{}](Status status) { _metrics->onStateTransition(toMetricsState(_coordinatorDoc.getState()), boost::none); // Destroy metrics early so it's lifetime will not be tied to the lifetime of this  
    • 10
    • 2

      ShardingDataTransformCumulativeMetrics invariants that newly registered instance metrics are successfully inserted into its map. This invariant would only fail if trying to insert two instance metrics with the same UUID into the set, most likely due to registering the same instance metrics twice.

      ReshardingMetrics is an RAII type which registers itself with the cumulative metrics on creation and deregisters itself on destruction, and the ReshardingCoordinatorService holds its metrics in a shared_ptr.

      The failing test loops through each coordinator state, steps down, then steps back up again. When stepping back up, PrimaryOnlyService tries to wait for the previous instance to complete before creating the new instance. It does this by waiting for the previous instance's scoped executor to complete and also for its run() method to return. Once these two things are done, PrimaryOnlyService will release its pointer to the previous instance.

      However, that previous instance may still exist in memory if something else still holds a pointer to it. In this case, that something else is the resharding coordinator's final continuation to its run() method, which captures a shared_ptr to itself. This continuation is running on the PrimaryOnlyService's cleanup executor, which unlike the scoped executor, is not joined before stepping up.

      This means that there is a race between PrimaryOnlyService stepping up the new instance and the cleanup executor allowing the callback lambda to go out of scope. As a result, its possible for the new instance to get created and attempt to register its ReshardingMetrics (with the same UUID), before the previous instance is destroyed and deregisters its metrics.

      A similar issue was previously seen and fixed by SERVER-67370, which reset the metrics pointer held by the ReshardingCoordinatorService in the final callback. However, there is still a pointer to the metrics held by the coordinator if the commit monitor was started, since the commit monitor also holds a pointer to the metrics.

            Assignee:
            brett.nawrocki@mongodb.com Brett Nawrocki
            Reporter:
            brett.nawrocki@mongodb.com Brett Nawrocki
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Created:
              Updated:
              Resolved: