[SERVER-73955] ReshardingMetrics Can Outlive ReshardingCoordinatorService Created: 13/Feb/23  Updated: 29/Oct/23  Resolved: 25/Apr/23

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 7.1.0-rc0, 7.0.0-rc1

Type: Bug Priority: Major - P3
Reporter: Brett Nawrocki Assignee: Brett Nawrocki
Resolution: Fixed Votes: 0
Labels: new-eng, sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
is related to SERVER-67370 Resharding metrics future callback ca... Closed
Assigned Teams:
Sharding NYC
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v7.0, v6.3
Steps To Reproduce:

Build with the below git patch and then run the provided db_s_config_server_test invocation.

build/install/bin/db_s_config_server_test --filter StepDownStepUpEachTransition

diff --git a/src/mongo/db/s/resharding/resharding_coordinator_service.cpp b/src/mongo/db/s/resharding/resharding_coordinator_service.cpp
index e57ec94edae..be223c6ca3c 100644
--- a/src/mongo/db/s/resharding/resharding_coordinator_service.cpp
+++ b/src/mongo/db/s/resharding/resharding_coordinator_service.cpp
@@ -710,6 +710,21 @@ ReshardingMetrics::CoordinatorState toMetricsState(CoordinatorStateEnum state) {
     return ReshardingMetrics::CoordinatorState(state);
 }
 
+struct Sentinel {
+    Sentinel() : original{true} {}
+    Sentinel(const Sentinel&) = delete;
+    Sentinel(Sentinel&& other) : original{false} {
+        std::swap(original, other.original);
+    }
+    ~Sentinel() {
+        if (original) {
+            sleepmillis(10 * 1000);
+        }
+    }
+
+    bool original;
+};
+
 }  // namespace
 
 namespace resharding {
@@ -1631,7 +1646,7 @@ ExecutorFuture<void> ReshardingCoordinator::_runReshardingOp(
                 .thenRunOn(_coordinatorService->getInstanceCleanupExecutor())
                 .onCompletion([outerStatus](Status) { return outerStatus; });
         })
-        .onCompletion([this, self = shared_from_this()](Status status) {
+        .onCompletion([this, self = shared_from_this(), sentinel = Sentinel{}](Status status) {
             _metrics->onStateTransition(toMetricsState(_coordinatorDoc.getState()), boost::none);
 
             // Destroy metrics early so it's lifetime will not be tied to the lifetime of this

 

Participants:
Linked BF Score: 10
Story Points: 2

 Description   

ShardingDataTransformCumulativeMetrics invariants that newly registered instance metrics are successfully inserted into its map. This invariant would only fail if trying to insert two instance metrics with the same UUID into the set, most likely due to registering the same instance metrics twice.

ReshardingMetrics is an RAII type which registers itself with the cumulative metrics on creation and deregisters itself on destruction, and the ReshardingCoordinatorService holds its metrics in a shared_ptr.

The failing test loops through each coordinator state, steps down, then steps back up again. When stepping back up, PrimaryOnlyService tries to wait for the previous instance to complete before creating the new instance. It does this by waiting for the previous instance's scoped executor to complete and also for its run() method to return. Once these two things are done, PrimaryOnlyService will release its pointer to the previous instance.

However, that previous instance may still exist in memory if something else still holds a pointer to it. In this case, that something else is the resharding coordinator's final continuation to its run() method, which captures a shared_ptr to itself. This continuation is running on the PrimaryOnlyService's cleanup executor, which unlike the scoped executor, is not joined before stepping up.

This means that there is a race between PrimaryOnlyService stepping up the new instance and the cleanup executor allowing the callback lambda to go out of scope. As a result, its possible for the new instance to get created and attempt to register its ReshardingMetrics (with the same UUID), before the previous instance is destroyed and deregisters its metrics.

A similar issue was previously seen and fixed by SERVER-67370, which reset the metrics pointer held by the ReshardingCoordinatorService in the final callback. However, there is still a pointer to the metrics held by the coordinator if the commit monitor was started, since the commit monitor also holds a pointer to the metrics.



 Comments   
Comment by Githook User [ 04/May/23 ]

Author:

{'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}

Message: SERVER-73955 Prevent metrics from outliving ReshardingCoordinator

(cherry picked from commit a932d62b42431ad9daf8a374a8f5d626968b8195)
Branch: v7.0
https://github.com/mongodb/mongo/commit/0e35024f70e580429c727c99bc6d53f0fca72226

Comment by Githook User [ 21/Apr/23 ]

Author:

{'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}

Message: SERVER-73955 Prevent metrics from outliving ReshardingCoordinator
Branch: master
https://github.com/mongodb/mongo/commit/a932d62b42431ad9daf8a374a8f5d626968b8195

Generated at Thu Feb 08 06:26:06 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.