-
Type: Bug
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: Sharding
-
Sharding NYC
-
Fully Compatible
-
ALL
-
v7.0, v6.3
-
-
10
-
2
ShardingDataTransformCumulativeMetrics invariants that newly registered instance metrics are successfully inserted into its map. This invariant can only fail on an attempt to insert two instance metrics with the same UUID into the set, most likely because the same instance metrics were registered twice.
ReshardingMetrics is an RAII type which registers itself with the cumulative metrics on creation and deregisters itself on destruction, and the ReshardingCoordinatorService holds its metrics in a shared_ptr.
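The registration pattern described above can be sketched as follows. This is a simplified illustration, not the server's code: `CumulativeMetrics`, `InstanceMetrics`, and `duplicateRegistrationIsRejected` are hypothetical stand-ins, the UUID is modelled as a string, and a failed insert is reported as `false` where the real code would trip the invariant.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for ShardingDataTransformCumulativeMetrics.
class CumulativeMetrics {
public:
    // Returns false if an entry with the same UUID is already registered.
    // (Assumption: the real code invariants on this instead of returning it.)
    bool registerInstance(const std::string& uuid, const void* instance) {
        return _instances.emplace(uuid, instance).second;
    }
    void deregisterInstance(const std::string& uuid) {
        _instances.erase(uuid);
    }
    std::size_t size() const {
        return _instances.size();
    }

private:
    std::map<std::string, const void*> _instances;
};

// Hypothetical stand-in for ReshardingMetrics: registers itself on
// construction and deregisters itself on destruction (RAII).
class InstanceMetrics {
public:
    InstanceMetrics(CumulativeMetrics& cumulative, std::string uuid)
        : _cumulative(cumulative),
          _uuid(std::move(uuid)),
          _registered(_cumulative.registerInstance(_uuid, this)) {}

    InstanceMetrics(const InstanceMetrics&) = delete;
    InstanceMetrics& operator=(const InstanceMetrics&) = delete;

    ~InstanceMetrics() {
        // Only remove the entry this object actually owns.
        if (_registered) {
            _cumulative.deregisterInstance(_uuid);
        }
    }

    bool registered() const {
        return _registered;
    }

private:
    CumulativeMetrics& _cumulative;
    std::string _uuid;
    bool _registered;
};

// Registering a second instance with the same UUID while the first is
// still alive is the situation the invariant guards against.
bool duplicateRegistrationIsRejected() {
    CumulativeMetrics cumulative;
    auto first = std::make_shared<InstanceMetrics>(cumulative, "uuid-1");
    InstanceMetrics second(cumulative, "uuid-1");
    return first->registered() && !second.registered() && cumulative.size() == 1;
}
```

In this sketch the second registration succeeds only once every `shared_ptr` to the first instance has been released, which is why the lifetime of the previous instance matters.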
The failing test loops through each coordinator state, steps down, then steps back up again. When stepping back up, PrimaryOnlyService tries to wait for the previous instance to complete before creating the new instance. It does this by waiting for the previous instance's scoped executor to complete and also for its run() method to return. Once these two things are done, PrimaryOnlyService will release its pointer to the previous instance.
However, that previous instance may still exist in memory if something else still holds a pointer to it. In this case, that something else is the final continuation of the resharding coordinator's run() method, which captures a shared_ptr to the coordinator itself. This continuation runs on the PrimaryOnlyService's cleanup executor, which, unlike the scoped executor, is not joined before stepping up.
This means that there is a race between PrimaryOnlyService stepping up the new instance and the cleanup executor allowing the callback lambda to go out of scope. As a result, it's possible for the new instance to be created and attempt to register its ReshardingMetrics (with the same UUID) before the previous instance is destroyed and deregisters its metrics.
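The lifetime extension behind this race can be reproduced deterministically in a single thread. The sketch below is an illustration under assumptions: `Instance`, `CleanupExecutor`, and `observeLifetime` are hypothetical names, and the executor is modelled as a queue that only runs tasks when `drain()` is called, standing in for a cleanup executor that is not joined before step-up.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a PrimaryOnlyService instance.
struct Instance : std::enable_shared_from_this<Instance> {
    // Models the final continuation of run(): it captures a shared_ptr
    // to the instance, so the instance lives as long as the callback does.
    std::function<void()> makeFinalContinuation() {
        return [self = shared_from_this()] {
            (void)self;  // cleanup work would go here
        };
    }
};

// Hypothetical cleanup executor: queued tasks run (and are destroyed)
// only when drain() is called.
struct CleanupExecutor {
    std::vector<std::function<void()>> pending;

    void schedule(std::function<void()> task) {
        pending.push_back(std::move(task));
    }

    void drain() {
        for (auto& task : pending) {
            task();
        }
        pending.clear();  // destroys the lambdas and their captures
    }
};

struct LifetimeObservation {
    bool aliveAfterRelease;
    bool aliveAfterDrain;
};

LifetimeObservation observeLifetime() {
    CleanupExecutor cleanup;
    auto instance = std::make_shared<Instance>();
    std::weak_ptr<Instance> observer = instance;

    cleanup.schedule(instance->makeFinalContinuation());

    // The service releases its pointer once run() has returned...
    instance.reset();
    // ...but the queued continuation still owns the instance.
    const bool aliveAfterRelease = !observer.expired();

    // Only after the cleanup executor lets the callback go out of scope
    // is the instance actually destroyed.
    cleanup.drain();
    const bool aliveAfterDrain = !observer.expired();

    return {aliveAfterRelease, aliveAfterDrain};
}
```

Any new instance constructed between `instance.reset()` and `cleanup.drain()` in this sketch would coexist with the old one, which corresponds to the window in which the duplicate registration can occur.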
A similar issue was previously seen and fixed by SERVER-67370, which reset the metrics pointer held by the ReshardingCoordinatorService in the final callback. However, there is still a pointer to the metrics held by the coordinator if the commit monitor was started, since the commit monitor also holds a pointer to the metrics.
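The remaining reference can be illustrated with a small ownership sketch. All types here (`Metrics`, `CommitMonitor`, `Coordinator`) and the function `observeSharedOwnership` are hypothetical simplifications, and the sketch only demonstrates the ownership problem; it does not claim to show how the ticket was ultimately fixed.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for ReshardingMetrics.
struct Metrics {};

// Hypothetical commit monitor that shares ownership of the metrics.
struct CommitMonitor {
    std::shared_ptr<Metrics> metrics;
};

// Hypothetical coordinator holding both its own metrics pointer and,
// if it was started, a commit monitor.
struct Coordinator {
    std::shared_ptr<Metrics> metrics;
    std::shared_ptr<CommitMonitor> commitMonitor;
};

struct OwnershipObservation {
    bool aliveAfterMetricsReset;
    bool aliveAfterMonitorReset;
};

OwnershipObservation observeSharedOwnership() {
    Coordinator coordinator;
    coordinator.metrics = std::make_shared<Metrics>();
    coordinator.commitMonitor = std::make_shared<CommitMonitor>();
    coordinator.commitMonitor->metrics = coordinator.metrics;

    std::weak_ptr<Metrics> observer = coordinator.metrics;

    // Resetting only the coordinator's own pointer is not enough when
    // the commit monitor was started: it still keeps the metrics alive.
    coordinator.metrics.reset();
    const bool aliveAfterMetricsReset = !observer.expired();

    // The metrics are destroyed only once the last owner lets go.
    coordinator.commitMonitor.reset();
    const bool aliveAfterMonitorReset = !observer.expired();

    return {aliveAfterMetricsReset, aliveAfterMonitorReset};
}
```

In this model the metrics object would deregister itself only after the second reset, so a coordinator that is itself kept alive by a pending callback keeps the registration alive through its commit monitor.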
- is related to
-
SERVER-67370 Resharding metrics future callback can extend lifetime of ReshardingRecipient services
- Closed