[SERVER-59812] ReshardingMetrics::onStepDown() is called while data replication components are still running, leading to an invariant failure
Created: 07/Sep/21  Updated: 29/Oct/23  Resolved: 09/Sep/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 5.0.0
Fix Version/s: 5.0.4, 5.1.0-rc0

Type: Bug
Priority: Major - P3
Reporter: Max Hirschhorn
Assignee: Luis Osta (Inactive)
Resolution: Fixed
Votes: 0
Labels: PM-234-M3, PM-234-T-lifecycle
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-53351 Add resharding fuzzer task with step-... Closed
Related
is related to SERVER-56658 Use the cleanup executor to fulfill r... Closed
is related to SERVER-57263 Use resharding metrics stepUp/stepDow... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0
Sprint: Sharding 2021-09-20
Participants:
Story Points: 1

 Description   

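RecipientStateMachine::_runMandatoryCleanup() must wait for _dataReplicationQuiesced to become ready before calling ReshardingMetrics::onStepDown(). Otherwise, a data replication component (e.g. ReshardingCollectionCloner) may still be running and may try to tick a ReshardingMetrics counter after the metrics have already been stepped down, which trips the "No operation is in progress" invariant. The current code calls onStepDown() before waiting on _dataReplicationQuiesced: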

ExecutorFuture<void> ReshardingRecipientService::RecipientStateMachine::_runMandatoryCleanup(
    Status status, const CancellationToken& stepdownToken) {
    if (stepdownToken.isCanceled()) {
        // Interrupt occurred, ensure the metrics get shut down.
        _metrics()->onStepDown(ReshardingMetrics::Role::kRecipient);
    }
 
    return _dataReplicationQuiesced.thenRunOn(_recipientService->getInstanceCleanupExecutor())
        .onCompletion([this, self = shared_from_this(), outerStatus = status](
                          Status dataReplicationHaltStatus) {
            // Wait for all of the data replication components to halt. We ignore any data
            // replication errors because resharding is known to have failed already.
            stdx::lock_guard<Latch> lk(_mutex);
            ensureFulfilledPromise(lk, _completionPromise, outerStatus);
 
            return outerStatus;
        });
}
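
The ordering problem can be reproduced outside the server with a small toy model. The sketch below is not MongoDB code: ToyMetrics, runCleanup and the start gate are hypothetical names invented for illustration, and std::promise/std::future stand in for _dataReplicationQuiesced. Where the real ReshardingMetrics would hit an invariant, the toy only counts the ticks that arrive after step-down. The gate makes the bad interleaving deterministic here; in the server it is a race.

```cpp
#include <atomic>
#include <future>
#include <thread>

// Hypothetical stand-in for ReshardingMetrics. It records ticks that arrive
// after step-down instead of failing an invariant.
class ToyMetrics {
public:
    void onStepUp() { _active.store(true); }
    void onStepDown() { _active.store(false); }

    void tick() {
        if (_active.load()) {
            ++_accepted;
        } else {
            ++_rejected;  // the real code would fail "No operation is in progress"
        }
    }

    int accepted() const { return _accepted.load(); }
    int rejected() const { return _rejected.load(); }

private:
    std::atomic<bool> _active{false};
    std::atomic<int> _accepted{0};
    std::atomic<int> _rejected{0};
};

struct Result {
    int accepted;
    int rejected;
};

// Runs one toy "cloner" thread that ticks the metrics `batches` times and then
// signals that data replication has quiesced.
// waitForQuiesceFirst == false models the order reported in this ticket.
// waitForQuiesceFirst == true models stepping down only after quiescing.
Result runCleanup(bool waitForQuiesceFirst, int batches) {
    ToyMetrics metrics;
    metrics.onStepUp();

    std::promise<void> startCloning;  // gate used only to make the order deterministic
    std::future<void> start = startCloning.get_future();
    std::promise<void> quiescedPromise;
    std::future<void> dataReplicationQuiesced = quiescedPromise.get_future();

    std::thread cloner([&] {
        start.wait();
        for (int i = 0; i < batches; ++i) {
            metrics.tick();
        }
        quiescedPromise.set_value();
    });

    if (waitForQuiesceFirst) {
        startCloning.set_value();
        dataReplicationQuiesced.wait();  // all ticks have happened by now
        metrics.onStepDown();
    } else {
        metrics.onStepDown();  // metrics torn down while the cloner can still tick
        startCloning.set_value();
        dataReplicationQuiesced.wait();
    }

    cloner.join();
    return {metrics.accepted(), metrics.rejected()};
}
```

Calling runCleanup(false, 5) leaves every tick rejected, while runCleanup(true, 5) accepts all five, which is the effect of moving the step-down call into the continuation that runs after _dataReplicationQuiesced is ready.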


[js_test:resharding_fuzzer-120e1-1630670493876-3] d20026| 2021-09-03T12:07:52.729+00:00 F  ASSERT   23081   [ReshardingRecipientService-2] "Invariant failure","attr":{"expr":"_currentOp","msg":"No operation is in progress","file":"src/mongo/db/s/resharding/resharding_metrics.cpp","line":596}
...
[js_test:resharding_fuzzer-120e1-1630670493876-3] d20026| 2021-09-03T12:07:52.988+00:00 I  CONTROL  31445   [ReshardingRecipientService-2] "Frame","attr":{"frame":{"a":"55D0EDE6704C","b":"55D0D9F99000","o":"13ECE04C","s":"_ZN5mongo22invariantFailedWithMsgEPKcRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES1_j","s+":"10C"}}
[js_test:resharding_fuzzer-120e1-1630670493876-3] d20026| 2021-09-03T12:07:52.988+00:00 I  CONTROL  31445   [ReshardingRecipientService-2] "Frame","attr":{"frame":{"a":"55D0EA6D26FB","b":"55D0D9F99000","o":"107396FB","s":"_ZN5mongo17ReshardingMetrics30onCollClonerFillBatchForInsertENS_8DurationISt5ratioILl1ELl1000EEEE","s+":"1AB"}}
[js_test:resharding_fuzzer-120e1-1630670493876-3] d20026| 2021-09-03T12:07:52.988+00:00 I  CONTROL  31445   [ReshardingRecipientService-2] "Frame","attr":{"frame":{"a":"55D0EA577026","b":"55D0D9F99000","o":"105DE026","s":"_ZN5mongo26ReshardingCollectionCloner10doOneBatchEPNS_16OperationContextERNS_8PipelineE","s+":"E6"}}
[js_test:resharding_fuzzer-120e1-1630670493876-3] d20026| 2021-09-03T12:07:52.988+00:00 I  CONTROL  31445   [ReshardingRecipientService-2] "Frame","attr":{"frame":{"a":"55D0EA57B0BF","b":"55D0D9F99000","o":"105E20BF","s":"_ZN5mongo19makeReadyFutureWithIRZNS_26ReshardingCollectionCloner3runESt10shared_ptrINS_8executor12TaskExecutorEES5_NS_17CancellationTokenENS_33CancelableOperationContextFactoryEE3$_4Li0EEENS_6FutureINS_14future_details17UnwrappedTypeImplINSt13invoke_resultIOT_JEE4typeEE4typeEEESF_","s+":"DF"}}

https://evergreen.mongodb.com/lobster/build/adbede2ae05d3e03fc66c712aebde8c8/test/61320fed54f2483d513d41a8#bookmarks=0%2C25708%2C25722%2C25801%2C121140%2C122300&f~=000~%5C%5BResharding.%2AService&f~=100~d20026%5C%7C&l=1

https://github.com/mongodb/mongo/blob/43479818bd01f27ee25b6e992045529d2ac0185a/src/mongo/db/s/resharding/resharding_metrics.cpp#L596



 Comments   
Comment by Max Hirschhorn [ 20/Nov/21 ]

Due to a bad merge on the 5.0 branch, this issue wasn't actually addressed in MongoDB version 5.0.4. It is now fixed as part of my changes from SERVER-61633.

Comment by Githook User [ 20/Nov/21 ]

Author: Max Hirschhorn <max.hirschhorn@mongodb.com> (visemet)

Message: SERVER-61633 Join _oplogFetcherExecutor in resharding recipient at exit.

Also corrects the 5.0 backport of
1bd1c4f6a0d571443a80c52d1b3f284a0c078af4 from SERVER-59812 and leaves
the ReshardingMetrics intact until the resharding data replication
components have quiesced.

(cherry picked from commit 34cac37ac5a61946aae9d149c8cb2f1d109e7320)
Branch: v5.0
https://github.com/mongodb/mongo/commit/3d22412e0eed75c96771a849d4e98e3309f458f0

Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 22/Sep/21 ]

Author: Luis Osta <luis.osta@mongodb.com> (LuisOsta)

Message: SERVER-59812 Moved metrics stepDown inside onCompletion continuation
Branch: v5.0
https://github.com/mongodb/mongo/commit/910be0c7acc798cfcfe64c9c9a7d664f0f3c199f

Comment by Githook User [ 09/Sep/21 ]

Author: Luis Osta <luis.osta@mongodb.com> (LuisOsta)

Message: SERVER-59812 Moved metrics stepDown inside onCompletion continuation
Branch: master
https://github.com/mongodb/mongo/commit/1bd1c4f6a0d571443a80c52d1b3f284a0c078af4
