[SERVER-54460] Resharding may delete the state document before fully completing Created: 11/Feb/21  Updated: 29/Oct/23  Resolved: 04/May/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.9.0-rc1, 5.0.0-rc0

Type: Bug Priority: Major - P3
Reporter: Andrew Shuvalov (Inactive) Assignee: Cheahuychou Mao
Resolution: Fixed Votes: 0
Labels: PM-234-M3, PM-234-T-lifecycle
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Duplicate
is duplicated by SERVER-52837 Gracefully shut down resharding state... Closed
Related
related to SERVER-60775 PrimaryOnlyService won't wait for pri... Backlog
related to SERVER-57195 Convert resharding document deletion ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.9
Steps To Reproduce:

1. Change PrimaryOnlyServiceOpObserver::onDelete() in src/mongo/db/repl/primary_only_service_op_observer.cpp to release the instance with an error:

service->releaseInstance(
    documentId,
    Status(ErrorCodes::Interrupted,
           str::stream() << "State document " << documentId << " is dropped",
           BSON("documentId" << documentId)));

2. Run the test:

buildscripts/resmoke.py run --suite=sharding --repeat 1 --mongodSetParameters="{ featureFlagTenantMigrations: true }" jstests/sharding/api_params_nontransaction_sharded.js

It will fail with:

Error: command failed: {
    ...
    "errmsg" : "State document { _id: UUID(\"2e1c206c-d618-4f8c-ba0f-247637bea29c\") } is dropped",
    ...

Sprint: Sharding 2021-05-03, Sharding 2021-05-17
Participants:
Story Points: 2

 Description   

I do not claim that this issue causes actual production failures, but it was a real problem that blocked me from fully implementing SERVER-53950.

The idea I was trying to implement in SERVER-53950 was that we should always interrupt the primary-only service instance whenever we unregister it. One of the things that unregisters the service is the deletion of its state document.

However, if I make that change as discussed in that ticket, resharding fails at the moment the state document is deleted, before it has fully completed. I do not see a simple fix myself.
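
To make the race concrete, here is a rough, stand-alone sketch of the ordering problem (a toy model only, not MongoDB source; every name in it is made up for illustration):

// Rough stand-alone illustration of the race (toy model, not MongoDB source;
// every name below is hypothetical). Deleting the state document notifies an
// observer, which interrupts the instance before the instance fulfills its
// completion promise, so a waiter never sees a clean completion even though
// the operation finished its work.
#include <chrono>
#include <functional>
#include <future>
#include <iostream>

int main() {
    std::promise<void> completionPromise;
    bool interrupted = false;

    // Hypothetical op observer behavior: interrupt on state-document delete.
    std::function<void()> onStateDocDelete = [&] { interrupted = true; };

    // What the resharding instance does as its final steps:
    onStateDocDelete();                // 1. delete the state document (last step)
    if (!interrupted) {                // 2. only then fulfill the completion promise
        completionPromise.set_value(); //    ...but the interrupt already happened.
    }

    auto fut = completionPromise.get_future();
    std::cout << "completed cleanly: "
              << (fut.wait_for(std::chrono::seconds(0)) == std::future_status::ready)
              << "\n";                 // prints 0: completion was never signalled
}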



 Comments   
Comment by Githook User [ 04/May/21 ]

Author:

{'name': 'Cheahuychou Mao', 'email': 'mao.cheahuychou@gmail.com', 'username': 'cheahuychou'}

Message: SERVER-54460 Ensure that opCtx that waits on PrimaryOnlyService completion promise gets killed on stepdown

(cherry picked from commit c02a82e18fe3fc3cf9ed76962fe05c22bf376332)
Branch: v4.9
https://github.com/mongodb/mongo/commit/5432326eb039374337e3780406dcf4646e69d1b8

Comment by Githook User [ 04/May/21 ]

Author:

{'name': 'Cheahuychou Mao', 'email': 'mao.cheahuychou@gmail.com', 'username': 'cheahuychou'}

Message: SERVER-54460 Ensure that opCtx that waits on PrimaryOnlyService completion promise gets killed on stepdown
Branch: master
https://github.com/mongodb/mongo/commit/c02a82e18fe3fc3cf9ed76962fe05c22bf376332

Comment by Max Hirschhorn [ 20/Feb/21 ]

I chatted with Andrew and Lingzhi over Zoom about this ticket on Friday.

There is a bug in ReshardingCoordinator where it is possible for _completionPromise to never be set. This is because the coordinator state document is removed in an earlier lambda function than the one which fulfills _completionPromise. It is therefore possible for the ScopedTaskExecutor to be shut down and for the onCompletion() callback to never be called.

  • The primary-only services for tenant migrations get around this by scheduling tasks on the parent task executor (i.e. the ThreadPoolTaskExecutor underlying the ScopedTaskExecutor, which isn't ever shut down). I don't think resharding needs to do the same. My impression is that registering an onCommit() handler which fulfills _completionPromise would achieve the same result.
  • Of course, we would also need to fix PrimaryOnlyServiceOpObserver so releaseInstance() and releaseAllInstances() are called in onCommit() handlers too. PrimaryOnlyServiceOpObserver currently deregisters the primary-only service Instance before the storage transaction actually commits (and without regard to whether it might roll back).
  • The storage integration layer guarantees that onCommit() handlers are called in the order they were registered (see the small sketch below). The sequence then becomes: (1) the primary-only service Instance registers an onCommit() handler to fulfill _completionPromise, (2) it deletes the state document, and (3) PrimaryOnlyServiceOpObserver::onDelete() is called while the storage transaction is still active and registers an onCommit() handler to call interrupt() on the Instance. The call to interrupt() on the Instance is then a no-op, as desired.
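
As a stand-alone illustration of that ordering guarantee (a toy model only; FakeRecoveryUnit and its methods are made up for this example and are not the real storage API):

// Toy model (not MongoDB source) of the ordering described above: onCommit()
// handlers run in the order they were registered, so the handler that fulfills
// _completionPromise fires before the one registered by the op observer.
#include <functional>
#include <iostream>
#include <vector>

struct FakeRecoveryUnit {
    std::vector<std::function<void()>> handlers;
    void onCommit(std::function<void()> fn) { handlers.push_back(std::move(fn)); }
    void commit() {
        for (auto& fn : handlers) fn();  // invoked in registration order
    }
};

int main() {
    FakeRecoveryUnit ru;
    bool completed = false;

    // (1) The Instance registers a handler to fulfill its completion promise.
    ru.onCommit([&] { completed = true; });
    // (2) The state document is deleted inside the same storage transaction...
    // (3) ...so the op observer's handler, registered later, runs second and
    //     finds an already-completed Instance; its interrupt() is a no-op.
    ru.onCommit([&] { std::cout << "interrupt is a no-op: " << completed << "\n"; });

    ru.commit();  // prints "interrupt is a no-op: 1"
}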

The following patch is an implementation of my idea to use an onCommit() handler. The *reshard*.js tests pass locally with the exception of a known failure in resharding_clones_duplicate_key.js.

For whoever picks this ticket up, I think we should additionally make PrimaryOnlyService::releaseInstance() and PrimaryOnlyService::Instance::interrupt() noexcept to reflect intent. Functions in onCommit() handlers are not allowed to throw anyway.
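
A stand-alone sketch of the intent behind that noexcept suggestion (all names below are illustrative, not the real primary_only_service.h API):

// Illustrative only: marking a function called from an onCommit() handler as
// noexcept makes the "must not throw" contract visible at the call site; if it
// ever did throw, std::terminate would be called, which is the right outcome
// for code running inside a commit handler.
#include <iostream>

struct FakeInstance {
    void interrupt(int errorCode) noexcept {
        std::cout << "interrupted with code " << errorCode << "\n";
    }
};

int main() {
    FakeInstance instance;
    static_assert(noexcept(instance.interrupt(0)),
                  "interrupt() is statically known not to throw");
    instance.interrupt(11601);  // e.g. an Interrupted error code
}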

diff --git a/src/mongo/db/repl/primary_only_service_op_observer.cpp b/src/mongo/db/repl/primary_only_service_op_observer.cpp
index 711b7e765a..0decbba0a8 100644
--- a/src/mongo/db/repl/primary_only_service_op_observer.cpp
+++ b/src/mongo/db/repl/primary_only_service_op_observer.cpp
@@ -70,10 +70,15 @@ void PrimaryOnlyServiceOpObserver::onDelete(OperationContext* opCtx,
     if (!service) {
         return;
     }
-    // Passing OK() as an argument does not invoke the interrupt() method on the instance.
-    // TODO(SERVER-54460): when state document deletion race is fixed in resharding, release with
-    // error as in 'onDropCollection()'.
-    service->releaseInstance(documentId, Status::OK());
+
+    opCtx->recoveryUnit()->onCommit(
+        [service, documentId](boost::optional<Timestamp> unusedCommitTime) {
+            service->releaseInstance(
+                documentId,
+                Status(ErrorCodes::Interrupted,
+                       str::stream() << "State document " << documentId << " is dropped",
+                       BSON("documentId" << documentId)));
+        });
 }
 
 
diff --git a/src/mongo/db/s/resharding/resharding_coordinator_service.cpp b/src/mongo/db/s/resharding/resharding_coordinator_service.cpp
index fad9f59d0d..d44cb994cd 100644
--- a/src/mongo/db/s/resharding/resharding_coordinator_service.cpp
+++ b/src/mongo/db/s/resharding/resharding_coordinator_service.cpp
@@ -40,6 +40,8 @@
 #include "mongo/db/s/resharding/resharding_metrics.h"
 #include "mongo/db/s/resharding/resharding_server_parameters_gen.h"
 #include "mongo/db/s/resharding_util.h"
+#include "mongo/db/session_catalog_mongod.h"
+#include "mongo/db/transaction_participant.h"
 #include "mongo/db/vector_clock.h"
 #include "mongo/logv2/log.h"
 #include "mongo/rpc/get_status_from_command_result.h"
@@ -822,8 +824,10 @@ void writeStateTransitionAndCatalogUpdatesThenBumpShardVersions(
         opCtx, coordinatorDoc, std::move(changeMetadataFunc));
 }
 
+template <typename Callback>
 void removeCoordinatorDocAndReshardingFields(OperationContext* opCtx,
-                                             const ReshardingCoordinatorDocument& coordinatorDoc) {
+                                             const ReshardingCoordinatorDocument& coordinatorDoc,
+                                             Callback onCommitCallback) {
     invariant(coordinatorDoc.getState() == CoordinatorStateEnum::kDecisionPersisted);
 
     ReshardingCoordinatorDocument updatedCoordinatorDoc = coordinatorDoc;
@@ -831,6 +835,15 @@ void removeCoordinatorDocAndReshardingFields(OperationContext* opCtx,
 
     executeStateTransitionAndMetadataChangesInTxn(
         opCtx, updatedCoordinatorDoc, [&](OperationContext* opCtx, TxnNumber txnNumber) {
+            {
+                opCtx->setTxnNumber(txnNumber);
+                MongoDOperationContextSession ocs(opCtx);
+                auto txnParticipant = TransactionParticipant::get(opCtx);
+                txnParticipant.unstashTransactionResources(opCtx, "delete");
+                opCtx->recoveryUnit()->onCommit(onCommitCallback);
+                txnParticipant.stashTransactionResources(opCtx);
+            }
+
             // Remove entry for this resharding operation from config.reshardingOperations
             writeToCoordinatorStateNss(opCtx, updatedCoordinatorDoc, txnNumber);
 
@@ -969,9 +982,7 @@ SemiFuture<void> ReshardingCoordinatorService::ReshardingCoordinator::run(
                 return;
             }
 
-            if (status.isOK()) {
-                _completionPromise.emplaceValue();
-            } else {
+            if (!status.isOK()) {
                 _completionPromise.setError(status);
             }
         })
@@ -1166,10 +1177,15 @@ ExecutorFuture<void> ReshardingCoordinatorService::ReshardingCoordinator::
 
     return whenAllSucceed(std::move(futures))
         .thenRunOn(**executor)
-        .then([executor](const auto& coordinatorDocsChangedOnDisk) {
+        .then([this, executor](const auto& coordinatorDocsChangedOnDisk) {
             auto opCtx = cc().makeOperationContext();
-            resharding::removeCoordinatorDocAndReshardingFields(opCtx.get(),
-                                                                coordinatorDocsChangedOnDisk[1]);
+            resharding::removeCoordinatorDocAndReshardingFields(
+                opCtx.get(),
+                coordinatorDocsChangedOnDisk[1],
+                [this](boost::optional<Timestamp> unusedCommitTime) {
+                    stdx::lock_guard<Latch> lk(_mutex);
+                    _completionPromise.emplaceValue();
+                });
         });
 }
 
diff --git a/src/mongo/db/s/resharding/resharding_donor_service.cpp b/src/mongo/db/s/resharding/resharding_donor_service.cpp
index 8b6bfab71a..0b5db0b00a 100644
--- a/src/mongo/db/s/resharding/resharding_donor_service.cpp
+++ b/src/mongo/db/s/resharding/resharding_donor_service.cpp
@@ -40,6 +40,7 @@
 #include "mongo/db/dbdirectclient.h"
 #include "mongo/db/index_builds_coordinator.h"
 #include "mongo/db/op_observer.h"
+#include "mongo/db/ops/delete.h"
 #include "mongo/db/persistent_task_store.h"
 #include "mongo/db/repl/repl_client_info.h"
 #include "mongo/db/s/resharding/resharding_data_copy_util.h"
@@ -176,10 +177,12 @@ SemiFuture<void> ReshardingDonorService::DonorStateMachine::run(
             return status;
         })
         .onCompletion([this, self = shared_from_this()](Status status) {
-            stdx::lock_guard<Latch> lg(_mutex);
-            if (_completionPromise.getFuture().isReady()) {
-                // interrupt() was called before we got here.
-                return;
+            {
+                stdx::lock_guard<Latch> lg(_mutex);
+                if (_completionPromise.getFuture().isReady()) {
+                    // interrupt() was called before we got here.
+                    return;
+                }
             }
 
             if (status.isOK()) {
@@ -188,8 +191,8 @@ SemiFuture<void> ReshardingDonorService::DonorStateMachine::run(
                 // the instance is deleted. It is necessary to use shared_from_this() to extend the
                 // lifetime so the code can safely finish executing.
                 _removeDonorDocument();
-                _completionPromise.emplaceValue();
             } else {
+                stdx::lock_guard<Latch> lg(_mutex);
                 _completionPromise.setError(status);
             }
         })
@@ -528,11 +531,31 @@ void ReshardingDonorService::DonorStateMachine::_updateDonorDocument(
 
 void ReshardingDonorService::DonorStateMachine::_removeDonorDocument() {
     auto opCtx = cc().makeOperationContext();
-    PersistentTaskStore<ReshardingDonorDocument> store(
-        NamespaceString::kDonorReshardingOperationsNamespace);
-    store.remove(opCtx.get(),
-                 BSON(ReshardingDonorDocument::k_idFieldName << _id),
-                 WriteConcerns::kMajorityWriteConcern);
+
+    const auto& nss = NamespaceString::kDonorReshardingOperationsNamespace;
+    writeConflictRetry(opCtx.get(), "DonorStateMachine::_removeDonorDocument", nss.toString(), [&] {
+        AutoGetCollection coll(opCtx.get(), nss, MODE_IX);
+
+        if (!coll) {
+            return;
+        }
+
+        WriteUnitOfWork wuow(opCtx.get());
+
+        opCtx->recoveryUnit()->onCommit([this](boost::optional<Timestamp> unusedCommitTime) {
+            stdx::lock_guard<Latch> lk(_mutex);
+            _completionPromise.emplaceValue();
+        });
+
+        deleteObjects(opCtx.get(),
+                      *coll,
+                      nss,
+                      BSON(ReshardingDonorDocument::k_idFieldName << _id),
+                      true /* justOne */);
+
+        wuow.commit();
+    });
+
     _donorDoc = {};
 }
 
diff --git a/src/mongo/db/s/resharding/resharding_recipient_service.cpp b/src/mongo/db/s/resharding/resharding_recipient_service.cpp
index 6d524a02dd..a36f4f64ae 100644
--- a/src/mongo/db/s/resharding/resharding_recipient_service.cpp
+++ b/src/mongo/db/s/resharding/resharding_recipient_service.cpp
@@ -33,7 +33,9 @@
 
 #include "mongo/db/catalog/rename_collection.h"
 #include "mongo/db/catalog_raii.h"
+#include "mongo/db/concurrency/write_conflict_exception.h"
 #include "mongo/db/dbdirectclient.h"
+#include "mongo/db/ops/delete.h"
 #include "mongo/db/persistent_task_store.h"
 #include "mongo/db/query/collation/collation_spec.h"
 #include "mongo/db/repl/oplog_applier.h"
@@ -232,10 +234,12 @@ SemiFuture<void> ReshardingRecipientService::RecipientStateMachine::run(
             return status;
         })
         .onCompletion([this, self = shared_from_this()](Status status) {
-            stdx::lock_guard<Latch> lg(_mutex);
-            if (_completionPromise.getFuture().isReady()) {
-                // interrupt() was called before we got here.
-                return;
+            {
+                stdx::lock_guard<Latch> lg(_mutex);
+                if (_completionPromise.getFuture().isReady()) {
+                    // interrupt() was called before we got here.
+                    return;
+                }
             }
 
             removeRecipientDocFailpoint.pauseWhileSet();
@@ -246,9 +250,9 @@ SemiFuture<void> ReshardingRecipientService::RecipientStateMachine::run(
                 // tied to the instance is deleted. It is necessary to use shared_from_this() to
                 // extend the lifetime so the code can safely finish executing.
                 _removeRecipientDocument();
-                _completionPromise.emplaceValue();
             } else {
                 // Set error on all promises
+                stdx::lock_guard<Latch> lg(_mutex);
                 _completionPromise.setError(status);
             }
         })
@@ -667,11 +671,33 @@ void ReshardingRecipientService::RecipientStateMachine::_updateRecipientDocument
 
 void ReshardingRecipientService::RecipientStateMachine::_removeRecipientDocument() {
     auto opCtx = cc().makeOperationContext();
-    PersistentTaskStore<ReshardingRecipientDocument> store(
-        NamespaceString::kRecipientReshardingOperationsNamespace);
-    store.remove(opCtx.get(),
-                 BSON(ReshardingRecipientDocument::k_idFieldName << _id),
-                 WriteConcerns::kMajorityWriteConcern);
+
+    const auto& nss = NamespaceString::kRecipientReshardingOperationsNamespace;
+    writeConflictRetry(
+        opCtx.get(), "RecipientStateMachine::_removeRecipientDocument", nss.toString(), [&] {
+            AutoGetCollection coll(opCtx.get(), nss, MODE_IX);
+
+            if (!coll) {
+                return;
+            }
+
+            WriteUnitOfWork wuow(opCtx.get());
+
+            opCtx->recoveryUnit()->onCommit([this](boost::optional<Timestamp> unusedCommitTime) {
+                stdx::lock_guard<Latch> lk(_mutex);
+                _completionPromise.emplaceValue();
+            });
+
+            deleteObjects(opCtx.get(),
+                          *coll,
+                          nss,
+                          BSON(ReshardingRecipientDocument::k_idFieldName << _id),
+                          true /* justOne */);
+
+            wuow.commit();
+        });
+
+
     _recipientDoc = {};
 }

Comment by Max Hirschhorn [ 13/Feb/21 ]

"I would suggest it might be cleaner to match the state doc by uuid and always create a new one on rerun. If this is impossible then this ticket is impossible, too."

Great points. The _id of the state documents in resharding is already a UUIDv4 value. A new resharding operation will use a fresh UUIDv4 value even when run for the same namespace again.

I think what makes this ticket difficult to address is that the user-facing reshardCollection command accepts the namespace string rather than the reshardingUUID. This is to be consistent with how the namespace string is the interface for all other MongoDB commands. The config server primary is responsible for generating the reshardingUUID but must also ensure a resharding operation isn't already in-progress for the collection. (SERVER-52730 will actually make the config server primary enforce there are no resharding operations in-progress for any collection in the sharded cluster.) Rather than the phrasing "Resharding may delete the state document before fully completing", I would say that resharding completing is defined as the coordinator state document having been removed.

I feel the reason why the donorStartMigration command can safely expose its UUIDv4 value is that the sender is still going to be running [MongoDB the company]'s code, which can be vetted to avoid misuse.

Happy to chat more over Zoom on Tuesday.

Comment by Andrew Shuvalov (Inactive) [ 13/Feb/21 ]

Got you. I would suggest it might be cleaner to match the state doc by uuid and always create a new one on rerun. If this is impossible then this ticket is impossible, too.

I just want you to be aware that this race prevented a change I was trying to make. Specifically, I was trying to always interrupt any service instance when its state document is deleted, from the observer. It was fine for the tenant migration service because there the state document is garbage collected, but it broke the resharding service because there the state document deletion is part of the flow. In my experiment, the interrupt made the service fail.

Comment by Max Hirschhorn [ 11/Feb/21 ]

andrew.shuvalov, let me provide some details on what resharding is hoping to achieve. The goal is to make it possible to immediately rerun the reshardCollection command (either on the same namespace or a different one) after it returns successfully. We therefore have the ReshardingCoordinator service set up so that the reshardCollection command doesn't return until the coordinator state document has been deleted. This means that the coordinator state document must be deleted before ReshardingCoordinator::_completionPromise is fulfilled.

For comparison, tenant migrations marks the state document as eligible to be reaped by the TTL index. The actual deletion of the state document doesn't happen until later in wall-clock time.

I don't think I have enough context on SERVER-53950 to understand what you wanted to achieve. It seems mostly arbitrary to me for PrimaryOnlyServiceOpObserver::onDelete() to attempt to interrupt the primary-only service Instance. By removing the state document, the primary-only service Instance is saying "I'm done; don't restart me if there's a failover". I don't see why it would need to give attention to being interrupted when the Instance has already declared there's no more work left to do.
