[SERVER-46395] Resumable range deleter can submit duplicate range deletion tasks on step up Created: 25/Feb/20  Updated: 29/Oct/23  Resolved: 13/Mar/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.4.0-rc0, 4.7.0

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Max Hirschhorn
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
causes SERVER-46794 Usage of ScopedTaskExecutor::joinAsyn... Closed
causes SERVER-46859 Use namespace parameter with find fai... Closed
Related
related to SERVER-46849 Range deletion task submission via Op... Closed
is related to SERVER-42192 Write a concurrency workload to test ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Steps To Reproduce:

Apply the following patch, which introduces an invariant into MetadataManager::_submitRangeForDeletion() to check for range deletion tasks already scheduled on the same chunk range, and run the following resmoke.py invocation.

python buildscripts/resmoke.py --suite=sharding jstests/sharding/resumable_range_delete_step_down_then_up.js

diff --git a/jstests/sharding/resumable_range_delete_step_down_then_up.js b/jstests/sharding/resumable_range_delete_step_down_then_up.js
new file mode 100644
index 0000000000..5bda43b509
--- /dev/null
+++ b/jstests/sharding/resumable_range_delete_step_down_then_up.js
@@ -0,0 +1,51 @@
+(function() {
+"use strict";
+
+load("jstests/libs/fail_point_util.js");
+
+const st = new ShardingTest({mongos: 1, config: 1, shards: 2, rs: {nodes: 1}});
+const dbName = "test";
+const collName = "mycoll";
+const collection = st.s.getDB(dbName).getCollection(collName);
+
+// Create a sharded collection with two chunks on shard0, split at the key {x: -1}.
+assert.commandWorked(st.s.adminCommand({enableSharding: dbName}));
+assert.commandWorked(st.s.adminCommand({movePrimary: dbName, to: st.shard0.shardName}));
+assert.commandWorked(st.s.adminCommand({shardCollection: collection.getFullName(), key: {x: 1}}));
+assert.commandWorked(st.s.adminCommand({split: collection.getFullName(), middle: {x: -1}}));
+
+// Insert some documents into the collection, but only into the higher of the two chunks.
+assert.commandWorked(collection.insert(Array.from({length: 100}, (_, i) => ({_id: i, x: i}))));
+
+// Enable the failpoint to cause range deletion to hang indefinitely.
+const suspendRangeDeletionFailpoint = configureFailPoint(st.shard0, "suspendRangeDeletion");
+
+// Note we're using _waitForDelete=false so the moveChunk command can complete even though we're
+// intentionally causing range deletion to hang.
+assert.commandWorked(st.s.adminCommand({
+    moveChunk: collection.getFullName(),
+    find: {x: 1},
+    to: st.shard1.shardName,
+    _waitForDelete: false,
+}));
+
+jsTestLog("Waiting for the suspendRangeDeletion failpoint to be hit");
+suspendRangeDeletionFailpoint.wait();
+
+// We step the shard0 primary down and then back up again. On stepping up, the shard0 primary will
+// create a new range deletion task.
+const shard0Primary = st.rs0.getPrimary();
+assert.commandWorked(shard0Primary.adminCommand({
+    replSetStepDown: ReplSetTest.kForeverSecs,
+    force: true,
+}));
+
+st.rs0.waitForState(shard0Primary, ReplSetTest.State.SECONDARY);
+assert.commandWorked(shard0Primary.adminCommand({replSetFreeze: 0}));
+assert.commandWorked(shard0Primary.adminCommand({replSetStepUp: 1}));
+st.rs0.waitForState(shard0Primary, ReplSetTest.State.PRIMARY);
+
+suspendRangeDeletionFailpoint.off();
+
+st.stop();
+})();
diff --git a/src/mongo/db/s/metadata_manager.cpp b/src/mongo/db/s/metadata_manager.cpp
index 4523d1761d..4ae566e7ba 100644
--- a/src/mongo/db/s/metadata_manager.cpp
+++ b/src/mongo/db/s/metadata_manager.cpp
@@ -450,6 +450,11 @@ SharedSemiFuture<void> MetadataManager::_submitRangeForDeletion(
                                delayForActiveQueriesOnSecondariesToComplete,
                                Milliseconds(rangeDeleterBatchDelayMS.load()));
 
+    for (auto const& [alreadyScheduledRange, _] : _rangesScheduledForDeletion) {
+        invariant(range != alreadyScheduledRange,
+                  "attempted to schedule duplicate range deletion task");
+    }
+
     _rangesScheduledForDeletion.emplace_front(range, cleanupComplete);
     // Attach a continuation so that once the range has been deleted, we will remove the deletion
     // from the _rangesScheduledForDeletion.  std::list iterators are never invalidated, which

Sprint: Sharding 2020-03-09, Sharding 2020-03-23
Participants:
Linked BF Score: 0

 Description   

If the replica set shard primary steps down before it has processed the range deletion task for a chunk that is moving off the shard, and then steps back up very shortly afterwards, it is possible for multiple range deletion tasks to be scheduled for the same range and eventually run. This is problematic because the first task to complete removes the task's document from the config.rangeDeletions collection. Subsequent checks for whether the deletion of a range is scheduled are based on the contents of config.rangeDeletions (rather than on the in-memory state), so a chunk migration for an overlapping range could be accepted by the shard while a duplicate range deletion task is still outstanding. That range deleter task would then be able to remove documents that are actually owned by the shard.
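To make the persisted-state check concrete: the duplicate tasks are tracked in memory, but the "is a deletion for this range already scheduled?" question is answered from the config.rangeDeletions documents, roughly as in the shell sketch below. This is a minimal illustration in the mongo shell rather than the server's actual C++ code path; the field names (nss, range.min, range.max) follow the 4.4-era config.rangeDeletions schema, and the namespace and range values are placeholders.

// Minimal shell sketch (not the server implementation): decide whether a
// range deletion overlapping the requested range is already scheduled, using
// only the persisted config.rangeDeletions documents.
const rangeDeletions = db.getSiblingDB("config").rangeDeletions;

// Placeholder incoming range for "test.mycoll"; two [min, max) ranges overlap
// when each one's min sorts below the other's max.
const requestedRange = {min: {x: -1}, max: {x: MaxKey}};

const overlappingTask = rangeDeletions.findOne({
    nss: "test.mycoll",
    "range.min": {$lt: requestedRange.max},
    "range.max": {$gt: requestedRange.min},
});

// Once the first of the duplicate tasks completes and deletes its document,
// this lookup returns null, so an overlapping migration can be accepted even
// though another in-memory deletion task for the old range is still pending.
printjson(overlappingTask);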

Credit to esha.maharishi for finding this issue.



 Comments   
Comment by Githook User [ 28/Mar/20 ]

Author:

{'name': 'Max Hirschhorn', 'username': 'visemet', 'email': 'max.hirschhorn@mongodb.com'}

Message: SERVER-46395 Ensure range deletion task document exists during deletion.

Changes the range deletion task to guarantee a batch of documents can
only be removed while the corresponding range deletion task document
exists. If the range deletion task document doesn't exist or a stepdown
could lead to it not existing, then the range deletion task is
abandoned.

Changes the deletion of the range deletion task document to use the _id
index of the config.rangeDeletions collection rather than being a
collection scan.

(cherry picked from commit 90eefa051e6015514dcc6256d0f42b76bf041a76)
(cherry picked from commit 5e57707c9dc6e53ecf1dab94e37363a0e505ef8e)
Branch: v4.4
https://github.com/mongodb/mongo/commit/99244603baf17fcc9627852e86416b5cebca6451

Comment by Githook User [ 12/Mar/20 ]

Author:

{'name': 'Max Hirschhorn', 'username': 'visemet', 'email': 'max.hirschhorn@mongodb.com'}

Message: SERVER-46395 Fix unittest compile.
Branch: master
https://github.com/mongodb/mongo/commit/5e57707c9dc6e53ecf1dab94e37363a0e505ef8e

Comment by Githook User [ 11/Mar/20 ]

Author:

{'name': 'Max Hirschhorn', 'username': 'visemet', 'email': 'max.hirschhorn@mongodb.com'}

Message: SERVER-46395 Ensure range deletion task document exists during deletion.

Changes the range deletion task to guarantee a batch of documents can
only be removed while the corresponding range deletion task document
exists. If the range deletion task document doesn't exist or a stepdown
could lead to it not existing, then the range deletion task is
abandoned.

Changes the deletion of the range deletion task document to use the _id
index of the config.rangeDeletions collection rather than being a
collection scan.
Branch: master
https://github.com/mongodb/mongo/commit/90eefa051e6015514dcc6256d0f42b76bf041a76
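As a rough illustration of the last paragraph, the shell sketch below contrasts the two deletion patterns: matching the task document by non-_id fields (which requires a collection scan of config.rangeDeletions) versus matching it by _id (a single lookup on the _id index). This is only an approximation of the idea in shell syntax, not the server's C++ code, and the namespace, range, and task id values are placeholders.

// Before: removing the range deletion task document by matching non-indexed
// fields scans the whole config.rangeDeletions collection.
db.getSiblingDB("config").rangeDeletions.deleteOne({
    nss: "test.mycoll",
    "range.min": {x: -1},
    "range.max": {x: MaxKey},
});

// After: each task document is addressed by its _id, so the delete resolves
// with a single lookup on the _id index.
const taskId = UUID();  // placeholder; the real task carries its own _id
db.getSiblingDB("config").rangeDeletions.deleteOne({_id: taskId});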

Comment by Githook User [ 02/Mar/20 ]

Author:

{'username': 'visemet', 'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com'}

Message: SERVER-46395 Add joinAsync() method to ScopedTaskExecutor::Impl.

Introduces a joinAsync() method to the TaskExecutor interface to provide
a non-blocking way for callers to learn when the task executor has been
shut down and all outstanding callbacks have finished running.
Branch: master
https://github.com/mongodb/mongo/commit/82833b418676787209c7ce9fcc658470c478070e
