[SERVER-73916] Improve ReshardingTest fixture error reporting when reshardCollection has already failed before any failpoints are waited on Created: 11/Feb/23  Updated: 29/Oct/23  Resolved: 14/Feb/23

Status: Closed
Project: Core Server
Component/s: Sharding, Testing Infrastructure
Affects Version/s: None
Fix Version/s: 5.0.15, 7.0.0-rc0, 6.0.5

Type: Improvement Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Max Hirschhorn
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
is related to SERVER-61985 resharding_coordinator_recovers_abort... Closed
is related to SERVER-65201 De-emphasize errors which are ignored... Closed
Assigned Teams:
Sharding NYC
Backwards Compatibility: Fully Compatible
Backport Requested:
v6.0, v5.0
Sprint: Sharding NYC 2023-02-20
Participants:
Linked BF Score: 49

 Description   

To avoid hanging or crashing the mongo shell process, the ReshardingTest fixture goes through some lengths to interrupt the reshardCollection command on mongos and join the background thread in the mongo shell which was running the reshardCollection command. However, after the changes from 0d5fd57 as part of SERVER-61985, the reshardCollection command may be blocked on the reshardingPauseCoordinatorBeforeCompletion failpoint without having waited on the interceding reshardingPauseCoordinatorBeforeBlockingWrites and reshardingPauseCoordinatorBeforeDecisionPersisted failpoints. This means the _commandDoneSignal won't be decremented to break the ReshardingTest fixture out of _waitForFailPoint(). Instead the mongo shell reports a scenario where the reshardCollection will have definitively failed as an assert.soon() failpoint of not reaching the expected failpoint within the allotted time.



 Comments   
Comment by Githook User [ 15/Feb/23 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-73916 Improve error in ReshardingTest on failpoint not reached.

Enables the ReshardingTest fixture to use the
reshardingPauseCoordinatorBeforeCompletion failpoint being hit earlier
than expected as an indication the other resharding-related failpoints
won't ever be reached.

(cherry picked from commit 75ff482b93060c1c8a28f418dc55d16c4fcc02b6)
Branch: v6.0
https://github.com/mongodb/mongo/commit/e6158e24d738c37a00ce458ccdb4ad78ae4c8d8d

Comment by Githook User [ 15/Feb/23 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-73916 Improve error in ReshardingTest on failpoint not reached.

Enables the ReshardingTest fixture to use the
reshardingPauseCoordinatorBeforeCompletion failpoint being hit earlier
than expected as an indication the other resharding-related failpoints
won't ever be reached.

(cherry picked from commit 75ff482b93060c1c8a28f418dc55d16c4fcc02b6)
Branch: v5.0
https://github.com/mongodb/mongo/commit/1b7dd95a8f205c7daa2034790b2e973f3542b4a0

Comment by Githook User [ 14/Feb/23 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-73916 Improve error in ReshardingTest on failpoint not reached.

Enables the ReshardingTest fixture to use the
reshardingPauseCoordinatorBeforeCompletion failpoint being hit earlier
than expected as an indication the other resharding-related failpoints
won't ever be reached.
Branch: master
https://github.com/mongodb/mongo/commit/75ff482b93060c1c8a28f418dc55d16c4fcc02b6

Comment by Abdul Qadeer [ 14/Feb/23 ]

Yes it does! Thank you for clarifying.

Comment by Max Hirschhorn [ 13/Feb/23 ]

The waitForFailPoint command permits waiting for failpoints which are currently disabled. It won't return a "failpoint not found" error. For example, the following additional calls to the waitForFailPoint command all succeed because the timesEntered check is based on when the failpoint was first turned on and no the last observed value for the number of times the failpoint has been entered so far. Does this cover the case you were imagining?

diff --git a/jstests/sharding/libs/resharding_test_fixture.js b/jstests/sharding/libs/resharding_test_fixture.js
index 283f6a91039..99703f1227d 100644
--- a/jstests/sharding/libs/resharding_test_fixture.js
+++ b/jstests/sharding/libs/resharding_test_fixture.js
@@ -537,6 +537,9 @@ var ReshardingTest = class {
 
                 if (completionFailpoint !== fp && completionFailpoint.waitWithTimeout(1000)) {
                     completionFailpoint.off();
+                    completionFailpoint.waitWithTimeout(1000);
+                    completionFailpoint.waitWithTimeout(1000);
+                    completionFailpoint.waitWithTimeout(1000);
                 }
 
                 return false;

Comment by Abdul Qadeer [ 13/Feb/23 ]

The idea looks good to me but in the patch it is possible that completionFailpoint.waitWithTimeout(1000) is called twice (_commandDoneSignal.getCount() is still 1 in the retry of assert and

fp.waitWithTimeout(1000) returns false) and the second invocation may fail with "failpoint not found" error as we turn it off already the first time.

Comment by Max Hirschhorn [ 11/Feb/23 ]

One idea would be to check whether the ReshardingCoordinator is blocked on the reshardingPauseCoordinatorBeforeCompletion failpoint and disable it to allow the _commandDoneSignal to be decremented. What do you think abdul.qadeer@mongodb.com?

diff --git a/jstests/sharding/libs/resharding_test_fixture.js b/jstests/sharding/libs/resharding_test_fixture.js
index d4820f592c6..283f6a91039 100644
--- a/jstests/sharding/libs/resharding_test_fixture.js
+++ b/jstests/sharding/libs/resharding_test_fixture.js
@@ -515,7 +515,9 @@ var ReshardingTest = class {
      * proceeding to the next stage. This helper returns after either:
      *
      * 1) The node's waitForFailPoint returns successfully or
-     * 2) The `reshardCollection` command has returned a response.
+     * 2) The `reshardCollection` command has returned a response or
+     * 3) The ReshardingCoordinator is blocked on the reshardingPauseCoordinatorBeforeCompletion
+     *    failpoint and won't ever satisfy the supplied failpoint.
      *
      * The function returns true when we returned because the server reached the failpoint. The
      * function returns false when the `reshardCollection` command is no longer running.
@@ -524,9 +526,20 @@ var ReshardingTest = class {
      * @private
      */
     _waitForFailPoint(fp) {
+        const completionFailpoint = this._pauseCoordinatorBeforeCompletionFailpoints.find(
+            completionFailpoint => completionFailpoint.conn.host === fp.conn.host);
+
         assert.soon(
             () => {
-                return this._commandDoneSignal.getCount() === 0 || fp.waitWithTimeout(1000);
+                if (this._commandDoneSignal.getCount() === 0 || fp.waitWithTimeout(1000)) {
+                    return true;
+                }
+
+                if (completionFailpoint !== fp && completionFailpoint.waitWithTimeout(1000)) {
+                    completionFailpoint.off();
+                }
+
+                return false;
             },
             "Timed out waiting for failpoint to be hit. Failpoint: " + fp.failPointName,
             undefined,
diff --git a/src/mongo/db/s/resharding/resharding_coordinator_service.cpp b/src/mongo/db/s/resharding/resharding_coordinator_service.cpp
index e57ec94edae..2fe85a02ad0 100644
--- a/src/mongo/db/s/resharding/resharding_coordinator_service.cpp
+++ b/src/mongo/db/s/resharding/resharding_coordinator_service.cpp
@@ -2037,6 +2037,9 @@ ExecutorFuture<void> ReshardingCoordinator::_awaitAllRecipientsFinishedApplying(
                 });
         })
         .then([this, executor] {
+            uasserted(ErrorCodes::InternalError,
+                      "Faking an error happening before blocking writes");
+
             {
                 auto opCtx = _cancelableOpCtxFactory->makeOperationContext(&cc());
                 reshardingPauseCoordinatorBeforeBlockingWrites.pauseWhileSetAndNotCanceled(

Generated at Thu Feb 08 06:25:59 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.