[SERVER-62175] Mongos fails to attach RetryableWrite Error Label For Command Interrupted In _parseCommand Created: 17/Dec/21  Updated: 29/Oct/23  Resolved: 10/May/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 5.3.0
Fix Version/s: 5.0.9, 4.4.15, 6.0.0-rc6, 6.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Luis Osta (Inactive) Assignee: Rachita Dhawan
Resolution: Fixed Votes: 0
Labels: sharding-nyc-subteam2
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File git.diff    
Issue Links:
Backports
Duplicate
is duplicated by SERVER-64046 Mongos does not return errorlabels if... Closed
Problem/Incident
causes SERVER-70518 Coverity analysis defect 122973: Dere... Closed
Related
related to SERVER-60347 Add retryable writes suite that perio... Backlog
is related to SERVER-53624 4.4 mongos does not attach RetryableW... Closed
is related to SERVER-55648 Mongos doesn't return top-level batch... Closed
Backwards Compatibility: Minor Change
Operating System: ALL
Backport Requested:
v6.0, v5.3, v5.0, v4.4
Steps To Reproduce:

The following resmoke invocation will execute the reproducing test once the git patch has been applied.

/buildscripts/resmoke.py run --suite=sharding jstests/sharding/mongos_insert_fails_with_shutdown.js 

The patch can be applied by executing the following command in the root of the repository.

git apply git.diff

diff --git a/jstests/sharding/mongos_insert_fails_with_shutdown.js b/jstests/sharding/mongos_insert_fails_with_shutdown.js
new file mode 100644
index 00000000000..ea21ee346c4
--- /dev/null
+++ b/jstests/sharding/mongos_insert_fails_with_shutdown.js
@@ -0,0 +1,35 @@
+/**
+ */
+ 
+ (function() {
+    "use strict";
+     
+    load("jstests/libs/fail_point_util.js");
+    load('jstests/libs/parallelTester.js');
+     
+    const st = new ShardingTest({
+        mongos: 1,
+        config: 1,
+        shards: 2,
+    });
+     
+    const hangBeforeCheckInterruptFailPoint = configureFailPoint(st.s, "hangBeforeCheckInterrupt");
+     
+    const dbName = "test";
+    const collName = "mycoll";
+     
+    const insertThread = new Thread(function insertDoc(host, dbName, collName) {
+        const conn = new Mongo(host);
+        const collection = conn.getDB(dbName).getCollection(collName);
+        const res = collection.insert({key: 1});
+        jsTest.log(`Inserted document with _id: ${tojson(res)}`);
+        assert.commandFailedWithCode(res, ErrorCodes.InterruptedAtShutdown);
+    }, st.s.host, dbName, collName);
+     
+    insertThread.start();
+    hangBeforeCheckInterruptFailPoint.wait();
+    st.stopMongos(0);
+    insertThread.join();
+     
+    st.stop();
+    })();
\ No newline at end of file
diff --git a/src/mongo/s/commands/strategy.cpp b/src/mongo/s/commands/strategy.cpp
index b9d903e6685..abb9a892a3c 100644
--- a/src/mongo/s/commands/strategy.cpp
+++ b/src/mongo/s/commands/strategy.cpp
@@ -27,6 +27,7 @@
  *    it in the license file.
  */
 
+#include "mongo/util/time_support.h"
 #define MONGO_LOGV2_DEFAULT_COMPONENT ::mongo::logv2::LogComponent::kSharding
 
 #include "mongo/platform/basic.h"
@@ -372,6 +373,7 @@ Future<void> ExecCommandClient::run() {
 }
 
 MONGO_FAIL_POINT_DEFINE(doNotRefreshShardsOnRetargettingError);
+MONGO_FAIL_POINT_DEFINE(hangBeforeCheckInterrupt);
 
 /**
  * Produces a future-chain that parses the command, runs the parsed command, and captures the result
@@ -485,7 +487,6 @@ void ParseAndRunCommand::_parseCommand() {
     const auto& m = _rec->getMessage();
     const auto& request = _rec->getRequest();
     auto replyBuilder = _rec->getReplyBuilder();
-
     auto const command = CommandHelpers::findCommand(_commandName);
     if (!command) {
         const std::string errorMsg = "no such cmd: {}"_format(_commandName);
@@ -528,6 +529,11 @@ void ParseAndRunCommand::_parseCommand() {
     if (maxTimeMS > 0 && command->getLogicalOp() != LogicalOp::opGetMore) {
         opCtx->setDeadlineAfterNowBy(Milliseconds{maxTimeMS}, ErrorCodes::MaxTimeMSExpired);
     }
+    if (_commandName == "insert") {
+        LOGV2(555555, "ABOUT TO PAUSE");
+        hangBeforeCheckInterrupt.pauseWhileSet();
+        LOGV2(555555, "PAST THE FAILPOINT");
+    }
     opCtx->checkForInterrupt();  // May trigger maxTimeAlwaysTimeOut fail point.
 
     // If the command includes a 'comment' field, set it on the current OpCtx.
diff --git a/src/mongo/s/mongos_main.cpp b/src/mongo/s/mongos_main.cpp
index 0d14e523838..534a1638ea9 100644
--- a/src/mongo/s/mongos_main.cpp
+++ b/src/mongo/s/mongos_main.cpp
@@ -27,6 +27,7 @@
  *    it in the license file.
  */
 
+#include "mongo/util/time_support.h"
 #define MONGO_LOGV2_DEFAULT_COMPONENT ::mongo::logv2::LogComponent::kSharding
 
 #include "mongo/platform/basic.h"
@@ -315,11 +316,13 @@ void cleanupTask(const ShutdownTaskArgs& shutdownArgs) {
 
         if (serviceContext) {
             serviceContext->setKillAllOperations();
-
+            LOGV2(55555, "ABOUT TO TURN OFF FAILPOINT");
+            globalFailPointRegistry().find("hangBeforeCheckInterrupt")->setMode(FailPoint::Mode::off, 0);
             if (MONGO_unlikely(pauseWhileKillingOperationsAtShutdown.shouldFail())) {
                 LOGV2(4701800, "pauseWhileKillingOperationsAtShutdown failpoint enabled");
                 sleepsecs(1);
             }
+            sleepsecs(3);
         }
 
         // Perform all shutdown operations after setKillAllOperations is called in order to ensure

Sprint: Sharding NYC 2022-04-04, Sharding NYC 2022-04-18, Sharding 2022-05-02, Sharding NYC 2022-05-16
Participants:
Story Points: 3

 Description   

This issue was originally discovered in the linked HELP ticket. It was found due to the shutdown that is required as part of version upgrades in Atlas.

The fundamental issue is due to how the ServiceEntryPoint logic in MongoS works.

When a command fails, it calls getErrorLabels in order to attach the appropriate information to the response. Relevant to our discussion is that it uses the sessionInformation in _osi to determine whether or not to attach the kRetryableWrite label.

But the problem is that _osi is emplaced after a call to checkForInterrupt.

Which results in the retryable write label not being attached to the response even though it should be.



 Comments   
Comment by Githook User [ 17/May/22 ]

Author:

{'name': 'Rachita Dhawan', 'email': 'rachita.dhawan@gmail.com', 'username': 'racdhawan'}

Message: SERVER-62175 Mongos fails to attach RetryableWrite Error Label For Command Interrupted In _parseCommand

(cherry picked from commit 38e4a5516d614a2bd1f3c3afde97c19068fd1441)
Branch: v6.0
https://github.com/mongodb/mongo/commit/ce47119ff1a1ab79f18235f1f42eeef39da6f20c

Comment by Githook User [ 11/May/22 ]

Author:

{'name': 'Rachita Dhawan', 'email': 'rachita.dhawan@gmail.com', 'username': 'racdhawan'}

Message: SERVER-62175 Mongos fails to attach RetryableWrite Error Label For Command Interrupted In _parseCommand
Branch: v5.0
https://github.com/mongodb/mongo/commit/cfd244ba1e0cb73af6ee8807961db74a38a097d0

Comment by Githook User [ 10/May/22 ]

Author:

{'name': 'Rachita Dhawan', 'email': 'rachita.dhawan@gmail.com', 'username': 'racdhawan'}

Message: SERVER-62175 Mongos fails to attach RetryableWrite Error Label For Command Interrupted In _parseCommand

backport from commit 38e4a5516d614a2bd1f3c3afde97c19068fd1441
Branch: v4.4
https://github.com/mongodb/mongo/commit/242dbac1c8b619f5ae0df17e494cb643ecf3decb

Comment by Githook User [ 04/May/22 ]

Author:

{'name': 'Rachita Dhawan', 'email': 'rachita.dhawan@gmail.com', 'username': 'racdhawan'}

Message: SERVER-62175 Mongos fails to attach RetryableWrite Error Label For Command Interrupted In _parseCommand
Branch: master
https://github.com/mongodb/mongo/commit/38e4a5516d614a2bd1f3c3afde97c19068fd1441

Comment by Jennifer Huang (Inactive) [ 12/Apr/22 ]

Thanks Lamont, I'll pass that on to the customer.

Comment by Lamont Nelson [ 11/Apr/22 ]

Hi jennifer.huang@mongodb.com we are currently starting work on this. We plan to backport the fix to 5.0 and 4.4. We should be able to fix this in master over the coming iteration, and the backports will take a couple days after that. Then the fix will have to be tested through our release candidate process for each version. Does this help?

Comment by Blake Oler [ 01/Mar/22 ]

Hey, after looking at this ticket, we've come squarely to the knowledge that Service Architecture doesn't own the underlying logic here. Passing back to Sharding.

Comment by Lingzhi Deng [ 01/Mar/22 ]

I think the expected behavior is that mongos should always be able to return the right error labels regardless of the stage a command fails at as it should have all the information it needs from the command to differentiate what the client request is and what error labels to attach. The fact that getErrorLabels label is not called in certain exit path seems a bug to me. And when getErrorLabels does get called, ideally mongos needs to be able to answer reliably these if clauses here and figure out the right error label to return.

Comment by Vojislav Stojkovic [ 01/Mar/22 ]

I looked at the code, and it doesn't look like getErrorLabels call that you linked to would be called if checkForInterrupt inside _parseCommand throws. As far as I can tell, if _parseCommand throws, the execution flow wouldn't go through the ParseAndRunCommand::RunInvocation::_tapOnError function at all.

Can you clarify what the expected behavior is?

Generated at Thu Feb 08 05:54:23 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.