[SERVER-57772] Failpoints on mongos rewrite state change error codes in writeConcernError Created: 17/Jun/21  Updated: 29/Oct/23  Resolved: 29/Jun/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.0.6, 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Kevin Albertson Assignee: Billy Donahue
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
related to CDRIVER-4022 Skip /WriteCommand/invalid_wc_server_... Closed
is related to SERVER-58920 Enable multiversion testing of rewrit... Closed
is related to SERVER-50549 Transform connection-related error co... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0, v4.4, v4.2, v4.0
Steps To Reproduce:

Start a sharded cluster version 5.0.0-alpha0-856-gf4e7955.

Using the shell, configure a failpoint on the "insert" command using an error code that represents a server state change.

var code = 91; // ShutdownInProgress
var cmd = {
    configureFailPoint: "failCommand",
    mode: {times: 1},
    data: {
        failCommands: ["insert"],
        writeConcernError: {code: code, errmsg: "Replication is being shut down"}
    }
};
db.adminCommand(cmd);
db.runCommand({insert: "coll", documents: [{x:1}]});

Results in:

{
        "n" : 1,
        "writeConcernError" : {
                "code" : 6,
                "errmsg" : "Replication is being shut down"
        },
        "ok" : 1,
        "$clusterTime" : {
                "clusterTime" : Timestamp(1623887155, 1),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        },
        "operationTime" : Timestamp(1623887155, 1)
}

The error code 91 was rewritten to 6 (HostUnreachable).
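
The mismatch can be checked mechanically. The sketch below is a minimal illustration in plain JavaScript, assuming the reply has the shape shown above; `failpointCodePreserved` is a hypothetical helper name, not part of the shell or any driver.

```javascript
// Hypothetical helper: returns true only when the reply carries a
// writeConcernError whose code equals the code configured on the failpoint.
function failpointCodePreserved(configuredCode, reply) {
    const wce = reply.writeConcernError;
    return wce !== undefined && wce.code === configuredCode;
}

// Trimmed copy of the reply from the repro steps ($clusterTime omitted).
const observedReply = {
    n: 1,
    writeConcernError: {code: 6, errmsg: "Replication is being shut down"},
    ok: 1
};

console.log(failpointCodePreserved(91, observedReply)); // false: 91 came back as 6
```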

Sprint: Service Arch 2021-06-28, Service Arch 2021-07-12
Participants:

 Description   

When a failpoint on mongos is configured to return a writeConcernError containing a state change error code, the code is rewritten to HostUnreachable (6). I think this is caused by the changes in SERVER-50549.

State change errors indicate the server has changed state (e.g. 91 = ShutdownInProgress or 10107 = NotWritablePrimary). Drivers document the state change errors they check for in the Server Discovery and Monitoring specification.
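
As a rough illustration of what a driver-side check might look like, the sketch below keeps a lookup table of state change codes. Only 91 and 10107 come from this ticket; the remaining entries are my recollection of the Server Discovery and Monitoring specification and should be verified against it. `STATE_CHANGE_CODES` and `isStateChangeError` are hypothetical names.

```javascript
// Assumed list of state change error codes. Only 91 and 10107 are named in
// this ticket; the rest are recalled from the SDAM spec and need verification.
const STATE_CHANGE_CODES = new Map([
    [91, "ShutdownInProgress"],
    [189, "PrimarySteppedDown"],
    [10107, "NotWritablePrimary"],
    [11600, "InterruptedAtShutdown"],
    [11602, "InterruptedDueToReplStateChange"],
    [13435, "NotPrimaryNoSecondaryOk"],
    [13436, "NotPrimaryOrSecondary"]
]);

// Hypothetical helper: true when the code is in the table above.
function isStateChangeError(code) {
    return STATE_CHANGE_CODES.has(code);
}

console.log(isStateChangeError(91)); // true
console.log(isStateChangeError(6));  // false: HostUnreachable is not in the table
```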

Drivers expect failpoints on mongos to return the errors exactly as they are configured. This enables the test scenario of a mongos returning a state change error itself (instead of rewriting one from a backing mongod).

The writeConcernError in particular only affects one test in the C driver, and is easy to work around. This is not blocking driver tests currently.



 Comments   
Comment by Githook User [ 08/Dec/21 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-57772 suppress state-change rewrite when `writeConcernError` is injected by `failCommand`.

(cherry picked from commit 7396af4803b0b9b729c457f54defca0c4c51b61f)
Branch: v5.0
https://github.com/mongodb/mongo/commit/1fd786804af6a5b3967c3cacf9fe1e23567569bb

Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 17/Sep/21 ]

Author:

{'name': 'Luis Osta', 'email': 'luis.osta@mongodb.com', 'username': 'LuisOsta'}

Message: SERVER-57772 Omit test from multiversion tests until backport
Branch: master
https://github.com/mongodb/mongo/commit/507c5c0a002849999f6f4861973f4f235be2fb4a

Comment by Githook User [ 29/Jun/21 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-57772 suppress state-change rewrite when `writeConcernError` is injected by `failCommand`.
Branch: master
https://github.com/mongodb/mongo/commit/7396af4803b0b9b729c457f54defca0c4c51b61f

Comment by Billy Donahue [ 24/Jun/21 ]

Code Review: https://mongodbcr.appspot.com/794680001/

Comment by Billy Donahue [ 24/Jun/21 ]

Oh there's an extra failCommand evaluation site I missed in the rebase to master!
I wish we had a rule that any FailPoint can only be evaluated in one place.

The fix is small.

 
diff --git a/src/mongo/s/commands/strategy.cpp b/src/mongo/s/commands/strategy.cpp
index 5e35533fff..b8b53d03ab 100644
--- a/src/mongo/s/commands/strategy.cpp
+++ b/src/mongo/s/commands/strategy.cpp
@@ -305,6 +305,15 @@ void ExecCommandClient::_epilogue() {
     if (_invocation->supportsWriteConcern()) {
         failCommand.executeIf(
             [&](const BSONObj& data) {
+                if (bool b; !bsonExtractBooleanField(data, "allowRewriteStateChange", &b).isOK() || !b)
+                    rpc::RewriteStateChangeErrors::setEnabled(opCtx, false);
                 result->getBodyBuilder().append(data["writeConcernError"]);
                 if (data.hasField(kErrorLabelsFieldName) &&
                     data[kErrorLabelsFieldName].type() == Array) {
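
Reading the diff above, the failpoint now disables the rewrite unless its data carries a boolean `allowRewriteStateChange` field set to true. The sketch below models that reading in plain JavaScript. `buildFailCommand` and `rewriteStaysEnabled` are hypothetical helpers, and the model treats only a literal `true` as opting in, which may be stricter than the server's boolean extraction.

```javascript
// Hypothetical helper: builds a failCommand document like the one in the
// repro steps. allowRewrite is optional; when omitted, the field is left out.
function buildFailCommand(commandName, code, errmsg, allowRewrite) {
    const data = {
        failCommands: [commandName],
        writeConcernError: {code: code, errmsg: errmsg}
    };
    if (allowRewrite !== undefined) {
        data.allowRewriteStateChange = allowRewrite;
    }
    return {configureFailPoint: "failCommand", mode: {times: 1}, data: data};
}

// Simplified model of the condition in the diff: rewriting stays enabled only
// when the field is present and true (assumption: literal boolean true).
function rewriteStaysEnabled(data) {
    return data.allowRewriteStateChange === true;
}

const plain = buildFailCommand("insert", 91, "Replication is being shut down");
const optIn = buildFailCommand("insert", 91, "Replication is being shut down", true);

console.log(rewriteStaysEnabled(plain.data)); // false: injected code 91 is returned as configured
console.log(rewriteStaysEnabled(optIn.data)); // true: rewriting still applies
```

In the shell, either document would be sent with db.adminCommand(...) as in the repro steps.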

Comment by Billy Donahue [ 23/Jun/21 ]

This does seem to be unexpected behavior.

I am wondering if the failCommand FailPoint was set on the mongos or on a mongod connected to it.
I can't tell what's going on with the cluster topology from the repro steps.
