[SERVER-60466] Support drivers gossiping signed $clusterTimes to replica set --shardsvrs before addShard is run Created: 05/Oct/21  Updated: 29/Oct/23  Resolved: 09/Jun/23

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 7.1.0-rc0, 7.0.0-rc6, 6.0.9, 5.0.21

Type: New Feature Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Cheahuychou Mao
Resolution: Fixed Votes: 7
Labels: invisiblesharding-m1, phase3, replace-atlas-proxy-w-mongoq, serverless-routing
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-64869 Test that replica set can be converte... Closed
Related
related to SERVER-32672 Standalone replica set shards reject ... Closed
related to SERVER-77994 Make cluster_time_across_add_shard.js... Closed
related to DOCS-15066 Update page for Convert a Replica Set... Closed
Assigned Teams:
Sharding NYC
Backwards Compatibility: Fully Compatible
Backport Requested:
v7.0, v6.0, v5.0
Sprint: Sharding 2021-11-15, Sharding 2021-11-29, Sharding 2021-12-13, Sharding 2021-12-27, Sharding 2022-01-10, Sharding 2022-01-24, Sharding NYC 2023-05-29, Sharding NYC 2023-06-12
Participants:
Case:

 Description   

The Convert a Replica Set to a Sharded Cluster flow has users take ordinary replica set members out of rotation and start them up again with --shardsvr. The user's application remains directly connected to the replica set during this step. A driver would have previously received signed $clusterTimes from the ordinary replica set members and will therefore attempt to gossip them back to the members after they've been started up again with --shardsvr.
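The driver-side gossiping behavior can be sketched in plain JavaScript (an illustrative model only, not an actual driver implementation; the class and method names are hypothetical): the driver tracks the highest signed $clusterTime it has seen in any server response and attaches it to subsequent command requests, which is why the restarted --shardsvr members keep receiving signed times they can no longer verify.

```javascript
// Illustrative model of driver-side $clusterTime gossiping (hypothetical names,
// not real driver code). Drivers remember the highest signed $clusterTime seen
// in any response and gossip it back on every subsequent command.
class ClusterTimeTracker {
    constructor() {
        this.signedClusterTime = null;
    }

    // Called with the $clusterTime document from each server response.
    advance(signed) {
        if (this.signedClusterTime === null ||
            signed.clusterTime > this.signedClusterTime.clusterTime) {
            this.signedClusterTime = signed;
        }
    }

    // Attach the highest known signed time to the next outgoing command.
    decorate(command) {
        if (this.signedClusterTime !== null) {
            command.$clusterTime = this.signedClusterTime;
        }
        return command;
    }
}
```

Once any member of the replica set has returned a signed $clusterTime, every later request carries it, including requests sent to members after they were restarted with --shardsvr.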

However, the behavior since MongoDB 3.6 has been to initialize the LogicalTimeValidator only after the addShard command is run for the replica set shard and the shardIdentity document has been inserted into the shard. In particular, the LogicalTimeValidator isn't initialized on startup for --shardsvrs which have yet to be added to the sharded cluster.

Because the LogicalTimeValidator is uninitialized on startup, the client receives a CannotVerifyAndSignLogicalTime error response for any command request that includes a signed $clusterTime. (Restarting ALL of the application servers would clear the signed $clusterTimes known to the MongoClient but would be disruptive to the user's environment.) We should instead have the replica set --shardsvr use its existing admin.system.keys collection to validate and sign new $clusterTimes.

{
    "ok" : 0,
    "errmsg" : "Cannot accept logicalTime: { ts: Timestamp(1633407217, 1) }. May not be a part of a sharded cluster",
    "code" : 210,
    "codeName" : "CannotVerifyAndSignLogicalTime"
}

Additionally, we should have the existing keys in the admin.system.keys collection remain available for validating $clusterTimes to avoid generating errors from the replica set shard immediately switching over to the keys in the admin.system.keys collection on the config server.

{
    "operationTime" : Timestamp(1633407272, 1),
    "ok" : 0,
    "errmsg" : "Cache Reader No keys found for HMAC that is valid for time: { ts: Timestamp(1633407217, 1) } with id: 7015346542735261700",
    "code" : 211,
    "codeName" : "KeyNotFound",
    "$gleStats" : {
        "lastOpTime" : Timestamp(0, 0),
        "electionId" : ObjectId("000000000000000000000000")
    },
    "lastCommittedOpTime" : Timestamp(1633407272, 1),
    "$configServerState" : {
        "opTime" : { "ts" : Timestamp(1633407273, 5), "t" : NumberLong(2) }
    },
    "$clusterTime" : {
        "clusterTime" : Timestamp(1633407273, 5),
        "signature" : {
            "hash" : BinData(0,"U6uLmfGZRh/Cs29KzMOM3WMV6J8="),
            "keyId" : NumberLong("7015430586655309838")
        }
    }
}
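One way to picture the fix for this second error (a hypothetical sketch, not the actual implementation; the key-document shape is simplified): keep the shard's pre-existing keys available alongside the config server's keys, so $clusterTimes signed before the conversion still find their signing key.

```javascript
// Hypothetical sketch: merge the shard's pre-existing signing keys with the
// config server's keys so that $clusterTimes signed before addShard was run
// still validate. Key shapes are simplified; real admin.system.keys documents
// differ.
function buildKeyCache(replicaSetKeys, configServerKeys) {
    const cache = new Map();
    for (const k of replicaSetKeys) cache.set(k.keyId, k.key);
    for (const k of configServerKeys) cache.set(k.keyId, k.key);
    return cache;
}

function canValidate(cache, signedTime) {
    // Without the merged cache, a time signed with an old replica set key
    // would fail with KeyNotFound (code 211), as in the response above.
    return cache.has(signedTime.keyId);
}
```

With a merged cache, both the old replica set key (7015346542735261700 in the example above) and the new config server key (7015430586655309838) remain usable for validation.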



 Comments   
Comment by Githook User [ 08/Aug/23 ]

Author:

{'name': 'Cheahuychou Mao', 'email': 'mao.cheahuychou@gmail.com', 'username': 'cheahuychou'}

Message: SERVER-60466 Remove references to config shard in cluster_time_across_add_shard.js

(cherry picked from commit c5e62e4b485531df278de434ae42bb3b2e8ab55b)
Branch: v5.0
https://github.com/mongodb/mongo/commit/210e1eb9c2d48d915a37a5654eda166a6b9394a5

Comment by Githook User [ 08/Aug/23 ]

Author:

{'name': 'Cheahuychou Mao', 'email': 'mao.cheahuychou@gmail.com', 'username': 'cheahuychou'}

Message: SERVER-60466 Support drivers gossiping signed $clusterTimes to shardsvr replica set before and after addShard is run

(cherry picked from commit cd0be4aab3f70b77f07c582f9e05cabdd0264c3f)
(cherry picked from commit 020fe38aad44906016ac06d5bf557cd02cd9ef3b)
(cherry picked from commit 142b591ad4949e14da89bc8990873b4a0c1f9204)
Branch: v5.0
https://github.com/mongodb/mongo/commit/eb5c71fe04f93332af89873a7109244126d7bb78

Comment by Githook User [ 08/Aug/23 ]

Author:

{'name': 'Cheahuychou Mao', 'email': 'mao.cheahuychou@gmail.com', 'username': 'cheahuychou'}

Message: SERVER-60466 Make KeysCollectionClientSharded support returning external keys

(cherry picked from commit 6d077630145b3b9f1618710b575ad3cbd94386a2)
(cherry picked from commit a8a6d7272c064758c2b1943a33c31731861d3fc2)
(cherry picked from commit baf1c46bf279fb29182329014cfc0df92ad2db3a)
Branch: v5.0
https://github.com/mongodb/mongo/commit/822f1d3793fe840c0dc1162c876031c9970ca4ce

Comment by Githook User [ 08/Aug/23 ]

Author:

{'name': 'Cheahuychou Mao', 'email': 'mao.cheahuychou@gmail.com', 'username': 'cheahuychou'}

Message: SERVER-60466 Move the helpers for creating and inserting external key documents into a util file

(cherry picked from commit 0c10ace040dd8cda65e3b0de5f6295e6f3f530c6)
(cherry picked from commit c0879fcf9a5f2129bf4e894131aa7ac8cbbd50c0)
(cherry picked from commit 4a5058bdda8c0b8a81b963c482bca3393480ad06)
Branch: v5.0
https://github.com/mongodb/mongo/commit/d95c4ac4e0c9a9affdca569b0a8812bbf87f8037

Comment by Githook User [ 07/Jul/23 ]

Author:

{'name': 'Cheahuychou Mao', 'email': 'mao.cheahuychou@gmail.com', 'username': 'cheahuychou'}

Message: SERVER-60466 Remove references to config shard in cluster_time_across_add_shard.js
Branch: v6.0
https://github.com/mongodb/mongo/commit/c5e62e4b485531df278de434ae42bb3b2e8ab55b

Comment by Githook User [ 06/Jul/23 ]

Author:

{'name': 'Cheahuychou Mao', 'email': 'mao.cheahuychou@gmail.com', 'username': 'cheahuychou'}

Message: SERVER-60466 Support drivers gossiping signed $clusterTimes to shardsvr replica set before and after addShard is run

(cherry picked from commit cd0be4aab3f70b77f07c582f9e05cabdd0264c3f)
(cherry picked from commit 020fe38aad44906016ac06d5bf557cd02cd9ef3b)
Branch: v6.0
https://github.com/mongodb/mongo/commit/142b591ad4949e14da89bc8990873b4a0c1f9204

Comment by Githook User [ 06/Jul/23 ]

Author:

{'name': 'Cheahuychou Mao', 'email': 'mao.cheahuychou@gmail.com', 'username': 'cheahuychou'}

Message: SERVER-60466 Make KeysCollectionClientSharded support returning external keys

(cherry picked from commit 6d077630145b3b9f1618710b575ad3cbd94386a2)
(cherry picked from commit a8a6d7272c064758c2b1943a33c31731861d3fc2)
Branch: v6.0
https://github.com/mongodb/mongo/commit/baf1c46bf279fb29182329014cfc0df92ad2db3a

Comment by Githook User [ 06/Jul/23 ]

Author:

{'name': 'Cheahuychou Mao', 'email': 'mao.cheahuychou@gmail.com', 'username': 'cheahuychou'}

Message: SERVER-60466 Move the helpers for creating and inserting external key documents into a util file

(cherry picked from commit 0c10ace040dd8cda65e3b0de5f6295e6f3f530c6)
(cherry picked from commit c0879fcf9a5f2129bf4e894131aa7ac8cbbd50c0)
Branch: v6.0
https://github.com/mongodb/mongo/commit/4a5058bdda8c0b8a81b963c482bca3393480ad06

Comment by Githook User [ 26/Jun/23 ]

Author:

{'name': 'Cheahuychou Mao', 'email': 'mao.cheahuychou@gmail.com', 'username': 'cheahuychou'}

Message: SERVER-60466 Support drivers gossiping signed $clusterTimes to shardsvr replica set before and after addShard is run

(cherry picked from commit cd0be4aab3f70b77f07c582f9e05cabdd0264c3f)
Branch: v7.0
https://github.com/mongodb/mongo/commit/020fe38aad44906016ac06d5bf557cd02cd9ef3b

Comment by Githook User [ 26/Jun/23 ]

Author:

{'name': 'Cheahuychou Mao', 'email': 'mao.cheahuychou@gmail.com', 'username': 'cheahuychou'}

Message: SERVER-60466 Make KeysCollectionClientSharded support returning external keys

(cherry picked from commit 6d077630145b3b9f1618710b575ad3cbd94386a2)
Branch: v7.0
https://github.com/mongodb/mongo/commit/a8a6d7272c064758c2b1943a33c31731861d3fc2

Comment by Githook User [ 26/Jun/23 ]

Author:

{'name': 'Cheahuychou Mao', 'email': 'mao.cheahuychou@gmail.com', 'username': 'cheahuychou'}

Message: SERVER-60466 Move the helpers for creating and inserting external key documents into a util file

(cherry picked from commit 0c10ace040dd8cda65e3b0de5f6295e6f3f530c6)
Branch: v7.0
https://github.com/mongodb/mongo/commit/c0879fcf9a5f2129bf4e894131aa7ac8cbbd50c0

Comment by Githook User [ 09/Jun/23 ]

Author:

{'name': 'Cheahuychou Mao', 'email': 'mao.cheahuychou@gmail.com', 'username': 'cheahuychou'}

Message: SERVER-60466 Support drivers gossiping signed $clusterTimes to shardsvr replica set before and after addShard is run
Branch: master
https://github.com/mongodb/mongo/commit/cd0be4aab3f70b77f07c582f9e05cabdd0264c3f

Comment by Githook User [ 08/Jun/23 ]

Author:

{'name': 'Cheahuychou Mao', 'email': 'mao.cheahuychou@gmail.com', 'username': 'cheahuychou'}

Message: SERVER-60466 Make KeysCollectionClientSharded support returning external keys
Branch: master
https://github.com/mongodb/mongo/commit/6d077630145b3b9f1618710b575ad3cbd94386a2

Comment by Githook User [ 07/Jun/23 ]

Author:

{'name': 'Cheahuychou Mao', 'email': 'mao.cheahuychou@gmail.com', 'username': 'cheahuychou'}

Message: SERVER-60466 Move the helpers for creating and inserting external key documents into a util file
Branch: master
https://github.com/mongodb/mongo/commit/0c10ace040dd8cda65e3b0de5f6295e6f3f530c6

Comment by Max Hirschhorn [ 25/Jan/22 ]

I put together the following JavaScript test to demonstrate the time window in which an application would receive CannotVerifyAndSignLogicalTime errors until either (a) the application servers are all restarted or (b) the addShard command is run by the operator.

The JavaScript test asserts the currently implemented semantics (and passes) because I wanted to cover both the CannotVerifyAndSignLogicalTime error and the KeyNotFound error. Hopefully the assertUndesiredBehavior variable and test comments clarify the details for anyone who is curious.

I also filed DOCS-15066 to have the MongoDB Manual updated to inform users of the application downtime when converting a replica set to a sharded cluster.

python buildscripts/resmoke.py run --suite=sharding repro_server60466.js

repro_server60466.js

(function() {
"use strict";
 
load("jstests/libs/fail_point_util.js");
load("jstests/multiVersion/libs/multi_rs.js");
 
// assertUndesiredBehavior is an alias to better indicate which parts of the behavior are expected
// to change after the issue described in SERVER-60466 is addressed.
const assertUndesiredBehavior = assert;
 
// We start a server with keyfile authentication enabled so the server will return signed
// $clusterTime values in its responses.
const numNodes = 3;
const keyFile = "jstests/libs/key1";
const rst = new ReplSetTest({nodes: numNodes, keyFile});
 
rst.startSet();
rst.initiate();
 
// We then create a user for running commands later on in the test. The choice of the "root" role is
// somewhat arbitrary and the important detail is the user lacks the advanceClusterTime privilege.
// This ensures the server won't return $clusterTime values signed with a dummy key.
(function createUser() {
    const primary = rst.getPrimary();
    primary.getDB("admin").createUser({user: "root", pwd: "root", roles: ["root"]},
                                      {w: rst.nodes.length});
})();
 
function authUser(conn) {
    assert(conn.getDB("admin").auth("root", "root"));
}
 
const userSessions = rst.nodes.map(node => {
    const conn = new Mongo(node.host);
    authUser(conn);
    return conn.startSession({causalConsistency: false, retryWrites: false});
});
 
function findSessionForPrimary() {
    const primary = rst.getPrimary();
    return userSessions.find(session => session.getClient().host === primary.host);
}
 
const clusterTime = (() => {
    const primarySession = findSessionForPrimary();
    assert.commandWorked(primarySession.getDatabase("test").getCollection("mycoll").insert({}));
    return primarySession.getClusterTime();
})();
 
for (let session of userSessions) {
    session.advanceClusterTime(clusterTime);
    assert.commandWorked(session.getDatabase("admin").runCommand("hello"));
}
 
// ReplSetTest.prototype.upgradeSet() is awkward to use when authentication is enabled. Some past
// work in jstests/multiVersion/load_keys_on_upgrade.js found it simplest to define the
// authentication settings on TestData so Mongo.prototype.getDB() takes care of re-authenticating
// after the network connection has been re-established.
function withTemporaryTestData(callback, mods = {}) {
    const original = TestData;
    try {
        TestData = Object.assign({}, TestData, mods);
        callback();
    } finally {
        TestData = original;
    }
}
 
withTemporaryTestData(() => {
    rst.upgradeSet({shardsvr: "", appendOptions: true});
}, {
    auth: true,
    keyFile,
    authUser: "__system",
    keyFileData: "foopdedoop",
    authenticationDatabase: "local"
});
 
for (let session of userSessions) {
    // Reconnect and re-authenticate after the network connection was closed from the server process
    // being restarted.
    const error = assert.throws(() => session.getDatabase("admin").runCommand("hello"));
    assert(isNetworkError(error), error);
    authUser(session.getClient());
 
    // Until the addShard command is run, a --shardsvr will return a CannotVerifyAndSignLogicalTime
    // error response to an application using a signed cluster time.
    assertUndesiredBehavior.commandFailedWithCode(session.getDatabase("admin").runCommand("hello"),
                                                  ErrorCodes.CannotVerifyAndSignLogicalTime);
    assertUndesiredBehavior.commandFailedWithCode(session.getDatabase("admin").runCommand("hello"),
                                                  ErrorCodes.CannotVerifyAndSignLogicalTime);
}
 
// Restarting the application is one way to address the CannotVerifyAndSignLogicalTime error because
// it'll clear the signed $clusterTime values the driver would be sending to the server.
{
    const primary = new Mongo(rst.getPrimary().host);
    assert(primary.getDB("admin").auth("root", "root"));
 
    const primarySession = primary.startSession({causalConsistency: false, retryWrites: false});
    assert.commandWorked(primarySession.getDatabase("test").getCollection("mycoll").insert({}));
    assert.commandWorked(primarySession.getDatabase("test").getCollection("mycoll").insert({}));
 
    assert.eq(undefined, primarySession.getClusterTime());
}
 
const st = new ShardingTest({mongos: 1, config: 1, shards: 0, other: {keyFile}});
 
assert.commandWorked(st.s.adminCommand({addShard: rst.getURL()}));
rst.awaitReplication();
 
for (let session of userSessions) {
    // As a performance optimization, LogicalTimeValidator::validate() skips validating $clusterTime
    // values which have a $clusterTime.clusterTime value smaller than the currently known signed
    // $clusterTime value. It is possible (but not strictly guaranteed) for internal communication
    // to have already happened between cluster members such that they all know about a signed
    // $clusterTime value. This signed $clusterTime value would come from the new signing key
    // generated by the config server primary. We use the alwaysValidateClientsClusterTime failpoint
    // to simulate the case in which the internal communication carrying a signed $clusterTime
    // value hasn't happened yet.
    const fp = (() => {
        const fpConn = new Mongo(session.getClient().host);
        authUser(fpConn);
        return configureFailPoint(fpConn, "alwaysValidateClientsClusterTime");
    })();
 
    // The KeyNotFound error response from the server only happens once because the session advances
    // its notion of the $clusterTime value from the error response. The $clusterTime value in the
    // error response will have been signed with the new signing key generated by the config server
    // primary.
    assertUndesiredBehavior.commandFailedWithCode(session.getDatabase("admin").runCommand("hello"),
                                                  ErrorCodes.KeyNotFound);
 
    assert.commandWorked(session.getDatabase("admin").runCommand("hello"));
    assert.commandWorked(session.getDatabase("admin").runCommand("hello"));
 
    fp.off();
}
 
st.stop();
rst.stopSet();
})();

Comment by Thomas Danielsson [ 21/Oct/21 ]

Hi,

We were in the process of migrating our dev replica set to a sharded cluster and were hit by this issue, which caused all applications to lose their connection to the replica set and forced us to revert --shardsvr on all nodes.

Is there any workaround to perform the "Convert a Replica Set to a Sharded Cluster" flow without all applications losing their connections? For example, could the MongoDB user used by the application be temporarily granted elevated permissions during the conversion phase? We noticed our super-user didn't seem to be affected by this issue.

 

Comment by Max Hirschhorn [ 05/Oct/21 ]

As part of the work on this ticket we should add a version of the convert_to_and_from_sharded.js test which runs with auth enabled and uses a non-__system user. Note that while convert_to_and_from_sharded.js already runs in the sharding_auth.yml test suite, the auth passthrough suites use the __system user, and clients with the advanceClusterTime privilege won't suffer from the CannotVerifyAndSignLogicalTime problem. See the preamble of the causally_consistent_jscore_passthrough_auth.yml test suite for an example of how this other auth user could be configured.

Generated at Thu Feb 08 05:49:52 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.