[SERVER-62217] inject_tenant_prefix.js override may not wait for newly elected primary after tenant migration commits Created: 22/Dec/21  Updated: 29/Oct/23  Resolved: 21/Jan/22

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 5.3.0

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Hugh Tong (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.2, v5.0
Sprint: Server Serverless 2022-01-10, Server Serverless 2022-01-24
Participants:
Linked BF Score: 15

 Description   

The inject_tenant_prefix.js override uses a Mongo.prototype.runCommand() override to send commands over the reroutingMongo replica set connection. However, Mongo.prototype._markNodeAsFailed() isn't overridden by this file, so when a command fails with a retryable error from the reroutingMongo replica set connection, _markNodeAsFailed() is still called on the original replica set connection. The ReplicaSetMonitor (RSM) for the original replica set connection ignores the error and won't attempt to discover a new primary because the failed host isn't actually part of that replica set. As a result, the JavaScript test keeps retrying unsuccessfully against the stale primary of the rerouting replica set connection and ultimately fails.

[js_test:index_check6] | 2021-12-17T13:08:08.579Z I  NETWORK  4712102 [js] "Host failed in replica set","attr":{"replicaSet":"rs0","host":"localhost:20753","error":{"code":91,"codeName":"ShutdownInProgress","errmsg":"Replication is being shut down"},"action":{"dropConnections":true,"requestImmediateCheck":true,"outcome":{"host":"localhost:20753","success":false,"errorMessage":"ShutdownInProgress: Replication is being shut down"}}}
[js_test:index_check6] | 2021-12-17T13:08:08.579Z I  CONNPOOL 22572   [js] "Dropping all pooled connections","attr":{"hostAndPort":"localhost:20753","error":"PooledConnectionsDropped: Pooled connections dropped"}
[js_test:index_check6] | 2021-12-17T13:08:08.579Z I  -        4333227 [js] "RSM monitoring host in expedited mode until we detect a primary","attr":{"host":"localhost:20751","replicaSet":"rs0"}
[js_test:index_check6] | 2021-12-17T13:08:08.579Z I  -        4333227 [js] "RSM monitoring host in expedited mode until we detect a primary","attr":{"host":"localhost:20752","replicaSet":"rs0"}
[js_test:index_check6] | 2021-12-17T13:08:08.579Z I  -        4333227 [js] "RSM monitoring host in expedited mode until we detect a primary","attr":{"host":"localhost:20750","replicaSet":"rs0"}
[js_test:index_check6] | 2021-12-17T13:08:08.580Z I  NETWORK  20219   [js] "Ignoring isMaster reply from server that is not in the topology","attr":{"serverAddress":"localhost:20753"}
...
[j3:rs1:n0] | 2021-12-17T13:08:08.575+00:00 I  REPL     21339   [conn102] "Replication failed for write concern","attr":{"status":"ShutdownInProgress: Replication is being shut down","writeConcern":{"w":"majority","j":true,"wtimeout":300321,"provenance":"clientSupplied"},"opTime":{"ts":{"$timestamp":{"t":1639746488,"i":19}},"t":7},"opID":4283,"allDurable":{"$timestamp":{"t":1639746488,"i":19}},"progress":{"lastCommittedOpTime":{"ts":{"$timestamp":{"t":1639746488,"i":17}},"t":7},"currentCommittedSnapshotOpTime":{"ts":{"$timestamp":{"t":1639746488,"i":17}},"t":7},"replicationProgress":[{"host":"localhost:20753","optime":{"ts":{"$timestamp":{"t":1639746488,"i":19}},"t":7},"lastAppliedOpTime":{"ts":{"$timestamp":{"t":1639746488,"i":19}},"t":7},"heartbeatAppliedOpTime":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1},"heartbeatDurableOpTime":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1},"memberId":0},{"host":"localhost:20754","optime":{"ts":{"$timestamp":{"t":1639746488,"i":17}},"t":7},"lastAppliedOpTime":{"ts":{"$timestamp":{"t":1639746488,"i":19}},"t":7},"heartbeatAppliedOpTime":{"ts":{"$timestamp":{"t":1639746487,"i":7}},"t":7},"heartbeatDurableOpTime":{"ts":{"$timestamp":{"t":1639746487,"i":7}},"t":7},"memberId":1},{"host":"localhost:20755","optime":{"ts":{"$timestamp":{"t":1639746488,"i":17}},"t":7},"lastAppliedOpTime":{"ts":{"$timestamp":{"t":1639746488,"i":17}},"t":7},"heartbeatAppliedOpTime":{"ts":{"$timestamp":{"t":1639746487,"i":7}},"t":7},"heartbeatDurableOpTime":{"ts":{"$timestamp":{"t":1639746487,"i":7}},"t":7},"memberId":2}]}}

https://evergreen.mongodb.com/lobster/build/aae27bfca51286989a2117a9a5610238/all#bookmarks=0%2C50656%2C50659%2C50664%2C50691%2C50694%2C50698%2C50723%2C50727%2C50728%2C50729%2C50731%2C50734%2C50738%2C50767%2C176605&f~=000~%5C%5Bjs_test%3Aindex_check6%5C%5D&f~=000~ShutdownInProgress&shareLine=50656
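
For illustration, here is a minimal sketch of the kind of _markNodeAsFailed() forwarding that would address this. This is not the committed patch; it assumes the override keeps the rerouting connection on a reroutingMongo property of the original connection and that _markNodeAsFailed() takes (hostAndPort, errorCode, errorMessage) — both are assumptions about the override's internals.

(function() {
"use strict";

const originalMarkNodeAsFailed = Mongo.prototype._markNodeAsFailed;

Mongo.prototype._markNodeAsFailed = function(hostAndPort, errorCode, errorMessage) {
    if (this.reroutingMongo) {
        // Report the failure against the replica set the command was actually routed to,
        // so its ReplicaSetMonitor drops the stale primary and starts looking for the
        // newly elected one, instead of ignoring a host that isn't in its topology.
        return this.reroutingMongo._markNodeAsFailed(hostAndPort, errorCode, errorMessage);
    }
    return originalMarkNodeAsFailed.apply(this, arguments);
};
})();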



 Comments   
Comment by Githook User [ 03/Feb/22 ]

Author:

{'name': 'Hugh Tong', 'email': 'hugh.tong@mongodb.com', 'username': 'cortrain'}

Message: SERVER-62217 Use runCommand override to create _markNodeAsFailed override
Branch: v5.2
https://github.com/mongodb/mongo/commit/0c136160c8c958b56a53b05da63201811eaeac50

Comment by Hugh Tong (Inactive) [ 18/Jan/22 ]

Completed on 5.3.0

Comment by Githook User [ 14/Jan/22 ]

Author:

{'name': 'Hugh Tong', 'email': 'hugh.tong@mongodb.com', 'username': 'cortrain'}

Message: SERVER-62217 Use runCommand override to create _markNodeAsFailed override
Branch: master
https://github.com/mongodb/mongo/commit/2c2b30be40afa01b09440ec72626c26a3cb4b9ed

Comment by Hugh Tong (Inactive) [ 13/Jan/22 ]

Created an override in inject_tenant_prefix.js and ran Evergreen tests. In one instance of the tenant_migration_terminate_primary_jscore_passthrough test suite, we can see the following sequence of events:

[j0:rs1:n2] | 2022-01-13T06:11:15.933+00:00 D4 TXN      23984   [conn100] "New transaction started","attr":{"txnNumberAndRetryCounter":{"txnNumber":784,"txnRetryCounter":-1},"lsid":{"uuid":{"$uuid":"ee829ef7-e037-4b7d-980a-77adeec56c78"}},"apiParameters":{}}
 
[js_test:index_check6] [jsTest] Retrying insert with original command request: {
[js_test:index_check6] [jsTest]     "insert" : "jstests_index_check6_1",
[js_test:index_check6] [jsTest]     "documents" : [
[js_test:index_check6] [jsTest]         {
[js_test:index_check6] [jsTest]             "_id" : ObjectId("61dfc4251cc12f30c01e97cb"),
[js_test:index_check6] [jsTest]             "a" : 4,
[js_test:index_check6] [jsTest]             "b" : 8,
[js_test:index_check6] [jsTest]             "c" : 4
[js_test:index_check6] [jsTest]         }
[js_test:index_check6] [jsTest]     ],
[js_test:index_check6] [jsTest]     "ordered" : true,
[js_test:index_check6] [jsTest]     "txnNumber" : NumberLong(784),
[js_test:index_check6] [jsTest]     "lsid" : {
[js_test:index_check6] [jsTest]         "id" : UUID("44b1d8e8-93e8-4b6c-92f5-08d3668f1074")
[js_test:index_check6] [jsTest]     },
[js_test:index_check6] [jsTest]     "$clusterTime" : {
[js_test:index_check6] [jsTest]         "clusterTime" : Timestamp(1642054693, 9),
[js_test:index_check6] [jsTest]         "signature" : {
[js_test:index_check6] [jsTest]             "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
[js_test:index_check6] [jsTest]             "keyId" : NumberLong(0)
[js_test:index_check6] [jsTest]         }
[js_test:index_check6] [jsTest]     }
[js_test:index_check6] [jsTest] } due to retryable error (code=10107), subsequent retries remaining: 2
 
[j0:rs1:n0] | 2022-01-13T06:18:13.965+00:00 D4 TXN      23984   [conn53] "New transaction started","attr":{"txnNumberAndRetryCounter":{"txnNumber":784,"txnRetryCounter":-1},"lsid":{"uuid":{"$uuid":"44b1d8e8-93e8-4b6c-92f5-08d3668f1074"}},"apiParameters":{}}
 
[j0:rs1:n0] | 2022-01-13T06:18:15.331+00:00 I  COMMAND  51803   [conn53] "Slow query","attr":{"type":"command","ns":"tenantMigrationTenantId_test.jstests_index_check6_1","appName":"MongoDB Shell","command":{"insert":"jstests_index_check6_1","ordered":true,"txnNumber":784,"lsid":{"id":{"$uuid":"44b1d8e8-93e8-4b6c-92f5-08d3668f1074"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1642054693,"i":9}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"writeConcern":{"w":"majority","wtimeout":300321},"comment":{"isCmdObjWithTenantId":true},"$db":"tenantMigrationTenantId_test"},"ninserted":1,"keysInserted":1,"numYields":0,"queryHash":"0ACF498E","planCacheKey":"0ACF498E","reslen":230,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":3}},"ReplicationStateTransition":{"acquireCount":{"w":5}},"Global":{"acquireCount":{"r":3,"w":2}},"Database":{"acquireCount":{"w":2}},"Collection":{"acquireCount":{"w":2}},"Mutex":{"acquireCount":{"r":3}}},"flowControl":{"acquireCount":1,"timeAcquiringMicros":5},"readConcern":{"provenance":"implicitDefault"},"writeConcern":{"w":"majority","wtimeout":300321,"provenance":"clientSupplied"},"storage":{},"remote":"127.0.0.1:41446","protocol":"op_msg","durationMillis":1377}

In this test, we see an insert ("txnNumber":784) applied on j0:rs1:n2, which then fails with a retryable NotWritablePrimary error (code 10107) during a primary stepdown on rs1. If _markNodeAsFailed() works correctly, j0:rs1:n2 should be marked as failed and the operation retried on the new primary in the next attempt with the same txnNumber. That is what we see: following the failure, the operation is applied on j0:rs1:n0 ("txnNumber":784), and the subsequent Slow Query entry with "ninserted":1 indicates a successful insert.
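
For context, the retry flow the verification above relies on looks roughly like the following. This is an illustrative sketch, not the passthrough override code; isRetryableCode(), runWithRetries(), and the explicit failedHost parameter are hypothetical stand-ins for logic the shell overrides implement internally.

// Error codes treated as retryable in this sketch: NotWritablePrimary (10107),
// ShutdownInProgress (91), PrimarySteppedDown (189), and
// InterruptedDueToReplStateChange (11602).
function isRetryableCode(code) {
    return [10107, 91, 189, 11602].includes(code);
}

function runWithRetries(conn, dbName, cmdObj, failedHost, numRetries) {
    let res;
    for (let attempt = 0; attempt <= numRetries; ++attempt) {
        res = conn.runCommand(dbName, cmdObj, 0);
        if (res.ok === 1 || !isRetryableCode(res.code) || attempt === numRetries) {
            return res;
        }
        // Reporting the failed node lets the ReplicaSetMonitor stop targeting the stale
        // primary, so the next attempt (with the same txnNumber in cmdObj) is routed to
        // the newly elected primary.
        conn._markNodeAsFailed(failedHost, res.code, res.errmsg);
    }
    return res;
}

In the actual override the failed host is derived from the rerouting connection rather than passed in as a parameter; the point here is only that the retry with the same txnNumber lands on the new primary once the failure is reported to the right RSM.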

 
