[SERVER-36739] Use the mongos_manual_intervention_action hook in concurrency stepdown suites
Created: 17/Aug/18  Updated: 29/Oct/23  Resolved: 24/Apr/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.7.0, 4.4.2, 4.2.11, 4.0.22

Type: Task
Priority: Major - P3
Reporter: Janna Golden
Assignee: Misha Tyulenev
Resolution: Fixed
Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Backport Requested: v4.4, v4.2, v4.0
Sprint: Sharding 2020-04-20, Sharding 2020-05-04
Participants:
Linked BF Score: 8

 Description   

Tests can fail in both concurrency_sharded_with_stepdowns and concurrency_sharded_with_stepdowns_and_balancer because of a failed shardCollection leaving partially written chunks. We should use the mongos_manual_intervention_action hook in both of these suites to catch and clean up this error.
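
The failure and the proposed cleanup can be sketched with a small in-memory model. This is only an illustration under stated assumptions, not the hook's actual implementation: `FakeCluster`, `NotMaster`, `ManualInterventionRequired`, `remove_chunks`, and `shard_collection_with_cleanup` are all hypothetical names, and the only detail taken from this ticket is the "A previous attempt to shard collection" message prefix.

```python
class NotMaster(Exception):
    """Hypothetical stand-in for a "not master" failure during a stepdown."""


class ManualInterventionRequired(Exception):
    """Hypothetical stand-in for the error seen when leftover chunks exist."""


class FakeCluster:
    """In-memory model of a sharded cluster; not a real MongoDB API."""

    def __init__(self):
        self.chunks = {}        # namespace -> list of chunk names
        self.sharded = set()    # namespaces that were sharded successfully
        self._fail_next = True  # simulate one stepdown mid-command

    def shard_collection(self, ns):
        if self.chunks.get(ns):
            # Message prefix quoted in this ticket; the rest is not reproduced.
            raise ManualInterventionRequired(
                f"A previous attempt to shard collection {ns}")
        self.chunks[ns] = ["chunk-0", "chunk-1"]  # chunks are written first
        if self._fail_next:
            self._fail_next = False
            raise NotMaster("not master")  # fails after chunks were written
        self.sharded.add(ns)

    def remove_chunks(self, ns):
        """Drop the partially written chunks for one namespace."""
        self.chunks.pop(ns, None)


def shard_collection_with_cleanup(cluster, ns, max_attempts=3):
    """Retry shardCollection, cleaning up leftover chunks when required."""
    for _ in range(max_attempts):
        try:
            cluster.shard_collection(ns)
            return True
        except NotMaster:
            continue  # plain retry; the leftover chunks are still present
        except ManualInterventionRequired:
            cluster.remove_chunks(ns)  # clean up, then retry
    return False
```

In this model the first attempt writes chunks and then fails, the second attempt is rejected because of the leftovers, and the third attempt succeeds after the cleanup, so at least three attempts are needed.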



 Comments   
Comment by Githook User [ 30/Oct/20 ]

Author:

{'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com', 'username': 'mikety'}

Message: SERVER-36739 Use the mongos_manual_intervention_action hook in concurrency stepdown suites

(cherry picked from commit c4873acda56712bba29f1ce1f81c6e8dc873669a)
Branch: v4.2
https://github.com/mongodb/mongo/commit/ac61ab5c094e159bddc40af74f09bae395bdd8c9

Comment by Githook User [ 30/Oct/20 ]

Author:

{'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com', 'username': 'mikety'}

Message: SERVER-36739 Use the mongos_manual_intervention_action hook in concurrency stepdown suites

(cherry picked from commit c4873acda56712bba29f1ce1f81c6e8dc873669a)
Branch: v4.0
https://github.com/mongodb/mongo/commit/592daad1b7f5038c86fa6979efa4c4a1fe07eae4

Comment by Githook User [ 30/Oct/20 ]

Author:

{'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com', 'username': 'mikety'}

Message: SERVER-36739 Use the mongos_manual_intervention_action hook in concurrency stepdown suites

(cherry picked from commit c4873acda56712bba29f1ce1f81c6e8dc873669a)
Branch: v4.4
https://github.com/mongodb/mongo/commit/388a6a2435eba24ccc5da1da2c874543ddad6a7a

Comment by Githook User [ 24/Apr/20 ]

Author:

{'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com', 'username': 'mikety'}

Message: SERVER-36739 Use the mongos_manual_intervention_action hook in concurrency stepdown suites
Branch: master
https://github.com/mongodb/mongo/commit/c4873acda56712bba29f1ce1f81c6e8dc873669a

Comment by Janna Golden [ 22/Aug/18 ]

max.hirschhorn, actually the command is sent with `RetryPolicy::kIdempotent`, so we already retry.
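
As a rough illustration of what retrying an idempotent command means here, the sketch below resends a command while the response carries a retriable error code. It is a simplified Python analogue, not the server's `RetryPolicy::kIdempotent` logic: `run_idempotent`, `send`, and the response dictionary shape are hypothetical, and only the NotMaster code 10107 is taken from the log line in this ticket.

```python
NOT_MASTER = 10107  # errCode reported in the log line in this ticket


def run_idempotent(send, retriable_codes=frozenset({NOT_MASTER}), max_attempts=3):
    """Call send() until it succeeds, fails non-retriably, or attempts run out.

    send is a hypothetical zero-argument callable returning a dict such as
    {"ok": 1} or {"ok": 0, "code": 10107}.
    """
    response = None
    for _ in range(max_attempts):
        response = send()
        if response.get("ok") == 1:
            return response  # success
        if response.get("code") not in retriable_codes:
            return response  # non-retriable error: give up immediately
    return response          # still failing after the last attempt
```

A retry like this resends the same command, so it does not by itself undo state left behind by the failed attempt, which is the situation the description above calls partially written chunks.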

Comment by Max Hirschhorn [ 22/Aug/18 ]

Max Hirschhorn - the shardCollection command doesn't return success, so it's retried.

I thought we had gotten further in running the test than we did because the error message says "A previous attempt to shard collection test23_fsmdb0.fsmcoll0" (emphasis mine) but the concurrency framework would only attempt to run the shardCollection command once for a particular namespace.

In the particular case of the linked BF, the primary shard (shard 0) sends cloneCollectionOptionsFromPrimaryShard() to shard 1 after creating chunks on itself. That call fails with a "NotMaster" error, which fails the shardCollection command and leaves partially written chunks.

I've included the relevant log messages below. Is it not possible to retry the _shardsvrShardCollection command on NotMaster error responses?

[ShardedClusterFixture:job0:shard0:node2] 2018-08-09T21:42:09.158+0000 I COMMAND  [conn203] command test23_fsmdb0.fsmcoll0 appName: "MongoDB Shell" command: _shardsvrShardCollection { _shardsvrShardCollection: "test23_fsmdb0.fsmcoll0", key: { _id: "hashed" }, unique: false, numInitialChunks: 0, collation: {}, getUUIDfromPrimaryShard: true, lsid: { id: UUID("f55d46f1-cb67-4759-af66-718ca4482506") }, writeConcern: { w: "majority", wtimeout: 60000 }, $clusterTime: { clusterTime: Timestamp(1533850928, 10), signature: { hash: BinData(0, 9ACEC0B3D1A53AD99C8E70022BF9CA9BC55DAABD), keyId: 6587837876187168797 } }, $client: { application: { name: "MongoDB Shell" }, driver: { name: "MongoDB Internal Client", version: "4.1.1-290-gb69f0e10fc" }, os: { type: "Windows", name: "Microsoft Windows Server 2008 R2", architecture: "x86_64", version: "6.1 SP1 (build 7601)" }, mongos: { host: "WIN-TBGR09QUU7D:20009", client: "127.0.0.1:59409", version: "4.1.1-290-gb69f0e10fc" } }, $configServerState: { opTime: { ts: Timestamp(1533850928, 10), t: 32 } }, $db: "admin" } planSummary: COUNT keysExamined:0 docsExamined:0 numYields:0 ok:0 errMsg:"Unable to create collection on shard-rs1 :: caused by :: not master" errName:NotMaster errCode:10107 reslen:447 locks:{ Global: { acquireCount: { r: 19, w: 8 } }, Database: { acquireCount: { r: 8, w: 7, W: 1 } }, Collection: { acquireCount: { r: 5, w: 1, W: 4 } }, oplog: { acquireCount: { w: 3 } } } protocol:op_msg 708ms
[ShardedClusterFixture:job0:configsvr:node1] 2018-08-09T21:42:09.159+0000 I NETWORK  [ShardRegistry] Marking host localhost:20005 as failed :: caused by :: NotMaster: Unable to create collection on shard-rs1 :: caused by :: not master

It is somewhat obnoxious that we jump through hoops in our testing infrastructure to ensure after each test in the stepdown suites that mongos knows the current primary of the CSRS and all the replica set shards, but the CSRS has exactly the same problem.

Comment by Janna Golden [ 22/Aug/18 ]

max.hirschhorn - the shardCollection command doesn't return success, so it's retried. In the particular case of the linked BF, the primary shard (shard 0) sends cloneCollectionOptionsFromPrimaryShard() to shard 1 after creating chunks on itself. That call fails with a "NotMaster" error, which fails the shardCollection command and leaves partially written chunks.

Comment by Max Hirschhorn [ 22/Aug/18 ]

janna.golden, the concurrency framework runs the shardCollection command in the main thread before permitting stepdowns. Why would the _shardsvrShardCollection command be getting run multiple times after the shardCollection command returns success?

Generated at Thu Feb 08 04:43:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.