[SERVER-64610] Stale shardVersion error in catalog shard POC Created: 17/Mar/22  Updated: 29/Oct/23  Resolved: 04/May/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.1.0-rc0

Type: Task Priority: Major - P3
Reporter: Andrew Shuvalov (Inactive) Assignee: Andrew Shuvalov (Inactive)
Resolution: Fixed Votes: 0
Labels: sharding-nyc-subteam2, sharding-nyc-subteam2-catalog-poc
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-63598 Umbrella ticket for minimal POC for o... Closed
Backwards Compatibility: Fully Compatible
Sprint: Sharding NYC 2022-04-04, Sharding NYC 2022-04-18, Sharding 2022-05-02, Sharding NYC 2022-05-16
Participants:
Story Points: 4

 Description   

Update: root cause:

The StaleDbVersion error generated by DatabaseShardingState is supposed to be resolved by an internal retry inside ExecCommandDatabase::_commandExec(): the handler catches StaleDbVersion, calls refreshDatabase(), and then recursively calls _commandExec().

This retry was gated by a check that the node is not a config server, because a config server is never the primary for any DB. In the catalog shard POC that assumption no longer holds, so the retry was skipped. The posted fix handles a catalog server differently from a standalone config server.
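
To make the gating issue concrete, here is a minimal, self-contained sketch of the retry pattern described above. It is not the actual server code: NodeRole, StaleDbVersionError, canRetryBeforeFix(), canRetryAfterFix(), commandExec() and simulate() are hypothetical stand-ins invented for illustration, and the real condition in ExecCommandDatabase::_commandExec() may be expressed differently.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Hypothetical stand-ins for illustration; not the real MongoDB server types.
enum class NodeRole { Shard, ConfigServer, CatalogShard };

struct StaleDbVersionError : std::runtime_error {
    StaleDbVersionError() : std::runtime_error("stale database version") {}
};

// Assumed shape of the old gate: every config server, including one that
// also acts as a catalog shard, skips the internal retry.
bool canRetryBeforeFix(NodeRole role) {
    return role == NodeRole::Shard;
}

// Assumed shape of the fix: only a standalone config server skips the retry.
bool canRetryAfterFix(NodeRole role) {
    return role != NodeRole::ConfigServer;
}

// Runs the command. On a stale DB version, refreshes and retries recursively
// if the gate allows it. Returns false when the error would reach the caller.
bool commandExec(const std::function<void()>& command,
                 const std::function<void()>& refreshDatabase,
                 bool canRetry,
                 int attemptsLeft = 2) {
    assert(attemptsLeft > 0);
    try {
        command();
        return true;
    } catch (const StaleDbVersionError&) {
        if (!canRetry || attemptsLeft <= 1) {
            return false;  // no retry allowed, or retries exhausted
        }
        refreshDatabase();
        return commandExec(command, refreshDatabase, canRetry, attemptsLeft - 1);
    }
}

// Simulates a command that fails with a stale DB version until a refresh.
bool simulate(NodeRole role, bool withFix) {
    bool refreshed = false;
    auto command = [&] { if (!refreshed) throw StaleDbVersionError(); };
    auto refresh = [&] { refreshed = true; };
    bool canRetry = withFix ? canRetryAfterFix(role) : canRetryBeforeFix(role);
    return commandExec(command, refresh, canRetry);
}
```

In this sketch, simulate(NodeRole::CatalogShard, false) fails because the old gate treats the catalog shard like any config server, while simulate(NodeRole::CatalogShard, true) succeeds after one refresh.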

Repro:

buildscripts/resmoke.py run --suite sharded_jscore_txns --numShards=1 --numReplSetNodes=3 --catalogShard=any jstests/core/rename_collection_long_name.js

Error:

[js_test:rename_collection_long_name] uncaught exception: Error: listIndexes failed: {
[js_test:rename_collection_long_name] 	"ok" : 0,
[js_test:rename_collection_long_name] 	"errmsg" : "got stale shardVersion response from shard shard-rs0 at host localhost:20000 :: caused by :: sharding status of collection test.renameSRC is not currently known and needs to be recovered",
[js_test:rename_collection_long_name] 	"code" : 13388,
[js_test:rename_collection_long_name] 	"codeName" : "StaleConfig",
[js_test:rename_collection_long_name] 	"ns" : "test.renameSRC",
[js_test:rename_collection_long_name] 	"vReceived" : Timestamp(0, 0),
[js_test:rename_collection_long_name] 	"vReceivedEpoch" : ObjectId("000000000000000000000000"),
[js_test:rename_collection_long_name] 	"vReceivedTimestamp" : Timestamp(0, 0),
[js_test:rename_collection_long_name] 	"shardId" : "shard-rs0",
[js_test:rename_collection_long_name] 	"$clusterTime" : {
[js_test:rename_collection_long_name] 		"clusterTime" : Timestamp(1647532487, 45),
[js_test:rename_collection_long_name] 		"signature" : {
[js_test:rename_collection_long_name] 			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
[js_test:rename_collection_long_name] 			"keyId" : NumberLong(0)
[js_test:rename_collection_long_name] 		}
[js_test:rename_collection_long_name] 	},
[js_test:rename_collection_long_name] 	"operationTime" : Timestamp(1647532487, 45)
[js_test:rename_collection_long_name] } :
[js_test:rename_collection_long_name] _getErrorWithCode@src/mongo/shell/utils.js:24:13
[js_test:rename_collection_long_name] DBCollection.prototype.getIndexes@src/mongo/shell/collection.js:753:15
[js_test:rename_collection_long_name] @jstests/core/rename_collection_long_name.js:34:37
[js_test:rename_collection_long_name] @jstests/core/rename_collection_long_name.js:43:3



 Comments   
Comment by Andrew Shuvalov (Inactive) [ 04/May/22 ]

This test now fails in a different way, which is tracked in its own ticket. The original failure was fixed.

Comment by Kaloian Manassiev [ 17/Mar/22 ]

Is that a bug or something that you want to implement?

Generated at Thu Feb 08 06:00:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.