[SERVER-22485] ShardNotFound error when looking up replica set with hosts in a different order than is stored in the ShardRegistry Created: 05/Feb/16  Updated: 06/Dec/22  Resolved: 15/Nov/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Shakir Sadikali Assignee: [DO NOT USE] Backlog - Sharding EMEA
Resolution: Done Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File configOrderRepro.js    
Issue Links:
Depends
depends on SERVER-21906 Race in ShardRegistry::reload and con... Closed
depends on SERVER-22556 Get rid of DBClientReplicaSet Closed
Duplicate
is duplicated by SERVER-28399 "Shard not found for server" error wh... Closed
is duplicated by SERVER-30917 mongodb3.2 version can't count meteda... Closed
Related
related to SERVER-22862 Deadlock between ReplicaSetMonitor up... Closed
related to DOCS-7177 mongos config string order Closed
related to SERVER-30603 mongos shouldn't log the warning in t... Closed
Assigned Teams:
Sharding EMEA
Operating System: ALL
Sprint: Sharding 10 (02/19/16), Sharding 11 (03/11/16), Sharding 12 (04/01/16), Sharding 16 (06/24/16), Sharding 18 (08/05/16)
Participants:
Case:

 Description   

We have a 4 shard cluster. We added 4 new shards. All operations that need to go against the entire cluster fail with errors of the following form.

MongoDB Enterprise mongos> db.col.find({fplID:"301806213",phase:"C"}).explain()
2016-02-01T11:07:36.498-0500 E QUERY    [thread1] Error: explain failed: {
        "code" : 70,
        "ok" : 0,
        "errmsg" : "Shard not found for server: amsp06xdt/goxsd3396:10051,goxsd3397:10021"
} :
_getErrorWithCode@src/mongo/shell/utils.js:23:13
throwOrReturn@src/mongo/shell/explainable.js:34:1
constructor/this.finish@src/mongo/shell/explain_query.js:176:24
DBQuery.prototype.explain@src/mongo/shell/query.js:497:12
@(shell):1:1
 
MongoDB Enterprise mongos>

Bouncing the mongos does not resolve the issue.
We do not believe we are encountering SERVER-21906.



 Comments   
Comment by Kaloian Manassiev [ 15/Nov/21 ]

With the removal of the legacy shard versioning path in 4.0 and later, this reverse lookup no longer happens, so the ordering problem has gone away.

Comment by Andy Schwerin [ 14/Jul/16 ]

I'm putting this into "debugging with submitter", while misha.tyulenev investigates the risk of a fix.

Comment by Spencer Brody (Inactive) [ 12/Jul/16 ]

Attached a test that reproduces the issue on 3.2.

Comment by Spencer Brody (Inactive) [ 14/Apr/16 ]

Haven't seen this happening on 3.2 since SERVER-21906 went in. Tentatively removing the plan to fix this for 3.2, unless we see more reports of it in the wild. Instead focusing this ticket on fixing this for 3.4 as part of the larger refactoring of the ShardRegistry.

Comment by Randolph Tan [ 10/Feb/16 ]

Note: The recent changes in master (SERVER-21906) made this harder to manifest.

Comment by Randolph Tan [ 09/Feb/16 ]

Note: it looks like this only affects code that uses ParallelSortClusteredCursor (most commands) and not the new AsyncResultsMerger (new find command).

Comment by Randolph Tan [ 09/Feb/16 ]

The issue is that calls to _shardingRequestMetadataWriter/_shardingReplyMetadataReader pass the full connection string here:

https://github.com/mongodb/mongo/blob/r3.3.1/src/mongo/client/dbclientcursor.cpp#L81
https://github.com/mongodb/mongo/blob/r3.3.1/src/mongo/client/dbclientcursor.cpp#L257-L258

This is problematic if the connection string refers to a replica set, since the internal map does not contain every possible ordering of the replica set nodes in connection string format. This means that if the string was stored in the map as "set/host1,host2", a lookup with "set/host2,host1" will not find the desired entry.
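
To illustrate the lookup failure described in the comment above, here is a minimal sketch in plain JavaScript. It is not MongoDB's ShardRegistry code, and it is not the fix that was eventually adopted (per the closing comment, the reverse lookup was removed altogether). The names canonicalize, rawMap, canonicalMap, lookupRaw, lookupCanonical and the shard id "shard0000" are hypothetical and exist only for this example. It shows that a map keyed by the raw connection string misses when the hosts are reordered, while a map keyed by a host-sorted form does not.

```javascript
// Illustrative sketch only: all names here are hypothetical and are
// not part of MongoDB's ShardRegistry or any MongoDB API.

// Reduce "set/hostA,hostB" to a canonical form by sorting the host list.
function canonicalize(connString) {
  const slash = connString.indexOf("/");
  if (slash === -1) {
    return connString; // no set name: treat as a single host, leave unchanged
  }
  const setName = connString.slice(0, slash);
  const hosts = connString
    .slice(slash + 1)
    .split(",")
    .map((h) => h.trim())
    .filter((h) => h.length > 0)
    .sort();
  return setName + "/" + hosts.join(",");
}

// Order-sensitive map keyed by the raw string, as in the comment above.
const rawMap = new Map([["set/host1,host2", "shard0000"]]);

// The same entry keyed by its canonical (host-sorted) form.
const canonicalMap = new Map([
  [canonicalize("set/host1,host2"), "shard0000"],
]);

function lookupRaw(connString) {
  return rawMap.get(connString);
}

function lookupCanonical(connString) {
  return canonicalMap.get(canonicalize(connString));
}

console.log(lookupRaw("set/host2,host1"));       // undefined (entry not found)
console.log(lookupCanonical("set/host2,host1")); // shard0000
```

Sorting only covers the reordering case discussed in this ticket. A connection string that lists a different subset of the replica set's hosts would still miss under this sketch.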

Generated at Thu Feb 08 04:00:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.