[SERVER-46752] moveChunk will keep on returning ShardNotFound until DatabaseVersion is updated Created: 10/Mar/20  Updated: 06/Dec/22  Resolved: 11/Mar/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 4.0.16
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Randolph Tan Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File test.js    
Issue Links:
Duplicate
duplicates SERVER-32871 ReplicaSetMonitorRemoved and ShardNot... Closed
Assigned Teams:
Sharding
Operating System: ALL
Participants:
Case:

 Description   

At the beginning of a chunk migration, we force a refresh of the collection metadata. This eventually calls CatalogCache::_getCollectionRoutingInfoAt, which in turn calls CatalogCache::getDatabase. getDatabase, however, always calls ShardRegistry::getShard on what it thinks is the database's current primary shard. If that shard has already been removed, the call errors out with ShardNotFound.
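The failure mode can be illustrated with a toy model. This is only a sketch: the class and method names echo the ones mentioned in this description, but the bodies, the `flush` method, and the `configDatabases` map are assumptions made for illustration and do not reflect the real C++ implementation.

```javascript
// Toy model only: hypothetical stand-ins for the server's C++ classes.
class ShardRegistry {
    constructor(shardIds) {
        this.shards = new Set(shardIds);
    }
    getShard(id) {
        if (!this.shards.has(id)) {
            const err = new Error("Shard " + id + " not found");
            err.codeName = "ShardNotFound";
            throw err;
        }
        return {id: id};
    }
}

class CatalogCache {
    // configDatabases: Map of dbName -> primary shard id (stands in for the config server).
    constructor(registry, configDatabases) {
        this.registry = registry;
        this.configDatabases = configDatabases;
        this.cachedDatabases = new Map();
    }
    getDatabase(dbName) {
        // Only consult the authoritative source when nothing is cached.
        if (!this.cachedDatabases.has(dbName)) {
            this.cachedDatabases.set(dbName, this.configDatabases.get(dbName));
        }
        const primary = this.cachedDatabases.get(dbName);
        // A stale entry pointing at a removed shard throws here on every call.
        this.registry.getShard(primary);
        return {name: dbName, primary: primary};
    }
    // Hypothetical stand-in for the effect of flushing the routing cache.
    flush() {
        this.cachedDatabases.clear();
    }
}
```

In this model, once the cached primary refers to a shard that has left the registry, every `getDatabase` call keeps throwing until the cached entry is dropped and re-read.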

To get around this issue, send a flushRouterConfig to the affected shard.
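A minimal sketch of the workaround, assuming it is run from the mongo shell against a connection opened directly to the affected shard. The helper name `flushRoutingCache` is hypothetical; `flushRouterConfig` is the command named above, issued here through `adminCommand`.

```javascript
// Hypothetical helper: `conn` is any object exposing adminCommand(cmd),
// e.g. a mongo shell connection (or db object) pointed at the affected shard.
function flushRoutingCache(conn) {
    const res = conn.adminCommand({flushRouterConfig: 1});
    // Surface a failed command instead of silently continuing.
    if (!res || res.ok !== 1) {
        throw new Error("flushRouterConfig failed: " + JSON.stringify(res));
    }
    return res;
}
```

After the command succeeds, retrying the moveChunk should force the shard to re-read the database entry rather than reuse the stale one.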



 Comments   
Comment by Randolph Tan [ 11/Mar/20 ]

Oh, that would explain it. The moveChunk command does a reload on the shard registry at the beginning, which will make it remove the shard and mark the database as invalidated.

Comment by Esha Maharishi (Inactive) [ 10/Mar/20 ]

Maybe it was fixed in 4.4 by SERVER-32871?

Comment by Randolph Tan [ 10/Mar/20 ]

Attached a JS test demonstrating the issue. It reproduces easily on v4.0 but doesn't appear to fail on current master. Need to investigate whether it's already fixed or the bug is just harder to surface.
