[SERVER-46752] moveChunk will keep on returning ShardNotFound until DatabaseVersion is updated
Created: 10/Mar/20 Updated: 06/Dec/22 Resolved: 11/Mar/20
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 4.0.16 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Randolph Tan | Assignee: | [DO NOT USE] Backlog - Sharding Team |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Assigned Teams: | Sharding |
| Operating System: | ALL |
| Description |
At the beginning of a chunk migration, the shard force-refreshes the collection metadata. This eventually calls CatalogCache::_getCollectionRoutingInfoAt, which in turn calls CatalogCache::getDatabase. However, getDatabase always calls ShardRegistry::getShard on the shard it believes is the database's current primary. If that shard has already been removed, the call fails with ShardNotFound, and moveChunk keeps returning ShardNotFound until the DatabaseVersion is updated. To work around this issue, send a flushRouterConfig command to the affected shard.
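A minimal Python sketch of the failure mode, as a toy model rather than the server's C++ code. The names `CatalogCache`, `ShardRegistry`, `get_database`, `force_refresh_collection`, and `flush_router_config` are hypothetical stand-ins for the components described above, and the scenario (database `test` cached with primary `shard0`, then moved to `shard1` before `shard0` is removed) is an assumed sequence, not necessarily the one in the attached test. It shows a cached primary being resolved through the shard registry without revalidation, so the lookup keeps failing until the cached entry is dropped.

```python
class ShardNotFound(Exception):
    """Hypothetical stand-in for the ShardNotFound error code."""


class ShardRegistry:
    """Hypothetical toy registry of the shards this node knows about."""

    def __init__(self, shard_ids):
        self._shards = set(shard_ids)

    def remove_shard(self, shard_id):
        self._shards.discard(shard_id)

    def get_shard(self, shard_id):
        if shard_id not in self._shards:
            raise ShardNotFound(f"Shard {shard_id} not found")
        return shard_id


class CatalogCache:
    """Hypothetical toy cache of database -> primary shard entries."""

    def __init__(self, registry, config_primaries):
        self._registry = registry
        self._config = config_primaries  # authoritative view (config server)
        self._db_cache = {}              # cached view, filled on first miss

    def get_database(self, db_name):
        if db_name not in self._db_cache:
            self._db_cache[db_name] = self._config[db_name]
        primary = self._db_cache[db_name]
        # Always resolves the *cached* primary, even when it is stale.
        self._registry.get_shard(primary)
        return primary

    def force_refresh_collection(self, ns):
        # Simplification: the collection refresh goes through get_database,
        # which does not invalidate the cached database entry.
        return self.get_database(ns.split(".", 1)[0])

    def flush_router_config(self):
        # Models the flushRouterConfig workaround: drop cached entries so the
        # next lookup reloads from the authoritative view.
        self._db_cache.clear()


registry = ShardRegistry({"shard0", "shard1"})
config = {"test": "shard0"}
cache = CatalogCache(registry, config)
cache.get_database("test")        # caches shard0 as the primary
config["test"] = "shard1"         # assumed movePrimary on the config server
registry.remove_shard("shard0")   # assumed removeShard of shard0

try:
    cache.force_refresh_collection("test.foo")
except ShardNotFound as exc:
    print(f"ShardNotFound: {exc}")  # ShardNotFound: Shard shard0 not found

cache.flush_router_config()
print(cache.force_refresh_collection("test.foo"))  # shard1
```

Retrying `force_refresh_collection` without the flush fails the same way every time, which mirrors the "keeps returning ShardNotFound" behavior in the title.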
| Comments |
| Comment by Randolph Tan [ 11/Mar/20 ] |
Oh, that would explain it. The moveChunk command reloads the shard registry at the beginning, which removes the shard and marks the database as invalidated.
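A minimal Python sketch of the sequence this comment describes, under stated assumptions: `Cluster`, `reload_shard_registry`, `get_database`, and `move_chunk` are hypothetical names for a toy model of one shard's cached routing state, not server APIs, and the movePrimary/removeShard steps are an assumed setup. Reloading the registry first drops the removed shard and invalidates any cached database whose primary was that shard, so the following lookup picks up the current primary instead of failing.

```python
class ShardNotFound(Exception):
    """Hypothetical stand-in for the ShardNotFound error code."""


class Cluster:
    """Hypothetical toy model of one shard's cached routing state."""

    def __init__(self, shards, primaries):
        self.config_shards = set(shards)         # authoritative shard list
        self.config_primaries = dict(primaries)  # authoritative db -> primary
        self.registry = set(shards)              # cached shard registry
        self.db_cache = dict(primaries)          # cached db -> primary

    def reload_shard_registry(self):
        removed = self.registry - self.config_shards
        self.registry = set(self.config_shards)
        # Invalidate cached databases whose primary shard was removed.
        for db, primary in list(self.db_cache.items()):
            if primary in removed:
                del self.db_cache[db]

    def get_database(self, db):
        if db not in self.db_cache:
            self.db_cache[db] = self.config_primaries[db]
        primary = self.db_cache[db]
        if primary not in self.registry:
            raise ShardNotFound(f"Shard {primary} not found")
        return primary

    def move_chunk(self, ns):
        # Reload the registry first, then resolve the database's primary.
        self.reload_shard_registry()
        return self.get_database(ns.split(".", 1)[0])


cluster = Cluster({"shard0", "shard1"}, {"test": "shard0"})
cluster.config_primaries["test"] = "shard1"  # assumed movePrimary
cluster.config_shards.discard("shard0")      # assumed removeShard
print(cluster.move_chunk("test.foo"))        # shard1
```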
| Comment by Esha Maharishi (Inactive) [ 10/Mar/20 ] |
Maybe it was fixed in 4.4 by |
| Comment by Randolph Tan [ 10/Mar/20 ] |
Attached a JS test demonstrating the issue. It reproduces easily on v4.0 but doesn't appear to fail on current master. Need to investigate whether it's already fixed or the bug is just harder to surface there.