[SERVER-32871] ReplicaSetMonitorRemoved and ShardNotFound errors on fanout query after removing a shard
Created: 24/Jan/18 | Updated: 30/Oct/23 | Resolved: 09/Sep/19
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.4.18, 3.6.9 |
| Fix Version/s: | 4.3.1, 4.2.6 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | David Bartley | Assignee: | Matthew Saltz (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ShardingRoughEdges, high-value |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.2, v4.0 |
| Sprint: | Sharding 2019-09-09 |
| Case: | (copied to CRM) |
| Description |
We've noticed that after removing a shard, fanout queries (e.g. a collection count issued against a sharded collection) return ReplicaSetMonitorRemoved or ShardNotFound errors. While investigating, it looks like the internal chunk cache holds an old config (getShardVersion on the collection returns an old version). It appears that as long as no non-fanout queries (or inserts/removes) are issued after the removal has completed, fanout queries on some mongos instances have a relatively high chance of consistently failing.
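For illustration, here is a minimal mongo shell sketch of the failure mode described above; the database, collection, and shard names are placeholders:

```javascript
// Hypothetical names; run against a mongos. Poll removeShard until the drain
// finishes, then issue a fanout read from the same router without any
// intervening targeted operations.
var res;
do {
    res = db.adminCommand({removeShard: "shard0000"});
    sleep(1000);
} while (res.state !== "completed");

// With a stale routing table cache, this count may fail with
// ShardNotFound or ReplicaSetMonitorRemoved.
db.getSiblingDB("test").coll.count();
```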
| Comments |
| Comment by Githook User [ 13/Apr/20 ] |

Author: Matthew Saltz <matthew.saltz@mongodb.com> (saltzm)
Message: (cherry picked from commit 6ea81c883e7297be99884185c908c7ece385caf8)
| Comment by Esha Maharishi (Inactive) [ 09/Sep/19 ] |

alyson.cabral, how far back do we want to backport this change to mongos?
| Comment by Githook User [ 06/Sep/19 ] |

Author: Matthew Saltz <matthew.saltz@mongodb.com> (saltzm)
Message:
| Comment by Kaloian Manassiev [ 12/Dec/18 ] |

esha.maharishi, what Aly is suggesting is exactly your idea. Was the lack of a namespace the only problem that you encountered? Given how rare ShardNotFound is, we could just throw out the entire CatalogCache, since who knows what other collections also hold a reference to the removed shard anyway.

EDIT: I take that back. It looks like we also use ShardNotFound in other cases where the user specifies the shard id (such as moveChunk, movePrimary, removeShard, etc.). Treating ShardNotFound the same way in these cases is not a good idea, because we would be throwing out the catalog cache unnecessarily.
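For context, these are the sorts of commands where the user names a shard explicitly and ShardNotFound is a legitimate response rather than a staleness signal; the shard and namespace names below are hypothetical:

```javascript
// Each of these can correctly return ShardNotFound for a nonexistent shard id;
// invalidating the whole catalog cache on that path would be wasted work.
db.adminCommand({moveChunk: "test.coll", find: {_id: 0}, to: "noSuchShard"});
db.adminCommand({movePrimary: "test", to: "noSuchShard"});
db.adminCommand({removeShard: "noSuchShard"});
```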
| Comment by Alyson Cabral (Inactive) [ 12/Dec/18 ] |

I've added the ShardingRoughEdges and high-value labels so we consider this next quarter. This seems like it could be a sustained problem for zone sharding topologies with regional mongos routers: those routers may continue to target a shard that was removed, without ever discovering a new shard version from other zones. Another example of where this could be a sustained problem is microservices with sidecar mongos routers that target a small percentage of the data, potentially even just a small unsharded collection in a larger sharded cluster.

One thing I would like us to consider when we pick up this ticket again is whether we can always just pull the new chunk map on ShardNotFound errors, since we expect this to be a very rare error.
| Comment by Esha Maharishi (Inactive) [ 07/Dec/18 ] |

Never mind, the fix is not so simple. I tried a diff along those lines, but it's not enough: at the point in command dispatch on the router where stale version exceptions are caught (so that the routing table cache entry can be invalidated and the command retried), the exception needs to contain a namespace to mark as invalid, but ShardNotFound is not (currently) associated with a namespace.

alyson.cabral, how should we re-prioritize this? Maybe under ShardingRoughEdges?
| Comment by Esha Maharishi (Inactive) [ 07/Dec/18 ] |

This happens because, after the shard is removed, the ShardRegistry (the cache of the shard list) on each router soon discovers through its periodic refresh that the shard has been removed, and drops the entry for that shard from the cache. In order to remove a shard, all chunks that lived on it must first be moved off. However, the mongos continues to target the removed shard until its routing table cache is refreshed, and thus encounters the ShardNotFound or ReplicaSetMonitorRemoved error.

Until this is fixed, the issue can be worked around by flushing the routing table cache on the router.

A fix proposed by matthew.saltz is for the router to treat ShardNotFound and ReplicaSetMonitorRemoved errors as "stale version" errors, so that on these errors the router marks its routing table cache as invalid and retries the request with a refreshed routing table. This is a pretty small fix, so I'll give it a try.
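For anyone hitting this before the fix lands, the workaround described above amounts to running flushRouterConfig against the affected mongos:

```javascript
// Discards the router's cached routing table; the next operation that needs
// routing information will fetch a fresh copy from the config servers.
db.adminCommand({flushRouterConfig: 1});
```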
| Comment by Danny Hatcher (Inactive) [ 21/Nov/18 ] |

I've confirmed that remove4.js
| Comment by David Bartley [ 16/Nov/18 ] |

Do the tests not reproduce the issue? We noticed this in our initial evaluation/stress test of a MongoDB sharded cluster and haven't retested since, but I have no reason to believe it's not still a problem.
| Comment by Ramon Fernandez Marina [ 15/Nov/18 ] |

bartle, looks like we let this ticket fall through the cracks; my apologies. Is this still an issue for you? At the time I was not able to reproduce the behavior you described, but we can make another pass if you're still experiencing issues.

Thanks,
| Comment by David Bartley [ 12/Feb/18 ] |

That failing test specifically waits for removeShard to return state: "completed", and is based on a similar test that does the same. Even if it didn't, it still seems like ongoing queries to mongos should continue to work throughout the entire removal operation.
| Comment by Mark Agarunov [ 12/Feb/18 ] |

Hello bartle,

Thank you for the additional information and for providing the repro script. Looking at the script, it appears that it may not be waiting for the shard to drain, and it seems to execute commands in a different order than specified in the documentation. If possible, could you please remove the shard manually and see if this behavior persists?

Thanks,
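For reference, a sketch of the documented drain-and-remove flow the script would need to follow; the shard and database names are placeholders:

```javascript
// 1. Start draining; the first call returns { state: "started" }.
db.adminCommand({removeShard: "shardToRemove"});

// 2. Re-run removeShard to poll; while chunks remain it returns
//    { state: "ongoing", remaining: { chunks: ..., dbs: ... } }.
db.adminCommand({removeShard: "shardToRemove"});

// 3. If the shard is the primary shard for any database, that database
//    must be moved to another shard before removal can finish.
db.adminCommand({movePrimary: "someDB", to: "anotherShard"});

// 4. Keep polling until removeShard returns { state: "completed" }.
db.adminCommand({removeShard: "shardToRemove"});
```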
| Comment by David Bartley [ 25/Jan/18 ] |

This is also reproducible on 3.6.2 with a slightly modified test (after removing a shard, stop the shard's replica set).
| Comment by David Bartley [ 25/Jan/18 ] |

Hi Mark, thanks for looking into this! I attached a failing jstest and logs from that run. The test case is a bit more pathological than what I originally reported (start with 2 shards, add a 3rd, move the chunks from the first 2 shards to the 3rd, remove the first 2 shards, then issue a query that would have been routed to one of the now-removed shards if the cache were stale), but it fails 100% of the time.
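A jstest-style sketch of those steps, assuming the shell test helpers ShardingTest and ReplSetTest; this is illustrative only, not the attached test:

```javascript
// Illustrative only; names are placeholders.
var st = new ShardingTest({shards: 2});
assert.commandWorked(st.s.adminCommand({enableSharding: "test"}));
assert.commandWorked(st.s.adminCommand({shardCollection: "test.foo", key: {_id: 1}}));

// Add a third shard.
var rs3 = new ReplSetTest({name: "thirdShard", nodes: 1});
rs3.startSet();
rs3.initiate();
var newShard = assert.commandWorked(st.s.adminCommand({addShard: rs3.getURL()})).shardAdded;

// Move the data and the database primary off the original shards.
assert.commandWorked(st.s.adminCommand({moveChunk: "test.foo", find: {_id: 0}, to: newShard}));
assert.commandWorked(st.s.adminCommand({movePrimary: "test", to: newShard}));

// Drain and remove the two original shards.
[st.shard0.shardName, st.shard1.shardName].forEach(function(name) {
    var res;
    do {
        res = st.s.adminCommand({removeShard: name});
        sleep(100);
    } while (res.state !== "completed");
});

// A fanout query routed with the stale cache now targets a removed shard
// and fails with ShardNotFound / ReplicaSetMonitorRemoved.
st.s.getDB("test").foo.count();
```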
| Comment by Mark Agarunov [ 24/Jan/18 ] |

Hello bartle,

Thank you for the report. To get a better idea of what part of the process is causing this behavior, could you please provide the complete output of the shard removal process, as well as the complete log files from the affected mongod and mongos nodes? This should give us some insight into which component is the source of the behavior.

Thanks,