[SERVER-32871] ReplicaSetMonitorRemoved and ShardNotFound errors on fanout query after removing a shard Created: 24/Jan/18  Updated: 30/Oct/23  Resolved: 09/Sep/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.4.18, 3.6.9
Fix Version/s: 4.3.1, 4.2.6

Type: Bug Priority: Major - P3
Reporter: David Bartley Assignee: Matthew Saltz (Inactive)
Resolution: Fixed Votes: 0
Labels: ShardingRoughEdges, high-value
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: logs-3.6.txt, logs.txt, remove4-3.6.js, remove4.js
Issue Links:
Backports
Depends
Duplicate
is duplicated by SERVER-18175 Chunk manager references removed shar... Closed
is duplicated by SERVER-30701 ChunkManager can contain chunks that ... Closed
is duplicated by SERVER-46752 moveChunk will keep on returning Shar... Closed
Related
related to SERVER-39205 Force the cache refresh after the sha... Closed
related to SERVER-40709 CatalogCache should mark database ent... Closed
is related to SERVER-43197 Improve testing for ShardRegistry rel... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.2, v4.0
Steps To Reproduce:
  • Shard a collection, test.test, across shard1, shard2, etc.
  • Make sure that no queries/inserts/... against test.test occur during or after the shard removal
  • On mongos1, run {removeShard: "shard1"}
  • Wait until the removal is complete (i.e. removeShard reports state: "completed")
  • On each mongos, call db.test.count() (see the shell sketch below)
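
A minimal shell sketch of these steps (the collection name, shard names, and the polling loop below are illustrative; the attached remove4.js is the authoritative reproduction):

sh.enableSharding("test")
sh.shardCollection("test.test", {_id: 1})
// ...insert data and let the balancer spread chunks across shard1, shard2, etc...

// On mongos1: start the removal and poll until it reports completion.
var res
do {
    sleep(1000)
    res = db.adminCommand({removeShard: "shard1"})
} while (res.state !== "completed")

// On each mongos (with no other traffic to test.test in between), a fanout
// query may now fail with ShardNotFound or ReplicaSetMonitorRemoved:
db.getSiblingDB("test").test.count()
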
Sprint: Sharding 2019-09-09
Participants:
Case:

 Description   

We've noticed that after removing a shard, fanout queries (e.g. a collection count against a sharded collection) return ReplicaSetMonitorRemoved or ShardNotFound errors. While investigating, we found that the internal chunk cache still holds the old configuration (getShardVersion on the collection returns an old version). It appears that as long as no non-fanout queries (or inserts/removes) are issued after the removal has completed, fanout queries on some mongos routers are very likely to fail consistently.
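
For reference, the cached version mentioned above can be inspected from a mongos with the getShardVersion admin command; the namespace below assumes the test.test collection from the repro steps:

// Run against the admin database on a mongos; reports the router's cached
// version for the collection.
db.adminCommand({getShardVersion: "test.test"})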



 Comments   
Comment by Githook User [ 13/Apr/20 ]

Author:

{'name': 'Matthew Saltz', 'email': 'matthew.saltz@mongodb.com', 'username': 'saltzm'}

Message: SERVER-32871 Invalidate CatalogCache entries when ShardRegistry reload discovers a shard has been removed

(cherry picked from commit 6ea81c883e7297be99884185c908c7ece385caf8)
Branch: v4.2
https://github.com/mongodb/mongo/commit/fafc69977611a6ee306f3e8dd2b8bad065556d29

Comment by Esha Maharishi (Inactive) [ 09/Sep/19 ]

alyson.cabral, how far back do we want to backport this change to mongos?

Comment by Githook User [ 06/Sep/19 ]

Author:

{'name': 'Matthew Saltz', 'username': 'saltzm', 'email': 'matthew.saltz@mongodb.com'}

Message: SERVER-32871 Invalidate CatalogCache entries when ShardRegistry reload discovers a shard has been removed
Branch: master
https://github.com/mongodb/mongo/commit/6ea81c883e7297be99884185c908c7ece385caf8

Comment by Kaloian Manassiev [ 12/Dec/18 ]

esha.maharishi, what Aly is suggesting is exactly what your idea was. Was the lack of a namespace the only problem that you encountered? Given how rare ShardNotFound is, we could just throw out the entire CatalogCache, since other collections may well also hold references to the removed shard anyway.

EDIT: I take that back. It looks like we also use ShardNotFound in other cases where the user specifies the shard id (such as moveChunk, movePrimary, removeShard, etc.). Treating ShardNotFound the same way in those cases is not a good idea, because we would be throwing out the catalog cache unnecessarily.

Comment by Alyson Cabral (Inactive) [ 12/Dec/18 ]

I've added the ShardingRoughEdges and high-value labels to consider next quarter.

This seems like it could be a sustained problem for zone sharding topologies with regional mongos routers. These routers may continue to target a shard that was removed without discovering a new shard version from other zones. Another example of where this could be a sustained problem is for microservices with sidecar mongos routers that target a small percentage of the data, potentially even just a small unsharded collection in a larger sharded cluster.

One thing I would like us to consider when we pick up this ticket again is whether we can always just pull the new chunk map on ShardNotFound errors, as we expect this to be a very rare error.

Comment by Esha Maharishi (Inactive) [ 07/Dec/18 ]

Never mind, the fix is not so simple.

I tried this diff:

diff --git a/src/mongo/base/error_codes.err b/src/mongo/base/error_codes.err
index ca74eb7..9f2980e 100644
--- a/src/mongo/base/error_codes.err
+++ b/src/mongo/base/error_codes.err
@@ -321,7 +321,7 @@ error_class("NotMasterError", [
 error_class("StaleShardVersionError",
             ["StaleConfig", "StaleShardVersion", "StaleEpoch"])
 error_class("NeedRetargettingError",
-            ["StaleConfig", "StaleShardVersion", "StaleEpoch", "CannotImplicitlyCreateCollection"])
+            ["StaleConfig", "StaleShardVersion", "StaleEpoch", "CannotImplicitlyCreateCollection", "ShardNotFound"])
 error_class("WriteConcernError", ["WriteConcernFailed",
                                   "WriteConcernLegacyOK",
                                   "UnknownReplWriteConcern",

But it's not enough. At the point in the router's command dispatch where stale-version exceptions are caught, so that the routing table cache entry can be invalidated and the command retried, the exception needs to carry a namespace to mark as invalid, but ShardNotFound is not (currently) associated with a namespace:

            } catch (const ExceptionForCat<ErrorCategory::NeedRetargettingError>& ex) {
                const auto staleNs = [&] {
                    if (auto staleInfo = ex.extraInfo<StaleConfigInfo>()) {
                        return staleInfo->getNss();
                    } else if (auto implicitCreateInfo =
                                   ex.extraInfo<CannotImplicitlyCreateCollectionInfo>()) {
                        // Requests that attempt to implicitly create a collection in a transaction
                        // should always fail with OperationNotSupportedInTransaction - this
                        // assertion is only meant to safeguard that assumption.
                        uassert(50983,
                                str::stream() << "Cannot handle exception in a transaction: "
                                              << ex.toStatus(),
                                !TransactionRouter::get(opCtx));
 
                        return implicitCreateInfo->getNss();
                    } else {
                        throw;
                    }
                }();

alyson.cabral, how should we re-prioritize this? Maybe under ShardingRoughEdges?

Comment by Esha Maharishi (Inactive) [ 07/Dec/18 ]

This happens because, after the shard is removed, the ShardRegistry (the cache of the shard list) on each router soon discovers through its periodic refresh that the shard is gone and drops the shard's entry from that cache.

In order to remove a shard, all chunks that lived on it must first be moved off. However, the mongos's routing table cache may still map those chunks to the removed shard until it is refreshed, so the mongos continues to target the removed shard and hits the ShardNotFound or ReplicaSetMonitorRemoved error.

Until this is fixed, if this issue is encountered, it can be worked around by flushing the routing table cache on the router.
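
A sketch of that workaround using the flushRouterConfig admin command (the typical way to flush a mongos's routing table cache):

// Run on each mongos that is returning these errors; the router rebuilds its
// routing table cache from the config servers on the next request.
// (Later server versions also accept a specific database or collection name.)
db.adminCommand({flushRouterConfig: 1})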

A fix proposed by matthew.saltz is for a router to treat ShardNotFound and ReplicaSetMonitorRemoved errors as "stale version" errors, so that on these errors, the router marks its routing table cache as invalid and retries the request with a refreshed routing table cache. This is a pretty small fix, so I'll give it a try.

Comment by Danny Hatcher (Inactive) [ 21/Nov/18 ]

I've confirmed that remove4.js encounters an error on 3.4.18 and remove4-3.6.js encounters an error on 3.6.9.

Comment by David Bartley [ 16/Nov/18 ]

Do the tests not reproduce the issue? We noticed this during our initial evaluation/stress test of a MongoDB sharded cluster and haven't retested since, but I have no reason to believe it's not still a problem.

Comment by Ramon Fernandez Marina [ 15/Nov/18 ]

bartle, looks like we let this ticket fall through the cracks – my apologies. Is this still an issue for you? At the time I was not able to reproduce the behavior you described, but we can make another pass if you're still experiencing issues.

Thanks,
Ramón.

Comment by David Bartley [ 12/Feb/18 ]

That failing test specifically waits for removeShard to return state: "completed", and is based on a similar test that does the same. Even if it didn't, it still seems like ongoing queries through mongos should continue to work throughout the entire removal operation.

Comment by Mark Agarunov [ 12/Feb/18 ]

Hello bartle,

Thank you for the additional information and providing the repro script. Looking at the script, it appears that it may not be waiting for the shard to drain, and seems to execute commands in a different order than specified in the documentation. If possible, could you please manually remove the shard and see if this behavior persists?

Thanks,
Mark

Comment by David Bartley [ 25/Jan/18 ]

This is also reproducible on 3.6.2 with a slightly modified test (after removing a shard, stop the shard replset).

Comment by David Bartley [ 25/Jan/18 ]

Hi Mark, thanks for looking into this!

I attached a failing jstest and logs from that run. The test case is a bit more pathological than what I originally reported (start with 2 shards, add a 3rd, move the chunks from the first 2 shards to the 3rd, remove the first 2 shards, then issue a query that would have been routed to one of the now-removed shards if the cache were stale), but it fails 100% of the time.
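
For context, a rough sketch of that sequence; the third shard's name/host and the chunk bound are illustrative, and the attached remove4.js is the actual test:

// 1. Start with shard1 and shard2 owning test.test, then add a third shard.
sh.addShard("shard3/host3:27018")
// 2. Move each chunk of test.test off the first two shards onto the new one.
sh.moveChunk("test.test", {_id: MinKey}, "shard3")
// ...repeat for every chunk still owned by shard1 or shard2...
// 3. Remove the first two shards one at a time, re-running removeShard until
//    each reports state: "completed".
db.adminCommand({removeShard: "shard1"})
db.adminCommand({removeShard: "shard2"})
// 4. A query that the stale routing cache would route to a removed shard now
//    fails every time:
db.getSiblingDB("test").test.count()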

Comment by Mark Agarunov [ 24/Jan/18 ]

Hello bartle,

Thank you for the report. To get a better idea of what part of the process is causing this behavior, could you please provide the complete output of the shard-removal process as well as the complete log files from the affected mongod and mongos nodes? This should give us some insight into which component is the source of the behavior.

Thanks,
Mark
