[SERVER-34632] config.chunks change to config.cache.chunks creates a collection long name after upgrade Created: 24/Apr/18  Updated: 26/Oct/23

Status: Backlog
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Shay Assignee: Backlog - Catalog and Routing
Resolution: Unresolved Votes: 4
Labels: SSCCL-BUG, oldshardingemea
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-47372 config.cache collections can remain e... Closed
is depended on by SERVER-43217 Secondaries can hang refreshing metad... Closed
Duplicate
is duplicated by SERVER-52765 Support UUIDs in the catalog cache lo... Closed
Related
related to SERVER-35092 ShardServerCatalogCacheLoader should ... Closed
Assigned Teams:
Catalog and Routing
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2018-05-21, Sharding 2018-06-04, Sharding 2018-06-18, Sharding 2018-07-02, Sharding 2018-07-16, Sharding 2018-07-30, Sharding 2018-08-13, Sharding EMEA 2021-06-14, Sharding EMEA 2021-06-28, Sharding EMEA 2021-07-12
Linked BF Score: 0

Description

After upgrading from 3.4 to 3.6, I get many of these errors for different sharded collections:

I SHARDING [ShardServerCatalogCacheLoader-4] InvalidNamespace: Failed to update the persisted chunk metadata for collection 'easypalletideas-uorzoqjxdxhndfy_stackpat_40594.easypalletideas-uorzoqjxdxhndfy_stackpat_40594_counters_hourly' from '0|0||000000000000000000000000' to '1|0||5adc9e47ed72f1e7708a691d' due to 'fully qualified namespace config.cache.chunks.easypalletideas-uorzoqjxdxhndfy_stackpat_40594.easypalletideas-uorzoqjxdxhndfy_stackpat_40594_counters_hourly is too long (max is 120 bytes)'. Will be retried.

This did not happen before the upgrade.

I suspect that the change from SERVER-31644 is making the cache collection name too long, even though the original collection name is within the limit.
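
For illustration, the length arithmetic in a minimal mongo shell sketch (the namespaces are copied from the log line above; byte counts assume pure ASCII names):

// The source namespace fits within the 120-byte namespace limit, but the
// persisted cache collection prepends "config.cache.chunks." to it.
var sourceNs = "easypalletideas-uorzoqjxdxhndfy_stackpat_40594." +
               "easypalletideas-uorzoqjxdxhndfy_stackpat_40594_counters_hourly";
var cacheNs = "config.cache.chunks." + sourceNs;
print(sourceNs.length);  // 109 -- a valid user namespace
print(cacheNs.length);   // 129 -- over the 120-byte maximum, so the write fails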



Comments
Comment by Kaloian Manassiev [ 18/Oct/21 ]

The feature was disabled in 5.1 (under SERVER-58367) until we reach agreement on the backwards-compatibility implications during downgrade.

Comment by Githook User [ 09/Jul/21 ]

Author: Antonio Fuschetto <antonio.fuschetto@mongodb.com> (afuschetto)

Message: SERVER-34632 config.chunks change to config.cache.chunks creates a collection long name after upgrade
Branch: master
https://github.com/mongodb/mongo/commit/33aaa656979089a8d6530d1ae3ff15335b13508a

Comment by Antonio Fuschetto [ 06/Jul/21 ]

Code review URL: https://mongodbcr.appspot.com/798010001

Comment by Julio Viera [ 17/Dec/18 ]

Is there any update, ETA or workaround available for this other than renaming the collections? Thanks!

Comment by Githook User [ 24/May/18 ]

Author: Kaloian Manassiev <kaloian.manassiev@mongodb.com> (kaloianm)

Message: SERVER-34632 Rename `struct dbTask` to DBTask

... to follow naming conventions
Branch: master
https://github.com/mongodb/mongo/commit/b919fb48eb611b3c8cbba9d7f03f6df1d25d4cd5

Comment by Githook User [ 24/May/18 ]

Author: Kaloian Manassiev <kaloian.manassiev@mongodb.com> (kaloianm)

Message: SERVER-34632 Use alias for the callback of CatalogCacheLoader::getChunksSince

Also use StringMap in CollectionShardingState instead of std::unordered_map.
Branch: master
https://github.com/mongodb/mongo/commit/cb0393248d26e21e69efde15d9d3965293ead29b

Comment by Esha Maharishi (Inactive) [ 16/May/18 ]

I agree, since the 3.6.5 primary's behavior in the mixed-version replica set is pretty much the same as before the fix.

It might be good to confirm that the 3.6.6 secondary will fail reads with non-available readConcern in this case.

Comment by Kaloian Manassiev [ 16/May/18 ]

Remember that this will be backported to 3.6 as well. But yes, if a 3.6.6 node is promoted to primary, manages to create the view, and then a 3.6.5 node is promoted to primary before the entire shard is upgraded, the 3.6.5 primary will fail. Getting out of this situation will require either finishing the upgrade or downgrading to 3.6.5 and manually deleting the created views.

I think that this is a reasonable tradeoff compared to the alternative, which would require writing to two collections (UUID and namespace) and a complex handoff protocol about which collection the secondaries should read from.
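
For reference, the manual view cleanup mentioned above could look roughly like this from a mongo shell against the downgraded primary (a hypothetical sketch, not a documented procedure):

// Find every view under the config database whose name carries the
// cache.chunks prefix and drop it.
var configDb = db.getSiblingDB("config");
configDb.getCollectionInfos({type: "view"}).forEach(function(info) {
    if (info.name.indexOf("cache.chunks.") === 0) {
        configDb.getCollection(info.name).drop();
    }
});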

Comment by Esha Maharishi (Inactive) [ 16/May/18 ]

Hmm, ok, and finally, how does this work in a mixed-version replica set (say two nodes, one on 3.6.x and the other on 4.0)?

If the 3.6.x node steps up, will it try to write to "config.cache.chunks.<ns>", see it's a view, and fail?

Comment by Kaloian Manassiev [ 15/May/18 ]

esha.maharishi, I edited the description above to clarify. The drop of the namespace-suffixed collections will happen as part of the regular update of the cache collection. The sequence is: set the cache collection as "in-update" (which causes the secondaries to disregard what they read and wait until the "in-update" flag is cleared), drop the namespace-suffixed collection (secondaries may briefly see no chunks, but that doesn't matter because they will loop around and retry), then create the UUID-suffixed collection plus the view (only if the view name doesn't exceed the name size limitations, which I don't expect to be an issue).
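
In mongo shell terms, the create step at the end of that sequence would look roughly like the following (an illustrative sketch only; the server performs these writes internally, and sourceNs/collUuid are placeholder values):

// Create the UUID-suffixed cache collection (its name stays short no matter
// how long the source namespace is), then add a namespace-suffixed view as a
// backwards-compatible alias, but only when the view name fits the limit.
var configDb = db.getSiblingDB("config");
var sourceNs = "test.sharded_coll";         // placeholder namespace
var collUuid = "5adc9e47ed72f1e7708a691d";  // placeholder identifier
configDb.createCollection("cache.chunks." + collUuid);
if (("config.cache.chunks." + sourceNs).length <= 120) {
    // Identity view over the UUID-suffixed collection.
    configDb.createView("cache.chunks." + sourceNs, "cache.chunks." + collUuid, []);
}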

Comment by Esha Maharishi (Inactive) [ 15/May/18 ]

The view idea sounds neat. Couple questions -

"The primary node will first drop the namespace-suffixed collection, then construct a UUID-named one"

When will this occur? (On startup/transition to primary; on setFCV=4.0?).

Will the drop + creates be atomic? (We might be able to just drop, and let the next refresh do the creates?)

Using config.cache.chunks.<uuid> might also solve SERVER-34878 and SERVER-34904.

Comment by Kaloian Manassiev [ 15/May/18 ]

The plan is to fix this problem through the following changes:

  1. Simplify the ShardServerCatalogCacheLoader, because the logic within it is currently tangled together and not very amenable to change. I plan on pulling out the management of the config.cache.chunks/collections namespaces, plus the in-memory queues, into a separate class called RoutingInfoCacheCollection which can be unit-tested. This class will be the means to interact with the cache collections. Along with this I plan to throw out most of the StatusWith usages and replace them with exceptions.
  2. The primary node, as part of the "in-update" logic it currently uses, will first drop the namespace-suffixed collection, then construct a UUID-named one and, in order to preserve backwards compatibility with 3.6.x versions that do not contain this fix, will also create a namespace-named view that aliases the UUID-named collection.
  3. The secondary nodes will first look for a UUID-suffixed cache collection and read from it; if that is not available, they will fall back to the namespace-suffixed cache collection only (note that there is no need to look for the view, because the presence of a view means that there must be a UUID-suffixed cache collection). A sketch of this lookup order follows below.
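
A minimal sketch of that lookup order, assuming hypothetical names (readCachedChunks is not an actual server function):

// Prefer the UUID-suffixed cache collection; fall back to the
// namespace-suffixed one when it does not exist yet.
function readCachedChunks(configDb, collUuid, ns) {
    var byUuid = "cache.chunks." + collUuid;
    if (configDb.getCollectionInfos({name: byUuid, type: "collection"}).length > 0) {
        return configDb.getCollection(byUuid).find().toArray();
    }
    return configDb.getCollection("cache.chunks." + ns).find().toArray();
}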

renctan, esha.maharishi, schwerin, can you please review this plan?

Comment by Shay [ 30/Apr/18 ]

Hi Kal,

Thank you for your response.

We are working on shortening the collection names; however, this is hard to do without downtime to our application.

Until a fix is available, I would suggest adding a note to the limits documentation and to the 3.4-to-3.6 upgrade documentation to help others avoid this issue.

Please update this ticket when you have an estimate of when a fix will be made available and what changes it will involve.

Regards,

  Shay

Comment by Kaloian Manassiev [ 30/Apr/18 ]

Hi Rybak,

Sorry for the silence on this ticket. We are aware of what is causing the problem and are working on a solution. Unfortunately, there is currently no workaround other than using a shorter collection name (I noticed there is some name duplication in the collection name you pasted).

To give you a little bit of context: these warning messages are an indication that the shard's chunk filtering metadata could not be persisted on the primary, and as a result reads against secondary nodes with anything other than the default read concern will not work. In addition, because these failed operations are retried internally, they may build up in-memory state over time and cause the server's memory usage to grow unbounded.
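
As an illustration of that failure mode from a mongo shell (a hedged sketch; "db.coll" is a placeholder for any sharded collection on the affected shard, and exact behavior depends on version):

// Secondary reads that need the persisted filtering metadata block or fail,
// while readConcern "available" bypasses the metadata check entirely.
db.getMongo().setReadPref("secondary");
db.coll.find().readConcern("available").toArray();  // succeeds (may return orphaned documents)
db.coll.find().readConcern("local").toArray();      // blocks/fails until the metadata persists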

Apologies for the inconvenience and please continue monitoring this ticket for when this fix will be available.

Best regards,

-Kal.