[SERVER-82129] FCV 5.0 Upgrade fails due to config.cache.collections missing UUIDs for most collections Created: 12/Oct/23 Updated: 29/Dec/23 Resolved: 29/Dec/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Catalog |
| Affects Version/s: | 5.0.21 |
| Fix Version/s: | 5.0.24 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Scott Glajch | Assignee: | Jordi Serra Torrens |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | car-qw | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: |
Catalog and Routing
|
||||
| Backwards Compatibility: | Fully Compatible | ||||
| Operating System: | ALL | ||||
| Steps To Reproduce: | 1. Have an old mongo cluster (perhaps before UUIDs for collections, so 3.6?) |
||||
| Sprint: | Sharding EMEA 2023-10-30, CAR Team 2023-11-13, CAR Team 2023-11-27, CAR Team 2023-12-25, CAR Team 2024-01-08 | ||||
| Participants: | |||||
| Case: | (copied to CRM) | ||||
| Story Points: | 2 | ||||
| Description |
|
When upgrading our oldest running production mongo cluster from 4.4 to 5.0 FCV (feature compatibility version), the operation fails.
We have a 30-shard cluster; each shard is a replica set.
However, when I go to the primary of some of the shards (shards 29 and 30 output below), you can see that a large number of collections have no UUID on the shard itself, according to the `config.cache.collections` collection information.
Taking one of the collections without a UUID and running a count on it on that shard shows that all of these collections contain 0 documents (on this shard). Most of these collections aren't actually empty cluster-wide; they just have no data on this specific shard.
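The check described above can be sketched as a simple filter. This is a minimal illustration, assuming the entries of `config.cache.collections` have been read from one shard into dictionaries whose `_id` is the namespace and whose `uuid` field may be absent; the helper name and the sample documents are hypothetical. Against a live shard, the equivalent query filter would be `{"uuid": {"$exists": false}}`.

```python
from typing import Iterable, List, Mapping


def namespaces_missing_uuid(cache_entries: Iterable[Mapping]) -> List[str]:
    """Return the sorted namespaces of cache entries that carry no UUID."""
    return sorted(
        entry["_id"] for entry in cache_entries if entry.get("uuid") is None
    )


# Hypothetical entries standing in for one shard's config.cache.collections.
sample_entries = [
    {"_id": "app.events", "uuid": "uuid-events"},
    {"_id": "app.legacy_b"},                  # no uuid field at all
    {"_id": "app.legacy_a", "uuid": None},    # uuid present but empty
]

if __name__ == "__main__":
    for namespace in namespaces_missing_uuid(sample_entries):
        print(namespace)
```

Running the filter per shard gives the list of namespaces to inspect further; it only reads data and changes nothing.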
We've noticed that, even before the FCV upgrade, a number of collections had been created on shards where none of their chunks lived. For a while we've been living with this situation by finding the bad collections as the balancer found them, then dropping the 0-sized collection on the shards that owned none of its documents or chunks (and only on those shards). Doing so would unblock the balancer for that collection.
We were hoping to do a full sweep of all collections on all shards to find and fix these scenarios, but since the action to be taken was "drop collection locally", we recognized that this could be a very risky move and hadn't gotten around to devoting the proper time and care to fully remediate.
In addition to logging this bug, I'd like to ask for advice on the best way forward. We are left in a situation where the target FCV is 5.0 but the actual version is 4.4 on all shards, which prevents us from running some operations cleanly (we had trouble reconfiguring a replica set yesterday, and we're not sure what else will break in this scenario in the medium term).
One idea would be to manually set the UUIDs on all shards using the source of truth (the config server itself), as they do seem to be set there. Another would be to systematically verify that these collections are indeed "phantom" or "orphaned" on those shards and drop them all.
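The two ideas above can be combined into a read-only report before anything is modified. The following is only a sketch under stated assumptions: the function name, the input shapes, and the three buckets are hypothetical, and the inputs (the shard's cache entries, a namespace-to-UUID map taken from the config server, and the set of namespaces that own chunks on this shard) are assumed to have been collected separately. It is not a procedure endorsed in this ticket.

```python
from typing import Dict, Iterable, List, Mapping, Set, TypedDict


class Plan(TypedDict):
    patch_uuid: Dict[str, str]   # namespace -> UUID taken from the config server
    phantom: List[str]           # no chunks on this shard: drop candidates
    unknown: List[str]           # config server has no UUID: investigate by hand


def plan_remediation(
    cache_entries: Iterable[Mapping],
    config_uuids: Mapping[str, str],
    namespaces_with_chunks: Set[str],
) -> Plan:
    """Classify every UUID-less cache entry; nothing is written anywhere."""
    plan: Plan = {"patch_uuid": {}, "phantom": [], "unknown": []}
    for entry in cache_entries:
        if entry.get("uuid") is not None:
            continue  # entry is already healthy
        namespace = entry["_id"]
        if namespace not in config_uuids:
            plan["unknown"].append(namespace)
        elif namespace in namespaces_with_chunks:
            plan["patch_uuid"][namespace] = config_uuids[namespace]
        else:
            plan["phantom"].append(namespace)
    return plan


# Hypothetical inputs for a single shard.
example_plan = plan_remediation(
    cache_entries=[
        {"_id": "app.orders"},
        {"_id": "app.old_logs"},
        {"_id": "app.mystery"},
        {"_id": "app.users", "uuid": "uuid-users"},
    ],
    config_uuids={"app.orders": "uuid-orders", "app.old_logs": "uuid-old-logs"},
    namespaces_with_chunks={"app.orders", "app.users"},
)
```

Keeping the classification separate from any write makes it possible to review the "phantom" list by hand, which matters here because dropping a collection locally was already identified as the risky step.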
|
| Comments |
| Comment by Githook User [ 28/Dec/23 ] | |||||||||||||||||||||
|
Author: {'name': 'Jordi Serra Torrens', 'email': 'jordist@users.noreply.github.com', 'username': 'jordist'} Message: GitOrigin-RevId: 5beaa1ce72d4a5f490623dcc7b67ecb11864a28c | |||||||||||||||||||||
| Comment by Kaloian Manassiev [ 27/Nov/23 ] | |||||||||||||||||||||
|
Apologies for the delayed response, and thank you very much for all the investigation effort that you have made. Your findings make perfect sense.
I have written a longer context for the problem below, but the TL;DR is that I believe the safest fix would be to remove the required flag for the UUID field so that the refresh can proceed and properly patch it up. This will also guarantee that any post-5.0 clusters will enter with the cache collections correctly updated.
These are the factors that contributed to this problem:
| |||||||||||||||||||||
| Comment by Scott Glajch [ 19/Oct/23 ] | |||||||||||||||||||||
|
Hello there, I have 2 pieces of news:
2nd, I was able to reproduce this FCV upgrade failure in a local environment using my 1-6 steps in the comment above. Basically my steps were:
4. Find out which shard the initial chunk got put into when the collection got sharded
6. In a new terminal tab, log in to a different shard node and paste that object, without the UUID, into an object like so, and then insert it:
7. Attempt to do the FCV 5.0 upgrade:
| |||||||||||||||||||||
| Comment by Scott Glajch [ 18/Oct/23 ] | |||||||||||||||||||||
|
We are using 5.0.21.
I think I have some more information to help define the shape of the problem. I'm considering trying to reproduce this manually in a local environment by doing a few steps to mimic this environment.
There are a bunch of issues/failure scenarios, but the one I think is most problematic for the FCV 5.0 upgrade is this:
1. There exists a collection at the top level (dbName.collectionName.exists() returns non-null)
In fact, each of our 30 shards has somewhere between 7000 and 13000 entries like this (config.cache.collections has an entry on that shard, but without a UUID, and the collection doesn't actually exist on that shard, nor does the shard have any chunks assigned for it).
Regardless, I have cleaned up both of those instances using a looping script. After executing those updates, I've restarted the FCV 4.4 -> 5.0 upgrade, and instead of failing within the first few minutes as it did before, it's chugging along now.
That being said, I think the following data structure might reproduce the issue (in lieu of having to go back to mongo 3.6 or earlier and reproduce bugs there...) | |||||||||||||||||||||
| Comment by Kaloian Manassiev [ 17/Oct/23 ] | |||||||||||||||||||||
|
Thank you sglajch@evergage.com for reporting this issue. Can you please let me know exactly which 5.0 version you are using? So far I have been able to confirm that sharded collections created prior to FCV 3.6 will not contain the UUID field in the config.cache.collections entry (because that's when we introduced it) and that the UUID will not be added until the node reaches the 5.0 binary (even in FCV 5.0). However, I have not been able to reproduce the setFCV problem yet. | |||||||||||||||||||||
| Comment by Scott Glajch [ 12/Oct/23 ] | |||||||||||||||||||||
|
I'm able to confirm, using the collection.exists() API (which comes back as null for these collections), that these collections don't actually exist (at least not anymore) on the shards. They exist only in the config.cache.collections list, and without a UUID, so I think patching up the UUIDs is the right move then? |