[SERVER-82129] FCV 5.0 Upgrade fails due to config.cache.collections missing UUIDs for most collections Created: 12/Oct/23  Updated: 29/Dec/23  Resolved: 29/Dec/23

Status: Closed
Project: Core Server
Component/s: Catalog
Affects Version/s: 5.0.21
Fix Version/s: 5.0.24

Type: Bug Priority: Major - P3
Reporter: Scott Glajch Assignee: Jordi Serra Torrens
Resolution: Fixed Votes: 1
Labels: car-qw
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File fcv_upgrade_fails_without_uuid.js    
Issue Links:
Related
Assigned Teams:
Catalog and Routing
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

1. Have an old mongo cluster (perhaps before UUIDs for collections, so 3.6?)
2. Have lots of collections
3. Upgrade mongo until binaries are at 5.0 and FCV is at 4.4
4. Attempt to execute FCV upgrade to 5.0

Sprint: Sharding EMEA 2023-10-30, CAR Team 2023-11-13, CAR Team 2023-11-27, CAR Team 2023-12-25, CAR Team 2024-01-08
Participants:
Case:
Story Points: 2

 Description   

When upgrading our oldest running production mongo cluster from 4.4 to 5.0 FCV (feature compatibility version), the operation fails.
This mongo cluster has been continuously running since the early days, something like 2.4 or 2.6. Our other, newer clusters did not have this issue.

mongos> db.adminCommand( { setFeatureCompatibilityVersion: "5.0", writeConcern: { w: "majority", wtimeout: 900000 } } )
{
    "ok" : 0,
    "errmsg" : "Failed command { _flushRoutingTableCacheUpdatesWithWriteConcern: \"REDACTED_DB1.REDACTED_COLLECTION1\", syncFromConfig: true, writeConcern: { w: \"majority\", wtimeout: 60000 } } for database 'admin' on shard 'rs_prod1_shard24' :: caused by :: Failed to read persisted collections entry for collection 'REDACTED_DB1.REDACTED_COLLECTION1'. :: caused by :: Failed to read the 'REDACTED_DB1.REDACTED_COLLECTION1' entry locally from config.collections :: caused by :: BSON field 'ShardCollectionType.uuid' is missing but a required field",
    "code" : 40414,
    "codeName" : "Location40414",
    "$clusterTime" : {
        "clusterTime" : Timestamp(1696887685, 3143),
        "signature" : {
            "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
            "keyId" : NumberLong(0)
        }
    },
    "operationTime" : Timestamp(1696887680, 5271)

We have a 30-shard cluster, where each shard is a replica set.
When I inspect the config.collections metadata from one of our mongos nodes, I see that all collections do have a UUID:

mongos> config.collections.count({uuid: {$exists: 0}})
0
mongos> config.collections.count({uuid: {$exists: 1}})
79508

However, when I go to the primary for some of the shards (output for shards 29 and 30 below), you can see that a ton of collections have no UUID directly on the shard, according to the `config.cache.collections` collection information.

rs_prod1_shard29:PRIMARY> config.cache.collections.count({uuid: {$exists: 0}})
18337
rs_prod1_shard29:PRIMARY> config.cache.collections.count({uuid: {$exists: 1}})
56378

rs_prod1_shard30:PRIMARY> config.cache.collections.count({uuid: {$exists: 0}})
29756
rs_prod1_shard30:PRIMARY> config.cache.collections.count({uuid: {$exists: 1}})
36651 

Taking one of the collections without a UUID and counting its documents on that shard shows 0; in fact, all of these collections have 0 documents on this shard (even if some of them have data on other shards).

// On rs_prod1_shard30 PRIMARY
var config = db.getSiblingDB("config");
var collectionsWithoutUUID = config.cache.collections.find({uuid: {$exists: 0}});
while (collectionsWithoutUUID.hasNext()) {
    var collectionCacheInfo = collectionsWithoutUUID.next();
    // Split only on the first '.' so collection names containing dots are preserved.
    var dotIndex = collectionCacheInfo._id.indexOf(".");
    var databaseName = collectionCacheInfo._id.substring(0, dotIndex);
    var collectionName = collectionCacheInfo._id.substring(dotIndex + 1);
    // Print any collection that still has documents on this shard.
    var count = db.getSiblingDB(databaseName).getCollection(collectionName).count();
    if (count !== 0) {
        print(collectionCacheInfo._id + "     -    " + count);
    }
}

We've noticed that, even before the FCV upgrade, a number of collections existed on shards where their chunks didn't live.
So you'd have a collection where, say, 1 (or many!) chunks all live on 1 shard, but other shards have the collection as existing even though they don't own any chunks for it.
These empty/phantom collections would have weird properties. Often they didn't have the index definitions that the (real) chunk-owning shard had for the collection, and often they had a UUID mismatch as well.
In some cases that UUID mismatch would prevent chunks from getting distributed, which in turn slowed down or blocked the balancer: the balancer would (correctly) find these chunks as needing to be moved but be unable to do so, and would keep trying every balancer run.

For a while we've been living with this situation by finding the bad collections as the balancer hit them, going to the 0-sized collection on the shards that owned no chunks for it, and dropping it from those shards only; doing so would unblock the balancer for that collection.
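For reference, a minimal sketch of the kind of ownership check we run first, on a mongos, before considering any local drop. The namespace and shard name below are placeholders, and it assumes FCV 4.4, where config.chunks documents are still keyed by an ns field:

// On a MongoS node: verify the shard owns no chunks for the namespace before any local cleanup
var ns = "someDB.someCollection";      // placeholder namespace
var shardName = "rs_prod1_shard30";    // placeholder shard name
var chunksOnShard = db.getSiblingDB("config").chunks.count({ns: ns, shard: shardName});
print(ns + " has " + chunksOnShard + " chunk(s) on " + shardName +
      (chunksOnShard === 0 ? " - candidate for local cleanup" : " - do NOT drop locally"));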

We were hoping to do a full sweep of all collections on all shards to find and fix these scenarios, but since the action to be taken was "drop collection locally", we recognized that this could be a very risky move and hadn't yet devoted the proper time/care to fully remediate it.

In addition to logging this bug, I'd like to ask for advice on the best way forward.

We are left in a situation where the target FCV version is 5.0 but the actual version is 4.4 in all shards, which prevents some operations from running cleanly (we had trouble reconfiguring a replica set yesterday, and we're not sure what else will break in this scenario medium-term).

One idea would be to manually set the UUID fields on all shards using the source of truth (the config server itself), as the UUIDs do seem to be set there.

Another would be to systematically ensure that these collections are indeed "phantom" or "orphaned" on those shards and drop them all.

 Comments   
Comment by Githook User [ 28/Dec/23 ]

Author:

{'name': 'Jordi Serra Torrens', 'email': 'jordist@users.noreply.github.com', 'username': 'jordist'}

Message: SERVER-82129 Discard ShardServerCatalogCacheLoader persisted cache if it's missing the uuid field (#17758)

GitOrigin-RevId: 5beaa1ce72d4a5f490623dcc7b67ecb11864a28c
Branch: v5.0
https://github.com/mongodb/mongo/commit/be3f898793eb9d85a1c12833625a91e554878e9d

Comment by Kaloian Manassiev [ 27/Nov/23 ]

Hi sglajch@evergage.com,

Apologies for the delayed response and thank you very much for all the investigation effort that you have made. Your findings make perfect sense.

I have written a longer context for the problem below, but TL;DR I believe that the safest fix would be to remove the required flag for the uuid field in order to get the refresh to proceed and properly patch it up. This will also guarantee that any cluster moving past 5.0 enters with its cache collections correctly updated.

In more detail, these are the factors that contributed to this problem:

  1. When chunks are migrated off a shard, we don't actually have a process to clear the config.cache.collections/chunks entries for the collections they belong to, even if the last chunk leaves the shard. Because of this, stale entries remain on previous owners of chunks, such as the DB primary shard.
  2. As part of the 5.0 upgrade procedure, after we add the UUID and Timestamp fields to the entries on the ConfigShard, we tell all shards to sync with it, not just the current owners of these chunks. The reason we cannot limit this to just the current owners is that we would have had to fully implement the shard versioning protocol even for this part of the process, and otherwise we could potentially miss shards (or still send to the wrong shards). Because of this we applied a blanket solution which simply tells all the shards to sync their caches with whatever we just wrote on the ConfigShard.
  3. In version 5.0 we made the UUID field of the config.cache.collections document required, and the first step that a shard refresh takes is to read and parse that entry. Because these old entries have been there for a very long time and never got patched up, that parse now complains that the UUID is missing.
Comment by Scott Glajch [ 19/Oct/23 ]

Hello there,

I have 2 pieces of news:
1st, after repairing the missing UUIDs on all the shards themselves, we were able to get the FCV upgrade from 4.4 to 5.0 to work! It took about a day for the FCV upgrade to create all the new config.cache.chunks collections (and their indexes) that live on each shard, but it worked!
Before I move on from that point, I'd like to note that the ONLY scenario I manually fixed up prior to the migration is the case where a shard's config.cache.collections had entries with missing UUIDs, AND those UUIDs existed on the configserver in config.collections.
I did NOT repair the cases where a shard-local config.cache.collections entry was missing its UUID and the collection also didn't exist on the configserver in config.collections. About 15K such entries are still present in each of our 30 shards, but they didn't seem to block the FCV upgrade.
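Here is a rough sketch of how to count the split between those two cases from a shard primary; the mongos host below is a placeholder and it assumes an already-authenticated connection from the shard primary to a mongos:

// On a shard primary: classify config.cache.collections entries missing a UUID
var mongosConfig = new Mongo("mongos.example.net:27017").getDB("config");  // placeholder mongos host
var repairable = 0, orphanedEntries = 0;
db.getSiblingDB("config").cache.collections.find({uuid: {$exists: 0}}).forEach(function (entry) {
    if (mongosConfig.collections.findOne({_id: entry._id, uuid: {$exists: 1}})) {
        repairable++;       // UUID exists on the configserver: the case I fixed up
    } else {
        orphanedEntries++;  // namespace no longer tracked in config.collections: left as-is
    }
});
print("missing UUID, present in config.collections: " + repairable);
print("missing UUID, absent from config.collections: " + orphanedEntries);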

2nd, I was able to reproduce this FCV upgrade failure in a local environment using steps 1-6 from my earlier comment.

Basically my steps were:
1. Set up a 4.4 mongo binary cluster (I used 4.4.16, as those were the binaries I had on hand) with sharded replica sets. Single-node replica sets were sufficient for this (though obviously in production we don't use single-node replica sets).
2. Stop the processes and upgrade the mongo binaries to 5.0 (I used 5.0.21)
3. Start the binaries and create a sharded collection.

// On the MongoS node
var testDB = db.getSiblingDB("test123");
testDB.createCollection("shardedColl1");
sh.enableSharding("test123");
sh.shardCollection("test123.shardedColl1", {_id: 1});

4. Find out which shard the initial chunk got put into when the collection got sharded
5. Take the output of the entry on that shard by going to the primary shard node and running:

// On a shard node where the chunk lives
var config = db.getSiblingDB("config");
config.cache.collections.find({_id: "test123.shardedColl1"});

6. In a new terminal tab, log in to a different shard node, paste that document (without the UUID) into an object like the one below, and then insert it:

// On a different shard node than where the chunk lives
var config = db.getSiblingDB("config");
var badCachedCollectionObject = {
    "_id" : "test123.shardedColl1",
    "epoch" : ObjectId("6531789ed086d9c7e8a8230b"), // example
    "key" : {
        "_id" : 1
    },
    "unique" : false,
    "refreshing" : false,
    "lastRefreshedCollectionVersion" : Timestamp(1, 0)
}
// Note: the "uuid" field is deliberately omitted
config.cache.collections.insert(badCachedCollectionObject);

7. Attempt to do the FCV 5.0 upgrade:

// On a MongoS node
db.adminCommand( { setFeatureCompatibilityVersion: "5.0", writeConcern: { w: "majority", wtimeout: 900000 } } );

Comment by Scott Glajch [ 18/Oct/23 ]

We are using 5.0.21

I think I have some more information to help the shape of the problem.

I'm considering trying to reproduce this with a local environment manually by doing a few steps to mimic this environment.

There are a bunch of issues/failure scenarios, but the one I think is most problematic to the FCV 5.0 upgrade is this:

1. There exists a collection at the top level (dbName.collectionName.exists() returns non-null)
2. This collection is sharded and has only 1 empty chunk (most of the time, anyway); the configserver sees it in the sharding information in config.collections
3. This collection has a valid UUID at the configserver level (check config.collections.findOne({_id: "dbName.collectionName"}).uuid)
4a. This collection exists on the one shard where that chunk lives (mostly what I've found). On the shard itself, run dbName.collectionName.exists()
4b. This collection's UUID on THIS shard matches the UUID up on the configserver (dbName.collectionName.exists().info.uuid)
5a. This collection does NOT exist on the other shards, but on some of them it DOES have an entry in config.cache.collections (config.cache.collections.findOne({_id: "dbName.collectionName"}))
5b. On those shards, the config.cache.collections entry either has a UUID that mismatches the configserver's (rare) or is missing the UUID entirely (more common)

In fact, each of our 30 shards has somewhere between 7,000 and 13,000 entries like this (config.cache.collections has an entry on that shard, but without a UUID, and the collection doesn't actually exist on that shard nor have any chunks assigned to it).
In each of our 30 shards there were only 5-10 collections where the UUID was mismatched rather than missing. Our theory (based on the names and history of those collections) is that those are cases that used to look like the above (extra hanging info about the collection on said shard), but we later dropped the collection and re-created one with the same name. This would explain the configserver and the chunk-owning shard getting a new UUID while the other shards did not.

Regardless, I have cleaned up both of those cases using a looping script.
I looked up the actual configserver-approved UUID for each collection, looped over the primary of each of the 30 shards, then looped over the config.cache.collections entries without UUIDs and did an update on each entry to set the proper UUID.
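A minimal sketch of that kind of repair pass is below. The shard primary host list is a placeholder, it assumes already-authenticated direct connections from the shell, and it only covers the missing-UUID case (the mismatched-UUID case was handled separately):

// Run from a mongos shell session: copy authoritative UUIDs from config.collections
// down into each shard's config.cache.collections entries that are missing one.
var shardPrimaryHosts = [              // placeholder host list, one entry per shard primary
    "shard1-primary.example.net:27018",
    "shard2-primary.example.net:27018"
];

// Build a namespace -> UUID map from the configserver's authoritative metadata.
var authoritativeUUIDs = {};
db.getSiblingDB("config").collections.find({uuid: {$exists: 1}}).forEach(function (coll) {
    authoritativeUUIDs[coll._id] = coll.uuid;
});

shardPrimaryHosts.forEach(function (host) {
    var shardConfig = new Mongo(host).getDB("config");
    shardConfig.cache.collections.find({uuid: {$exists: 0}}).forEach(function (entry) {
        var uuid = authoritativeUUIDs[entry._id];
        if (uuid) {
            // Stale cache entry but the collection is still tracked: set the proper UUID.
            shardConfig.cache.collections.update({_id: entry._id}, {$set: {uuid: uuid}});
        } else {
            // No authoritative UUID (collection no longer in config.collections): skip it.
            print("skipping " + entry._id + " on " + host);
        }
    });
});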

After executing those updates, I've restarted the FCV 4.4 -> 5.0 upgrade, and instead of failing within the first few minutes as before, it's chugging along now.
Judging by how long this took in other mongo deployments/clusters, combined with the number of chunks in this cluster, I estimate that the FCV upgrade will take just over 1 full day, so hopefully I can report back about it in a day or two.

That being said, I think the following setup might reproduce the issue (in lieu of having to go back to mongo 3.6 or earlier and reproducing the original bugs there...)
I'm hoping to try this soon locally:
1. Install mongo 4.4 and add a few shards.
2. Add some sharded collections. Don't bother adding data; I don't think that's necessary. Leave them as empty 1-chunk collections.
3. Manually go to the primaries for shards that DON'T house the single chunks for these collections.
4. On those shards, manually add an entry to config.cache.collections with the right values, but omit the UUID field from the document.
5. Upgrade to mongo 5.0 binaries
6. Attempt FCV upgrade

Comment by Kaloian Manassiev [ 17/Oct/23 ]

Thank you sglajch@evergage.com for reporting this issue. Can you please let me know exactly which 5.0 version you are using?

So far I was able to confirm that sharded collections created prior to FCV 3.6 will not contain the UUID field in the config.cache.collections entry (because that's when we introduced it), and that the UUID will not be added until the node reaches the 5.0 binary (even in FCV 5.0).

However I have not been able to reproduce the setFCV problem yet.

Comment by Scott Glajch [ 12/Oct/23 ]

I'm able to confirm, using the collection.exists() API (which comes back as null for these collections), that these collections don't actually exist (at least not anymore) on the shards. They just exist in the config.cache.collections list, but without a UUID, so I think patching up the UUIDs is the right move then?
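For example, on a shard primary, something along these lines shows the mismatch (a sketch; it just picks one UUID-less cache entry and checks whether the collection exists locally):

// On a shard primary: pick one config.cache.collections entry without a UUID
// and check whether the underlying collection actually exists on this shard.
var entry = db.getSiblingDB("config").cache.collections.findOne({uuid: {$exists: 0}});
if (entry !== null) {
    var dot = entry._id.indexOf(".");
    var localColl = db.getSiblingDB(entry._id.substring(0, dot))
                      .getCollection(entry._id.substring(dot + 1));
    // exists() returns null when the collection is not present locally.
    print(entry._id + " exists locally: " + (localColl.exists() !== null));
}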
