[SERVER-79049] Server returns unexpected CollectionUUIDMismatch with actual collection name equal to expected collection name Created: 18/Jul/23 Updated: 06/Feb/24 |
|
| Status: | Open |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 7.1.0-rc0, 6.0.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Craven Huynh | Assignee: | Jordi Olivares Provencio |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | shardingemea-qw | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | simpleRepro.js |
|
| Issue Links: |
|
| Assigned Teams: | Catalog and Routing |
| Operating System: | ALL |
| Sprint: | Sharding EMEA 2023-10-16, Sharding EMEA 2023-10-30, CAR Team 2023-11-13, CAR Team 2023-11-27, CAR Team 2023-12-11, CAR Team 2023-12-25, CAR Team 2024-01-08, CAR Team 2024-02-05, CAR Team 2024-02-19 |
| Participants: | |
| Story Points: | 2 |
| Description |
|
In a Mongosync test, we encountered a CollectionUUIDMismatch where the actual collection name was equal to the expected collection name when issuing a delete command on a sharded collection in a 3-shard cluster.
The collection UUID of interest is e97a6bd1-498d-4dbf-8477-d77190fb744b for namespace "testDB.testColl2".
A first observation from the mongos logs is that the collection testColl2 consists of 2 chunks, one on shard dst-sh01 and another on dst-sh02, leaving dst-sh03 with no chunks. This was explicitly set up by Mongosync using the "updateZoneKeyRange" command on those two shards with those specific chunks, intentionally omitting dst-sh03.
The delete command is shown in the mongos line:
This command is sent to all three shards, including dst-sh03 at localhost:28028, which doesn't have any testColl2 chunks.
The commands that are sent to dst-sh01 and dst-sh02 return without any errors.
The command sent to dst-sh03 returns with a CollectionUUIDMismatch error:
The expectedCollection is "testColl2" and the actualCollection is null.
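The mongos log lines themselves are not reproduced in this export. For illustration only, a delete carrying the collectionUUID option and the mismatch response from dst-sh03 would take roughly the following shape; the field layout is an assumption based on the description above, not a copy of the actual logs:

```python
from uuid import UUID

# Illustrative shapes only; the real log lines are omitted from this ticket.
# A delete command carrying the collectionUUID option (MongoDB 6.0+):
delete_cmd = {
    "delete": "testColl2",
    "deletes": [{"q": {}, "limit": 0}],
    "collectionUUID": UUID("e97a6bd1-498d-4dbf-8477-d77190fb744b"),
    "$db": "testDB",
}

# The response from dst-sh03 (which owns no chunks of testColl2),
# as described above: expectedCollection set, actualCollection null.
shard_error = {
    "ok": 0,
    "codeName": "CollectionUUIDMismatch",
    "db": "testDB",
    "collectionUUID": UUID("e97a6bd1-498d-4dbf-8477-d77190fb744b"),
    "expectedCollection": "testColl2",
    "actualCollection": None,  # shard has no collection with this UUID
}
```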
However, when this response gets to Mongosync the error becomes
where expectedCollection and actualCollection are both "testColl2". Mongosync was not expecting a CollectionUUIDMismatch error at all, since testColl2 exists. Because the returned actual collection name is the same as the expected collection name, the delete command is retried without changing the expected collection name, which produces the same CollectionUUIDMismatch error. In this test, Mongosync retried 5 times before giving up and erroring out.
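The retry behavior described above can be sketched as follows. This is a hypothetical reconstruction, not mongosync's actual code; the function name, response shape, and retry count handling are invented for illustration:

```python
MAX_RETRIES = 5

def run_delete_with_uuid(send, expected_name, coll_uuid):
    """Retry a UUID-pinned delete, adopting actualCollection on a mismatch."""
    name = expected_name
    for _attempt in range(MAX_RETRIES):
        resp = send(name, coll_uuid)
        if resp.get("ok") == 1:
            return resp
        if resp.get("codeName") == "CollectionUUIDMismatch":
            actual = resp.get("actualCollection")
            if actual is None or actual == name:
                # This ticket's bug: actualCollection equals the expected
                # name, so there is nothing to correct and every retry is
                # an identical command that fails the same way.
                continue
            name = actual  # the collection was renamed; chase the new name
            continue
        raise RuntimeError(f"unexpected error: {resp}")
    raise RuntimeError("CollectionUUIDMismatch persisted after retries")
```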
This might be linked to
|
| Comments |
| Comment by Haley Connelly [ 14/Sep/23 ] |
|
I spoke with jordi.serra-torrens@mongodb.com and we agree this should be tackled by sharding. I've also attached simpleRepro.js
|
| Comment by Craven Huynh [ 19/Jul/23 ] |
|
gregory.noma@mongodb.com do you mean that using the collectionUUID option in a write that might broadcast to all shards doesn't make sense? Mongosync must use the collectionUUID option in its delete many commands because otherwise we might delete documents from the wrong collection following a collection rename. We are open to exploring workarounds that don't require delete many, though. Having the server reconcile the CollectionUUIDMismatch internally would be the ideal resolution for Mongosync. |
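The rename hazard this comment describes can be sketched with a toy in-memory catalog. All names and UUIDs below are made up, and the simulated delete is not server code; it only illustrates why pinning a delete to a UUID is necessary:

```python
# name -> collection UUID, standing in for the server's catalog
catalog = {"testColl2": "uuid-A"}

def delete_many(name, expected_uuid=None):
    """Toy delete: optionally refuse when the name resolves to another UUID."""
    actual_uuid = catalog.get(name)
    if expected_uuid is not None and actual_uuid != expected_uuid:
        return {"ok": 0, "codeName": "CollectionUUIDMismatch"}
    return {"ok": 1, "deleted_from": actual_uuid}

# A concurrent rename swaps a different collection into the name "testColl2":
catalog["testColl2"] = "uuid-B"

# Unchecked, the delete silently hits the wrong collection's documents:
assert delete_many("testColl2")["deleted_from"] == "uuid-B"
# With collectionUUID pinned, the server refuses instead of misfiring:
assert delete_many("testColl2", "uuid-A")["codeName"] == "CollectionUUIDMismatch"
```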
| Comment by Jordi Serra Torrens [ 19/Jul/23 ] |
The problem is that when a router broadcasts writes to shards, it attaches ShardVersion::IGNORED instead of the proper shard version. The reason is two-fold: (i) if the write is non-idempotent, we want to avoid retrying it when a shard throws StaleConfig; (ii) large multi-writes may not converge (i.e. finish before some migration commits and changes the ShardVersion). Since ShardVersion::IGNORED means that shards will not inform the router that its routing table was stale, the router does the most pessimistic thing: it broadcasts the operation to all shards in the cluster (not only those that the router believes have chunks for the collection). |
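The pessimistic targeting described in this comment can be sketched as follows. This is a simplified model with assumed data structures, not actual router code; the shard and namespace names come from the description above:

```python
# Toy model of router targeting for a multi-write (not real mongos code).
ALL_SHARDS = {"dst-sh01", "dst-sh02", "dst-sh03"}
CHUNK_OWNERS = {"testDB.testColl2": {"dst-sh01", "dst-sh02"}}

def target_shards(namespace, shard_version_ignored):
    if shard_version_ignored:
        # With ShardVersion::IGNORED attached, shards will not report a
        # stale routing table back, so the router cannot trust its own
        # view and broadcasts to every shard in the cluster.
        return ALL_SHARDS
    # With a proper shard version, only chunk-owning shards are targeted.
    return CHUNK_OWNERS.get(namespace, set())

# The broadcast reaches dst-sh03 even though it owns no testColl2 chunks,
# which is where the spurious CollectionUUIDMismatch originates:
assert "dst-sh03" in target_shards("testDB.testColl2", True)
```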
| Comment by Craven Huynh [ 19/Jul/23 ] |
|
Maybe chunk migration is a reason why we potentially need to target all shards including those that don't own any chunks. |
| Comment by Gregory Noma [ 19/Jul/23 ] |
Hmm yeah seems like collectionUUID wouldn't work correctly in this case. Is there a reason we need to target all shards here, as opposed to only shards which own chunks for the collection? |
| Comment by Rohan Sharan [ 18/Jul/23 ] |
|
Note: based on our initial investigation, it seems that this behavior may not be self-correcting. |
| Comment by Jordi Serra Torrens [ 18/Jul/23 ] |
|
The following happened: |
| Comment by Rohan Sharan [ 18/Jul/23 ] |
|
Relinking the task logs here: https://spruce.mongodb.com/task/mongosync_amazon2_arm64_e2e_unlike_sc2_sc3_patch_4b82b7341dbe0852e893d027df22223f50edc875_64b5a54ea4cf478fdde42fc4_23_07_17_20_32_16/tests?execution=0&sortBy=STATUS&sortDir=ASC. The mongod and mongos logs are available in the files tab, with higher verbosity.
If repro instructions are needed, they can be provided. |