[SERVER-79049] Server returns unexpected CollectionUUIDMismatch with actual collection name equal to expected collection name Created: 18/Jul/23  Updated: 06/Feb/24

Status: Open
Project: Core Server
Component/s: None
Affects Version/s: 7.1.0-rc0, 6.0.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Craven Huynh Assignee: Jordi Olivares Provencio
Resolution: Unresolved Votes: 0
Labels: shardingemea-qw
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File simpleRepro.js    
Issue Links:
Problem/Incident
is caused by SERVER-63285 Ensure CollectionUUIDMismatch error f... Closed
Related
related to SERVER-76624 Server can falsely report CollectionU... Closed
Assigned Teams:
Catalog and Routing
Operating System: ALL
Sprint: Sharding EMEA 2023-10-16, Sharding EMEA 2023-10-30, CAR Team 2023-11-13, CAR Team 2023-11-27, CAR Team 2023-12-11, CAR Team 2023-12-25, CAR Team 2024-01-08, CAR Team 2024-02-05, CAR Team 2024-02-19
Participants:
Story Points: 2

 Description   

In a Mongosync test, we encountered a CollectionUUIDMismatch error in which the actual collection name was equal to the expected collection name, when issuing a delete command on a sharded collection in a 3-shard cluster.

 

The collection UUID of interest is e97a6bd1-498d-4dbf-8477-d77190fb744b for namespace "testDB.testColl2".

 

A first observation from the mongos logs is that the collection testColl2 consists of 2 chunks, one on shard dst-sh01 and another on dst-sh02, leaving dst-sh03 with no chunks. This layout was explicitly set up by Mongosync using the "updateZoneKeyRange" command to pin those specific ranges to the two shards, intentionally omitting dst-sh03.
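For context, the following is a minimal mongo shell sketch of how such a layout can be produced (the zone names and the exact sequence of calls are assumptions; only the shard names, the shard key { _id: 1 }, and the split point at _id: 74 come from the log excerpt below):

// Hypothetical reconstruction of the zone setup described above; zone names are invented.
sh.addShardToZone("dst-sh01", "zoneA");
sh.addShardToZone("dst-sh02", "zoneB");

// Assign the entire key space to the two zones; dst-sh03 is intentionally left out.
sh.updateZoneKeyRange("testDB.testColl2", { _id: MinKey }, { _id: 74 }, "zoneA");
sh.updateZoneKeyRange("testDB.testColl2", { _id: 74 }, { _id: MaxKey }, "zoneB");

// Sharding the collection afterwards creates one chunk per zone range, matching the
// two chunks (on dst-sh01 and dst-sh02) visible in the config.collections response below.
sh.shardCollection("testDB.testColl2", { _id: 1 });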

{"t":{"$date":"2023-07-17T20:36:38.985+00:00"},"s":"D3", "c":"EXECUTOR", "id":22608,   "ctx":"ShardRegistry","msg":"Received remote response","attr":{"response":"RemoteOnAnyResponse --  cmd: { cursor: { firstBatch: [ { _id: \"testDB.testColl2\", lastmodEpoch: ObjectId('64b5a6557ce7fa44
aa9b9b67'), lastmod: new Date(1689626198046), timestamp: Timestamp(1689626197, 140), uuid: UUID(\"e97a6bd1-498d-4dbf-8477-d77190fb744b\"), key: { _id: 1 }, unique: false, noBalance: false }, { chunks: { _id: ObjectId('64b5a656bbcf3f85c9eb21f7'), uuid: UUID(\"e97a6bd1-498d-4dbf-8477-d77
190fb744b\"), min: { _id: MinKey }, max: { _id: 74 }, shard: \"dst-sh01\", lastmod: Timestamp(1, 0), onCurrentShardSince: Timestamp(1689626197, 140), history: [ { validAfter: Timestamp(1689626197, 140), shard: \"dst-sh01\" } ] } }, { chunks: { _id: ObjectId('64b5a656bbcf3f85c9eb21f8'),
 uuid: UUID(\"e97a6bd1-498d-4dbf-8477-d77190fb744b\"), min: { _id: 74 }, max: { _id: MaxKey }, shard: \"dst-sh02\", lastmod: Timestamp(1, 1), onCurrentShardSince: Timestamp(1689626197, 140), history: [ { validAfter: Timestamp(1689626197, 140), shard: \"dst-sh02\" } ] } } ], id: 0, ns: 
\"config.collections\", atClusterTime: Timestamp(1689626198, 120) }, ok: 1.0, $clusterTime: { clusterTime: Timestamp(1689626198, 134), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, $configTime: Timestamp(1689626198, 120), $topologyTime: Timestam
p(1689626063, 2), operationTime: Timestamp(1689626198, 120) } target: localhost:28121 status: OK elapsedMicros: 2900 μs moreToCome: false"}} 

 

The delete command is shown in the mongos line: 

{"t":{"$date":"2023-07-17T20:36:38.986+00:00"},"s":"D4", "c":"ASIO",     "id":22596,   "ctx":"conn259","msg":"startCommand","attr":{"request":"RemoteCommand 12610 -- target:[localhost:28028] db:testDB expDate:2023-07-17T20:41:38.990+00:00 cmd:{ delete: \"testColl2\", bypassDocumentVali
dation: false, ordered: false, collectionUUID: UUID(\"e97a6bd1-498d-4dbf-8477-d77190fb744b\"), deletes: [ { q: { $and: [ { $expr: { $gte: [ \"$_id\", { $literal: 0 } ] } }, { $expr: { $lt: [ \"$_id\", { $literal: 99 } ] } } ] }, limit: 0, hint: { _id: 1 } } ], shardVersion: { e: Object
Id('00000000ffffffffffffffff'), t: Timestamp(4294967295, 4294967295), v: Timestamp(0, 0) }, writeConcern: { w: \"majority\", j: true, wtimeout: 120000 }, lsid: { id: UUID(\"6e1d23ea-ac74-484f-ab44-341b08fcbfac\"), uid: BinData(0, FFEE4C8F085C89ED4D839F06F64A1B5513D7CA6A5BC5F3EE7052E138
5F4965D3) } }"}}
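For readability, the same delete can be expressed as a plain mongo shell command (a sketch: the query, UUID, and write concern are copied from the log above; the internal fields mongos adds, such as shardVersion and lsid, are omitted):

db.getSiblingDB("testDB").runCommand({
    delete: "testColl2",
    ordered: false,
    bypassDocumentValidation: false,
    // The UUID the client expects the name "testColl2" to resolve to:
    collectionUUID: UUID("e97a6bd1-498d-4dbf-8477-d77190fb744b"),
    deletes: [{
        q: { $and: [
            { $expr: { $gte: ["$_id", { $literal: 0 }] } },
            { $expr: { $lt:  ["$_id", { $literal: 99 }] } }
        ] },
        limit: 0,          // delete-many; not targeted by exact shard key equality
        hint: { _id: 1 }
    }],
    writeConcern: { w: "majority", j: true, wtimeout: 120000 }
});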

This command is sent to all three shards, including dst-sh03 at localhost:28028, which doesn't have any testColl2 chunks.

 

The commands that are sent to dst-sh01 and dst-sh02 return without any errors.

 

The command sent to dst-sh03 returns with a CollectionUUIDMismatch error:

{"t":{"$date":"2023-07-17T20:36:38.987+00:00"},"s":"D2", "c":"ASIO",     "id":22597,   "ctx":"conn259","msg":"Request finished with response","attr":{"requestId":12610,"isOK":true,"response":"{ n: 0, electionId: ObjectId('7fffffff0000000000000002'), opTime: { ts: Timestamp(1689626198, 76), t: 2 }, writeErrors: [ { index: 0, code: 361, errmsg: \"Collection UUID does not match that specified\", db: \"testDB\", collectionUUID: UUID(\"e97a6bd1-498d-4dbf-8477-d77190fb744b\"), expectedCollection: \"testColl2\", actualCollection: null } ], ok: 1.0, $clusterTime: { clusterTime: Timestamp(1689626198, 134), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, $configTime: Timestamp(1689626198, 120), $topologyTime: Timestamp(1689626063, 2), operationTime: Timestamp(1689626198, 76) }"}} 

The expectedCollection is "testColl2" and the actualCollection is null.

 

However, by the time this response reaches Mongosync, the error has become:

{\"index\": {\"$numberInt\":\"0\"},\"code\": {\"$numberInt\":\"361\"},\"errmsg\": \"Collection UUID does not match that specified\",\"db\": \"testDB\",\"collectionUUID\": {\"$binary\":{\"base64\":\"6Xpr0UmNTb+Ed9dxkPt0Sw==\",\"subType\":\"04\"}},\"expectedCollection\": \"testColl2\",\"actualCollection\": \"testColl2\"}"} 

where expectedCollection and actualCollection are both "testColl2".

Mongosync was not expecting to receive a CollectionUUIDMismatch error at all since testColl2 exists.

Since the returned actual collection name is the same as the expected collection name, the delete command is retried without changing the expected collection name. This results in the same CollectionUUIDMismatch error. In this test, Mongosync retries 5 times before giving up and erroring out.
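The following sketch (hypothetical mongo shell code, not Mongosync's actual implementation) illustrates why an error whose actualCollection equals the expectedCollection can never be resolved by the usual retry-with-the-reported-name strategy:

// Hypothetical illustration of a client-side retry loop for CollectionUUIDMismatch.
function deleteWithUuidRetry(db, collName, uuid, deleteDoc, maxRetries) {
    let target = collName;
    for (let attempt = 0; attempt < maxRetries; attempt++) {
        const res = db.runCommand({ delete: target, collectionUUID: uuid, deletes: [deleteDoc] });
        const err = (res.writeErrors || [])[0];
        if (!err || err.code !== 361 /* CollectionUUIDMismatch */) {
            return res;
        }
        // Normally the mismatch means the collection was renamed, so retry against
        // the name the server reports as actually owning the UUID...
        target = err.actualCollection;
        // ...but here actualCollection === expectedCollection ("testColl2"), so every
        // retry re-sends an identical command and hits the same error again.
    }
    throw new Error("CollectionUUIDMismatch persisted after " + maxRetries + " retries");
}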

 

This might be linked to SERVER-76624. It is important for Mongosync that this gets resolved.

 



 Comments   
Comment by Haley Connelly [ 14/Sep/23 ]

I spoke with jordi.serra-torrens@mongodb.com and we agree this should be tackled by sharding.  I've also attached simpleRepro.js which reproduces the behavior.

 

Comment by Craven Huynh [ 19/Jul/23 ]

gregory.noma@mongodb.com do you mean that using the collectionUUID option in a write that might broadcast to all shards doesn't make sense?

Mongosync must use the collectionUUID option in a delete-many command because otherwise we might delete documents from the wrong collection following a collection rename. We are open to exploring workarounds that don't require the use of delete-many, though.

Having the server reconcile the CollectionUUIDMismatch internally would be the ideal resolution for Mongosync.
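To illustrate the rename hazard mentioned above (an editorial sketch; the rename scenario and any collection name other than testColl2 are hypothetical): a delete issued purely by name lands on whatever collection currently holds that name, whereas a delete with collectionUUID is rejected if the name no longer maps to the expected UUID.

// The client captured the UUID of "testDB.testColl2" at some earlier point.
const uuid = db.getSiblingDB("testDB").getCollectionInfos({ name: "testColl2" })[0].info.uuid;

// Suppose another process then renames testColl2 away and renames some other
// collection to testColl2. A plain delete-many by name would now remove documents
// from the *new* testColl2. With collectionUUID attached, the server instead fails
// with CollectionUUIDMismatch and reports the namespace that actually owns the UUID.
db.getSiblingDB("testDB").runCommand({
    delete: "testColl2",
    collectionUUID: uuid,
    deletes: [{ q: {}, limit: 0 }]
});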

Comment by Jordi Serra Torrens [ 19/Jul/23 ]

Is there a reason we need to target all shards here, as opposed to only shards which own chunks for the collection?

The problem is that when a router broadcasts writes to shards, it attaches ShardVersion::IGNORED instead of the proper shard version. The reason is two-fold: (i) if the write is non-idempotent we want to avoid retrying it if one shard throws StaleConfig; (ii) large multi-writes may not converge (i.e. finish before some migration commits and changes the ShardVersion). Since ShardVersion::IGNORED means that shards will not inform the router that its routing table was stale, the router does the most pessimistic thing: broadcast the operation to all shards in the cluster (not only those that the router believes have chunks for the collection).

Comment by Craven Huynh [ 19/Jul/23 ]

Maybe chunk migration is a reason why we potentially need to target all shards including those that don't own any chunks.

Comment by Gregory Noma [ 19/Jul/23 ]

1. Client executed a delete with expectedUUID, limit=0 and not targeted by exact shard key equality query. Therefore, the router decides it needs to target ALL shards – even those that do not have any chunks (such as sh3)

Hmm yeah seems like collectionUUID wouldn't work correctly in this case. Is there a reason we need to target all shards here, as opposed to only shards which own chunks for the collection?

Comment by Rohan Sharan [ 18/Jul/23 ]

Note: based on our initial investigation, it seems that this behavior may not be self-correcting.

Comment by Jordi Serra Torrens [ 18/Jul/23 ]

The following happened:
1. Client executed a delete with expectedUUID, limit=0 and not targeted by exact shard key equality query. Therefore, the router decides it needs to target ALL shards – even those that do not have any chunks (such as sh3)
2. Since sh3 does not own any chunk of the collection, it doesn't even have the collection created locally. Therefore, when it executes its part of the delete, the expectedUUID check does not find any local collection with that uuid, so it fails with CollectionUUIDMismatch and reports actualCollection=null.
3. The router catches this error and populates the actualCollection by asking the db-primary shard for the namespace associated with the wanted uuid. Finally, the CollectionUUIDMismatch error returned to the client has actualNss == expectedNss.
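To make step 1 concrete (an editorial sketch reusing the shard key { _id: 1 } and UUID from the logs above): a delete whose query pins an exact shard key value can be routed to a single owning shard, while a range predicate with limit: 0 cannot, so the router broadcasts it, including to dst-sh03.

// Targeted: exact equality on the shard key lets mongos pick one shard.
db.getSiblingDB("testDB").runCommand({
    delete: "testColl2",
    collectionUUID: UUID("e97a6bd1-498d-4dbf-8477-d77190fb744b"),
    deletes: [{ q: { _id: 42 }, limit: 1 }]
});

// Broadcast: a range predicate with limit: 0 (delete-many) gives mongos no single
// owning shard, so the write is sent to every shard in the cluster -- including
// dst-sh03, which owns no chunks and has no local collection for this UUID.
db.getSiblingDB("testDB").runCommand({
    delete: "testColl2",
    collectionUUID: UUID("e97a6bd1-498d-4dbf-8477-d77190fb744b"),
    deletes: [{ q: { _id: { $gte: 0, $lt: 99 } }, limit: 0 }]
});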

Comment by Rohan Sharan [ 18/Jul/23 ]

Relinking the task logs here: https://spruce.mongodb.com/task/mongosync_amazon2_arm64_e2e_unlike_sc2_sc3_patch_4b82b7341dbe0852e893d027df22223f50edc875_64b5a54ea4cf478fdde42fc4_23_07_17_20_32_16/tests?execution=0&sortBy=STATUS&sortDir=ASC. The mongod and mongos logs are available in the files tab, with higher verbosity. 

 

If repro instructions are needed, they can be provided.
