[SERVER-77685] Server can return CollectionUUIDMismatch with actual collection null if collection exists Created: 01/Jun/23  Updated: 29/Oct/23  Resolved: 06/Jul/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0

Type: Task Priority: Major - P3
Reporter: Rohan Sharan Assignee: Jordi Olivares Provencio
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
backported by SERVER-78663 [7.0] Server can return CollectionUUI... Closed
Depends
is depended on by SERVER-78375 CollectionUUIDMismatch check occurs b... Closed
Duplicate
is duplicated by SERVER-78375 CollectionUUIDMismatch check occurs b... Closed
Problem/Incident
Related
Assigned Teams:
Storage Execution
Backwards Compatibility: Fully Compatible
Backport Requested:
v7.0
Sprint: Execution EMEA Team 2023-06-26, Execution EMEA Team 2023-07-10
Participants:

 Description   

In mongosync, I am seeing that in some cases, on v7.0+ and in sharded clusters, that Collection UUID errors are being returned like the following:

[j1:cl1:s1:prim] | 2023-05-31T20:56:32.038+00:00 D1 ASSERT   23074   [conn96] "User assertion","attr":
{
    error: "CollectionUUIDMismatch{ db: \"test\", collectionUUID: UUID(\"86ac3d10-3d4b-4951-927c-156c87640af7\"), expectedCollection: \"use_cases\", actualCollection: null }: Collection UUID does not match that specified",
    file: "src/mongo/db/catalog/collection_uuid_mismatch.cpp",
    line: 70
}

In these errors the actualCollection field is null, which should only happen if the collection hasn't been created yet, or has been dropped. However, the collection with this UUID should exist on the destination cluster and there is no drop that should be happening for the collection with that UUID.

The following rough order of events happen in the BF I am seeing:

  1. the previous collection use_cases is dropped
  2. a temporary collection mongosync.tmp.<uuid> is created
  3. the temporary collection is renamed to use_cases
  4. a document tries to be inserted by mongosync into the new collection use_cases and fails with the CollectionUUIDMismatch error

Note that 2, 3 and 4 don't necessarily happen one after the other (they can happen at different relative times because of the parallelism in mongosync). Because of step 3, we expect some CollectionUUIDMismatch errors, but in this scenario, I either expect it not to throw one, or I expect it to have the actual collection field be mongosync.tmp.<uuid>, and not null. The end result of this is that mongosync ends up not applying some CRUD events (typically inserts) and has data inconsistency between the source and destination clusters.

The logs are here: https://parsley.mongodb.com/resmoke/98b363f8d81a6e1ffbe259c78c5b19f9/all?bookmarks=0,1089885,1090430,1090446,1090603,1090721,1090805,1091107,1193110&highlights=CollectionUUIDMismatch.%2Ause_cases,use_cases&shareLine=1090721

This is blocking mongosync v7.0 support, and is causing some BFs on our waterfall (because we run some integration tests on latest).



 Comments   
Comment by Rohan Sharan [ 12/Jul/23 ]

It seems that the problem was actually just a new test added in v7.0 that is exposing a pre-existing limitation of mongosync. Sorry for the confusion.

Comment by Rohan Sharan [ 11/Jul/23 ]

I need to keep looking into this tomorrow. I have not yet figured out what is going on, but let me message you when I have more of an idea.

Comment by Jordi Serra Torrens [ 11/Jul/23 ]

Hi rohan.sharan@mongodb.com
I've been staring at the logs of the latest repro you posted. The CollectionUUIDMismatch with "actualCollection: null":

j5:cl1:s0:prim] | 2023-07-10T15:25:31.281+00:00 D1 ASSERT   23074   [conn80] "User assertion","attr":{"error":"CollectionUUIDMismatch{ db: \"test63_fsmdb0\", collectionUUID: UUID(\"d262857c-09cb-4704-a502-1388ef411142\"), expectedCollection: \"unsharded_coll_0\", actualCollection: null }: Collection UUID does not match that specified","file":"src/mongo/db/catalog/collection_uuid_mismatch.cpp","line":70}

appears to be expected, given the sequence of DDL operations performed by mongosync.

// Create collection 'test63_fsmdb0.mongosync.tmp.5911a15b-4a13-45da-9402-fc96746bf334'. It gets uuid 'd262857c-09cb-4704-a502-1388ef411142'
[j5:cl1:s0:prim] | 2023-07-10T15:25:05.341+00:00 I  STORAGE  20320   [conn467] "createCollection","attr":{"namespace":"test63_fsmdb0.mongosync.tmp.5911a15b-4a13-45da-9402-fc96746bf334","uuidDisposition":"generated","uuid":{"uuid":{"$uuid":"d262857c-09cb-4704-a502-1388ef411142"}},"options":{}}
 
// Rename 'test63_fsmdb0.mongosync.tmp.5911a15b-4a13-45da-9402-fc96746bf334' to 'test63_fsmdb0.unsharded_coll_0'
[j5:cl1:s0] | 2023-07-10T15:25:23.219+00:00 I  COMMAND  51803   [conn999] "Slow query","attr":{"type":"command","ns":"test63_fsmdb0.mongosync.tmp.5911a15b-4a13-45da-9402-fc96746bf334","appName":"mongosync, v: 1.4.1-43-g6a8dcfd3, i: shard-rs0, cl: dst, cm: 6a8dcfd3, g: go1.19.10, cp: gc, p: 00000000-0000-0000-0000-00000000","command":{"renameCollection":"test63_fsmdb0.mongosync.tmp.5911a15b-4a13-45da-9402-fc96746bf334","to":"test63_fsmdb0.unsharded_coll_0","dropTarget":false,"collectionUUID":{"$uuid":"d262857c-09cb-4704-a502-1388ef411142"},"writeConcern":{"w":"majority","j":true,"wtimeout":120000},"lsid":{"id":{"$uuid":"8738e0f2-1cfe-4397-82c2-38d288b5271a"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1689002722,"i":167}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"maxTimeMS":300000,"$db":"admin"},"numYields":0,"reslen":163,"readConcern":{"level":"local","provenance":"implicitDefault"},"cpuNanos":299726,"remote":"127.0.0.1:48968","protocol":"op_msg","durationMillis":228}
 
// Rename 'test63_fsmdb0.unsharded_coll_0' to 'test63_fsmdb0.mongosync.tmp.d262857c-09cb-4704-a502-1388ef411142'
[j5:cl1:s1] | 2023-07-10T15:25:25.494+00:00 I  COMMAND  51803   [conn904] "Slow query","attr":{"type":"command","ns":"test63_fsmdb0.unsharded_coll_0","appName":"mongosync, v: 1.4.1-43-g6a8dcfd3, i: shard-rs0, cl: dst, cm: 6a8dcfd3, g: go1.19.10, cp: gc, p: 00000000-0000-0000-0000-00000000","command":{"renameCollection":"test63_fsmdb0.unsharded_coll_0","to":"test63_fsmdb0.mongosync.tmp.d262857c-09cb-4704-a502-1388ef411142","dropTarget":false,"collectionUUID":{"$uuid":"d262857c-09cb-4704-a502-1388ef411142"},"writeConcern":{"w":"majority","j":true,"wtimeout":120000},"lsid":{"id":{"$uuid":"8738e0f2-1cfe-4397-82c2-38d288b5271a"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1689002725,"i":106}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"maxTimeMS":300000,"$db":"admin"},"numYields":0,"reslen":163,"readConcern":{"level":"local","provenance":"implicitDefault"},"cpuNanos":295560,"remote":"127.0.0.1:39128","protocol":"op_msg","durationMillis":140}
 
// Drop 'test63_fsmdb0.mongosync.tmp.d262857c-09cb-4704-a502-1388ef411142' (which had uuid d262857c-09cb-4704-a502-1388ef411142)
[j5:cl1:s0] | 2023-07-10T15:25:26.573+00:00 I  COMMAND  51803   [conn999] "Slow query","attr":{"type":"command","ns":"test63_fsmdb0.mongosync.tmp.d262857c-09cb-4704-a502-1388ef411142","appName":"mongosync, v: 1.4.1-43-g6a8dcfd3, i: shard-rs0, cl: dst, cm: 6a8dcfd3, g: go1.19.10, cp: gc, p: 00000000-0000-0000-0000-00000000","command":{"drop":"mongosync.tmp.d262857c-09cb-4704-a502-1388ef411142","collectionUUID":{"$uuid":"d262857c-09cb-4704-a502-1388ef411142"},"writeConcern":{"w":"majority","j":true,"wtimeout":120000},"lsid":{"id":{"$uuid":"8738e0f2-1cfe-4397-82c2-38d288b5271a"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1689002726,"i":140}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"maxTimeMS":300000,"$db":"test63_fsmdb0"},"numYields":0,"reslen":163,"readConcern":{"level":"local","provenance":"implicitDefault"},"cpuNanos":295124,"remote":"127.0.0.1:48968","protocol":"op_msg","remoteOpWaitMillis":62,"durationMillis":62}
 
// From this point, there's no collection with uuid 'd262857c-09cb-4704-a502-1388ef411142' anymore. So it is expected that commands with expectedUUID=d262857c-09cb-4704-a502-1388ef411142 will fail with CollectionUUIDMismatch(...actualCollection=null).
[j5:cl1:s0:prim] | 2023-07-10T15:25:31.281+00:00 D1 ASSERT   23074   [conn80] "User assertion","attr":{"error":"CollectionUUIDMismatch{ db: \"test63_fsmdb0\", collectionUUID: UUID(\"d262857c-09cb-4704-a502-1388ef411142\"), expectedCollection: \"unsharded_coll_0\", actualCollection: null }: Collection UUID does not match that specified","file":"src/mongo/db/catalog/collection_uuid_mismatch.cpp","line":70}

Given the above sequence of events, this does not look like a repro of the bug jordi.olivares-provencio@mongodb.com fixed in this ticket. Something else must be going on. rohan.sharan@mongodb.com, could you check if the above actions are consistent with what mongosync should be doing?

Comment by Rohan Sharan [ 10/Jul/23 ]

I think I am seeing an instance of this reoccuring: 

https://parsley.mongodb.com/resmoke/f5b7d0b751acd20c94aa6ee9fb3e4d9d/all?bookmarks=0,2267132,2279719,2334854&filters=000unsharded_coll_0.%2AactualCollection%253A%2520null,000create.%2Atest63_fsmdb0.%2Aunsharded_coll_0,000test63_fsmdb0.%2Aunsharded_coll_0.%2ADDL%2520applier,000DDL%2520applier%2520received%2520drop%2520event%2520for%2520a%2520collection,1009273a981-84cd-4666-a363-012762cbfa38 

It is happening on 
"version":"7.0.0-rc6-48-ga732e55"
Which is after the backport 5c2d7 was merged.

Comment by Githook User [ 05/Jul/23 ]

Author:

{'name': 'Jordi Olivares Provencio', 'email': 'jordi.olivares-provencio@mongodb.com', 'username': 'jordiolivares'}

Message: SERVER-77685 Move collection uuid mismatch check after shard check
Branch: master
https://github.com/mongodb/mongo/commit/5ba0cfd0f6dc81c7c3d9083d85f2b0bc6337d65d

Comment by Jordi Olivares Provencio [ 28/Jun/23 ]

We're meeting tomorrow during triage where we'll decide on it. In theory if it all goes well and the ticket gets merged this sprint the backport will be performed quickly after. However, we've identified that the logic is also flawed with the lock-free reads path which uses different checks. As a result it's taking a bit longer than expected since the logic has also diverged now between master and 7.0.

Comment by Rohan Sharan [ 28/Jun/23 ]

Do we have a timeline for when this will be fixed and backported to v7.0?

Comment by Jordi Olivares Provencio [ 23/Jun/23 ]

I'm starting to agree with gregory.noma@mongodb.com on this. Previously (6.0) the AutoGetCollection didn't perform a UUID check, leaving it to happen after acquiring it and all the sharding checks have occurred.

As of SERVER-68337 however, this check is performed in the AutoGetter before the other sharding checks. As this occurs in 7.0 I'll request a backport in order to fix this.

Comment by Jordi Olivares Provencio [ 23/Jun/23 ]

rohan.sharan@mongodb.com Sorry for the delay, the team got pulled into other tickets but we are now getting to this issue.

Comment by Rohan Sharan [ 22/Jun/23 ]

Is there any update here? This is still blocking mongosync 7.0 support.

Comment by Gregory Noma [ 13/Jun/23 ]

We may just need to swap the order of the collection UUID mismatch check and the shard version check in AutoGetCollection

Comment by Rohan Sharan [ 01/Jun/23 ]

In talking to Gregory, he mentioned that this might be related to how the sharding API never guaranteed that the primary shard of a sharded cluster would have info on all collections in the cluster (something may have changed that made this no longer the case in v7.0).

Generated at Thu Feb 08 06:36:17 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.