[SERVER-77685] Server can return CollectionUUIDMismatch with actual collection null if collection exists Created: 01/Jun/23 Updated: 29/Oct/23 Resolved: 06/Jul/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 7.1.0-rc0 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Rohan Sharan | Assignee: | Jordi Olivares Provencio |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Assigned Teams: | Storage Execution |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v7.0 |
| Sprint: | Execution EMEA Team 2023-06-26, Execution EMEA Team 2023-07-10 |
| Participants: |
| Description |
|
In mongosync, I am seeing that in some cases, on v7.0+ sharded clusters, CollectionUUIDMismatch errors are being returned like the following:
In these errors the actualCollection field is null, which should only happen if the collection hasn't been created yet or has been dropped. However, the collection with this UUID should exist on the destination cluster, and no drop should be happening for the collection with that UUID. The following rough order of events happens in the BF I am seeing:
Note that steps 2, 3 and 4 don't necessarily happen one after the other (they can happen at different relative times because of the parallelism in mongosync). Because of step 3, we expect some CollectionUUIDMismatch errors, but in this scenario I expect the server either not to throw one or to report the actual collection as mongosync.tmp.<uuid>, not null. As a result, mongosync does not apply some CRUD events (typically inserts), leaving the source and destination clusters inconsistent. This is blocking mongosync v7.0 support and is causing some BFs on our waterfall (because we run some integration tests on latest). |
| Comments |
| Comment by Rohan Sharan [ 12/Jul/23 ] |
|
It seems that the problem was actually just a new test added in v7.0 that is exposing a pre-existing limitation of mongosync. Sorry for the confusion. |
| Comment by Rohan Sharan [ 11/Jul/23 ] |
|
I need to keep looking into this tomorrow. I have not yet figured out what is going on, but let me message you when I have more of an idea. |
| Comment by Jordi Serra Torrens [ 11/Jul/23 ] |
|
Hi rohan.sharan@mongodb.com
appears to be expected, given the sequence of DDL operations performed by mongosync.
Given the above sequence of events, this does not look like a repro of the bug jordi.olivares-provencio@mongodb.com fixed in this ticket. Something else must be going on. rohan.sharan@mongodb.com, could you check if the above actions are consistent with what mongosync should be doing? |
| Comment by Rohan Sharan [ 10/Jul/23 ] |
|
I think I am seeing an instance of this recurring: It is happening on |
| Comment by Githook User [ 05/Jul/23 ] |
|
Author: Jordi Olivares Provencio <jordi.olivares-provencio@mongodb.com> (username: jordiolivares)
Message: |
| Comment by Jordi Olivares Provencio [ 28/Jun/23 ] |
|
We're meeting tomorrow during triage, where we'll decide on it. In theory, if all goes well and the ticket gets merged this sprint, the backport will follow quickly. However, we've identified that the logic is also flawed in the lock-free reads path, which uses different checks. As a result, it's taking a bit longer than expected, since the logic has now also diverged between master and 7.0. |
| Comment by Rohan Sharan [ 28/Jun/23 ] |
|
Do we have a timeline for when this will be fixed and backported to v7.0? |
| Comment by Jordi Olivares Provencio [ 23/Jun/23 ] |
|
I'm starting to agree with gregory.noma@mongodb.com on this. Previously (6.0), AutoGetCollection didn't perform a UUID check itself, leaving the check to happen after the collection had been acquired and all the sharding checks had run. As of |
| Comment by Jordi Olivares Provencio [ 23/Jun/23 ] |
|
rohan.sharan@mongodb.com Sorry for the delay; the team got pulled into other tickets, but we are now getting to this issue. |
| Comment by Rohan Sharan [ 22/Jun/23 ] |
|
Is there any update here? This is still blocking mongosync 7.0 support. |
| Comment by Gregory Noma [ 13/Jun/23 ] |
|
We may just need to swap the order of the collection UUID mismatch check and the shard version check in AutoGetCollection. |
| Comment by Rohan Sharan [ 01/Jun/23 ] |
|
When I talked to Gregory, he mentioned that this might be related to the fact that the sharding API never guaranteed that the primary shard of a sharded cluster would have info on all collections in the cluster (something may have changed in v7.0 that made this no longer the case). |
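
The check-ordering hypothesis raised in the comments above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the server's code: `ToyShard`, `StaleShardVersion`, `acquire`, and `outcome` are hypothetical names, and the model assumes a shard whose local catalog has no entry for the requested UUID while the request also carries an out-of-date shard version. Under that assumption, running the UUID check first surfaces a mismatch whose actual collection is null, whereas running the shard version check first surfaces a routing error instead.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class StaleShardVersion(Exception):
    """Hypothetical stand-in for a routing error that a router would retry."""


class CollectionUUIDMismatch(Exception):
    """Toy version of the error; actual_collection is None when the UUID is unknown."""

    def __init__(self, expected: str, actual: Optional[str]):
        super().__init__(f"expected={expected!r} actual={actual!r}")
        self.expected_collection = expected
        self.actual_collection = actual


@dataclass
class ToyShard:
    # Local catalog (assumption): collection UUID -> namespace known to this shard.
    catalog: Dict[str, str] = field(default_factory=dict)
    shard_version: int = 1

    def check_uuid(self, ns: str, uuid: str) -> None:
        actual = self.catalog.get(uuid)  # None if this shard has never seen the UUID
        if actual != ns:
            raise CollectionUUIDMismatch(ns, actual)

    def check_shard_version(self, received_version: int) -> None:
        if received_version != self.shard_version:
            raise StaleShardVersion()

    def acquire(self, ns: str, uuid: str, received_version: int,
                uuid_check_first: bool) -> str:
        # The only difference between the two modes is the order of the checks.
        if uuid_check_first:
            self.check_uuid(ns, uuid)
            self.check_shard_version(received_version)
        else:
            self.check_shard_version(received_version)
            self.check_uuid(ns, uuid)
        return ns


def outcome(uuid_check_first: bool) -> str:
    # Shard with an empty local catalog; the request carries a stale version.
    shard = ToyShard(catalog={}, shard_version=2)
    try:
        shard.acquire("db.coll", "uuid-1", received_version=1,
                      uuid_check_first=uuid_check_first)
    except CollectionUUIDMismatch as exc:
        return f"CollectionUUIDMismatch(actual={exc.actual_collection})"
    except StaleShardVersion:
        return "StaleShardVersion"
    return "ok"
```

In this model, `outcome(uuid_check_first=True)` reports a mismatch with a null actual collection even though the collection may exist elsewhere in the cluster, while `outcome(uuid_check_first=False)` reports the routing error first. Whether this matches the real cause is not established by the ticket; the later comments attribute at least one recurrence to a mongosync limitation.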