[SERVER-31691] Change streams UUID mismatch uasserts with error code that causes invariant Created: 24/Oct/17 Updated: 30/Oct/23 Resolved: 07/Nov/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | None |
| Fix Version/s: | 3.6.0-rc4 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Nicholas Zolnierz | Assignee: | Bernard Gorman |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||||||||||
| Sprint: | Query 2017-11-13 | ||||||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||||||
| Description |
|
Found while writing the tests for SERVER-31392, the DocumentSourceChangeStream hits a uassert when the UUID in the oplog entry mismatches the UUID in the scoped collection metadata. Returning ErrorCodes::StaleConfig results in a dynamic_cast a few layers up, and eventually an invariant. I am not sure we should hit this uassert at all (a separate discussion), but for sure it doesn't seem like we should be returning this status. |
| Comments |
| Comment by Sam Johnson [ 23/Jul/18 ] | ||
|
Awesome. that fixed the problem. Thank you. | ||
| Comment by Bernard Gorman [ 22/Jul/18 ] | ||
|
sjohnson540, many thanks for providing that additional information. After conferring with some of my colleagues, we believe that you are likely encountering the issue described in To resolve the issue, please try the following: 1. Connect directly to each shard's Primary, and issue the command below to refresh the Primary's in-memory collection state cache:
2. Unfortunately, this will only refresh the Primary's cache, and forceRoutingTableRefresh can only be run on the Primary. To refresh the Secondaries' caches, please restart each Secondary one by one. This should fully resolve the problem without causing any shard downtime. Once this is done, you should not encounter the error again at any point in the future, since all mongoDs will have an up-to-date cache including the UUIDs generated for each collection during the upgrade, and UUIDs will automatically be cached for any future collections created in 3.6 featureCompatibilityVersion. Best regards, | ||
| Comment by Sam Johnson [ 16/Jul/18 ] | ||
|
@bernard.gorman I am not Basit, but I have run into what seems to be the exact same issue, I can answer the questions:
I see this issue is closed. I am on version 3.6.5 and still encountering this issue. | ||
| Comment by Bernard Gorman [ 06/Apr/18 ] | ||
|
Hi basitmustafa! Thank you for bringing this to our attention. You're correct that the fix for this issue was included in the GA release of 3.6.0, but that fix related only to the behaviour of the server when this scenario was encountered. Before this ticket, a UUID mismatch resulted in an invariant failure, which always causes the server to shut down as a protective measure. The fix implemented here was to change this to an assertion, which simply aborts the current operation and returns an error to the client. I'm surprised to hear that you're seeing this exception on every change from one particular collection. The assertion is thrown here in 3.6.3, in the collectionDocumentKeyFields method. This is invoked the first time we encounter a change on the target collection; having extracted the UUID from the change's oplog entry, we then consult our cached collection metadata to determine whether the collection is sharded and, if so, what the shard key is. The only scenario in which I've been able to force our tests to hit this assertion is if (1) a $changeStream is opened on a collection, (2) some writes are performed but no changes have yet been retrieved by the stream, (3) the collection is dropped, recreated (causing it to acquire a new UUID) and sharded, and (4) a getMore is issued on the $changeStream. This pulls the first of the changes, which still has the old UUID, through the pipeline; we attempt to look up the shard key, realise that the UUIDs no longer match, and throw the exception. However, it seems unlikely that such an artificial sequence of events is happening in your cluster. Can you let me know the following:
Best regards, | ||
| Comment by Basit Mustafa [ 27/Mar/18 ] | ||
|
I think we are seeing this behavior in the wild in a production release (although I am not intimately familiar with the code base, we are simply a production user, not yet contributors). On a sharded collection (vtDataWarehouse.VTDataObject) any change events result in an error, but only on this specific collection. Other collections, both sharded and unsharded, things work fine on the same system, same connection, even. Any advice on a workaround before the supposed 3.7 change around the uassert behavior (although this was supposed fixed for 3.6 if I am understanding correctly)? error: MongoError: Collection vtDataWarehouse.VTDataObject UUID differs from UUID on change stream operations Server is 3.6.3 gitVersion 9586e557d54ef70f9ca4b43c26892cd55257e1a5 on ubuntu 1604 x86_64 | ||
| Comment by Bernard Gorman [ 07/Nov/17 ] | ||
|
spencer: having looked into this, I'm not convinced that uasserting here is correct or desirable behaviour. I've pushed a fix to avoid the invariant failure and have opened | ||
| Comment by Githook User [ 07/Nov/17 ] | ||
|
Author: {'name': 'Bernard Gorman', 'username': 'gormanb', 'email': 'bernard.gorman@gmail.com'}Message: | ||
| Comment by Spencer Brody (Inactive) [ 27/Oct/17 ] | ||
|
At this point I think we shouldn't worry about trying to properly trigger a metadata refresh. The only case it can possibly be an issue is after an unsharded collection becomes sharded, which has other known issues right now anyway, and it's very easy to work around the issue (just do any versioned read or write against the shard, which will kick off a metadata refresh). So I think just using a normal uassert code should be fine for now. | ||
| Comment by Spencer Brody (Inactive) [ 27/Oct/17 ] | ||
|
Yes, that is a known issue and should be fine. If you haven't even returned any results yet then there should really be no problem in just re-establishing a brand new cursor. So maybe just the test you were working on just needs to be updated to make sure it has processed at least one result before dropping the collection? Although we also need to stop the uassert from triggering an invariant, but I think we're usasserting at the right time at least. | ||
| Comment by Bernard Gorman [ 27/Oct/17 ] | ||
|
spencer, I think that makes sense. From the code, we do check whether we've already obtained and stored the shard key fields in _documentKeyFields before consulting the metadata. However, that check only happens the first time we pull documents through the transform stage. So if the UUID changes between the time that the cursor is established on the shard and the point where it begins returning results, then I can see how we would hit this uassert. | ||
| Comment by Spencer Brody (Inactive) [ 27/Oct/17 ] | ||
|
I think using the StaleConfig code is wrong, and we probably don't have the right information available to throw StaleConfigException properly. What I want to understand is if we've already saved the shard key onto the cursor and checked the UUID, why are we hitting this uassert at all on the getMore? The idea behind how we extract the shard key for insert entries was that for a sharded changestream, when the mongod processes its very first notification it will check the collection metadata, extract the shard key, confirm the UUID of the collection metadata matches the UUID of the notification, then store the shard key on the cursor. From that point on, for the lifetime of the cursor, the mongod shouldn't have to be checking the CollectionMetadata at all since the cursor already knows the shard key, so even if the collection is dropped it should be able to continue processing entries until the cursor catches up to the drop entry and returns an "invalidate". The only case that should fail is if the cursor is lost and the stream has to be resumed with a ResumeToken, making a new cursor. The new cursor will try to load the CollectionMetadata but won't find any information for the collection with that name. In that case we should return a sane error back to the user. It is a known issue that we can't resume after a collection drop, but so long as we're using the same cursor it should work fine. | ||
| Comment by Bernard Gorman [ 27/Oct/17 ] | ||
|
See dupe ticket | ||
| Comment by Nicholas Zolnierz [ 24/Oct/17 ] | ||
|
As far as I can tell, the collection metadata's UUID is valid at the time of issuing the getNext(), which would be after the collection is sharded. If I create a new change stream cursor after sharding the collection, everything works fine and the collection's UUID is unchanged. The issue seems to be that the UUID associated with the change stream cursor created before sharding the collection is incorrect/stale/!missing. At a first glance, perhaps this check was meant to catch it? | ||
| Comment by Spencer Brody (Inactive) [ 24/Oct/17 ] | ||
|
I has assumed that it was due to there being no UUID in the CollectionMetadata since the config servers don't track any information about unsharded collections, but I'll leave nicholas.zolnierz to confirm. | ||
| Comment by Esha Maharishi (Inactive) [ 24/Oct/17 ] | ||
|
I'm confused why Nick's case causes a UUID mismatch. The UUID shouldn't change when the collection becomes sharded? That sounds like a more serious issue... | ||
| Comment by Spencer Brody (Inactive) [ 24/Oct/17 ] | ||
|
Actually, on further consideration this may not actually be as big a problem in practice as I thought, since I don't think we look for the CollectionMetadata until we process the first actual changeStream notification. By the time there is any actual notifications to return, there must have been at least 1 insert to the collection. Since inserts are versioned, that must have caused the shard to refresh its CollectionMetadata at that point. So that means having a UUID mismatch between a real collection UUID in the CollectionMetadata and the UUID in the oplog entry probably can't happen. So that means the only case that's an issue is the case Nick describes, after an unsharded collection becomes sharded. In that case we will still need to kick off a refresh of the CollectionMetadata somehow. | ||
| Comment by Spencer Brody (Inactive) [ 24/Oct/17 ] | ||
|
I believe the intent of that code was to handle the case where the shard has outdated CollectionMetadata - corresponding to a previous version of the collection - when the changeStream is established. Since changeStreams are unversioned, if the first operation you perform on the newly sharded collection is opening a change stream the metadata won't be refreshed and you'll get a UUID mismatch between the sharding CollectionMetadata and the on-disk catalog. If that happens we'd like to refresh the CollectionMetadata and if the UUIDs now match continue without erroring to the user. | ||
| Comment by Nicholas Zolnierz [ 24/Oct/17 ] | ||
|
esha.maharishi, yeah I changed to error code InvalidUUID for now to unblock the testing, but StaleEpoch works as well. To give some more context on what's causing this, these are the steps: 1. Open change stream cursor on unsharded collection I'm not sure if this is expected to work or not, seems like a perfectly valid use case to me. | ||
| Comment by Esha Maharishi (Inactive) [ 24/Oct/17 ] | ||
|
Ah yes, we should never uassert with a StaleConfig error code. In general, we should instead throw a StaleConfigException, so that it can include the version wanted and version received. However, I think this particular spot should uassert a different error code, probably StaleEpoch, since it's a UUID mismatch, which is equivalent to what an epoch mismatch meant before we had UUIDs. However, note that throwing a StaleEpoch will not cause the shard to refresh. What is supposed to happen if the change streams cursor's UUID does not match what's on the CollectionMetadata? |