[SERVER-66974] Server responding with unexpected error 246 (SnapshotUnavailable) Created: 02/Jun/22 Updated: 27/Oct/23 Resolved: 03/Jun/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jeffrey Yemin | Assignee: | Daniel Gottlieb (Inactive) |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
|
||
| Issue Links: |
|
||||||||||||
| Operating System: | ALL | ||||||||||||
| Participants: | |||||||||||||
| Description |
|
The Java driver has three similar tests (for legacy, sync, and async APIs) that all occasionally fail with the server error:
The tests all do the following:
Expected results: the insert within a transaction succeeds Questions:
Linked BF ticket for Java driver: |
| Comments |
| Comment by Daniel Gottlieb (Inactive) [ 03/Jun/22 ] | ||||||||||||||||||||||||||||||
Unfortunately the server behavior is unrelated to the requests write concern – the ingredients to hit this state is more pervasive for real users than is being suggested. Anything that writes to the oplog creates "holes" which can result in SnapshotUnavailable for other clients. Even if the collection namespaces are unrelated. Enabling causal consistency is the recommended solution for anyone using transactions. | ||||||||||||||||||||||||||||||
| Comment by Jeffrey Yemin [ 03/Jun/22 ] | ||||||||||||||||||||||||||||||
|
That's a great analysis, and it points me towards what the root cause is. While I claimed that there are no concurrent tests, in reality there is a sneaky driver feature that can introduce concurrency: unacknowledged (w: 0) writes. And in fact there is another test in the same class, that uses the same collection, which does in fact execute an unacknowledged write and doesn't have a mechanism to ensure that the server actually applied that write before the end of the test. I think I can work around that now that I understand it better. Feel free to keep this ticket open if there is anything that you think could be improved on the server to improve the user experience (while unacknowledged writes are (hopefully) a rare thing in practice, there may be other situations where normal concurrent writes could encounter this error). | ||||||||||||||||||||||||||||||
| Comment by Daniel Gottlieb (Inactive) [ 03/Jun/22 ] | ||||||||||||||||||||||||||||||
I can see in the logs something concurrent is happening – maybe not another test. At the time of the error there are two connections logging messages. conn4629 which gives the error and conn6231 which is making this collection:
I can see creating that collection lost the race in a write-write conflict:
That namespace is coincidentally the same one conn4629 is performing an insert against. The SnapshotUnavailable log line you have sits between those two log lines. I hypothesize that if we had the data files for this test case we would see that Timestamp(1654118095, 85) does not exist in the oplog. I believe the race that resulted in SnapshotUnavailable was:
To summarize: a concurrent writer from somewhere opened up a hole. Removing the concurrent writer would have fixed this case, but is not a robust solution (for example, the noop writer in the server can cause this problem). Using causal consistency is a robust solution. This would resolve to the following interleaving in the server:
| ||||||||||||||||||||||||||||||
| Comment by Jeffrey Yemin [ 02/Jun/22 ] | ||||||||||||||||||||||||||||||
|
I think this is the log message that is relevant:
It's in .evergreen/orchestration/db/mongod-jonfftty/mongod.log. The others are indeed red herrings due to our extensive use of failpoints for testing. | ||||||||||||||||||||||||||||||
| Comment by Daniel Gottlieb (Inactive) [ 02/Jun/22 ] | ||||||||||||||||||||||||||||||
|
Re: logs, there are multiple mongod logs – not sure which one is relevant. Searching for SnapshotUnavailable gives me something I'm not sure how to interpret:
I assume I'm just seeing a red herring looking at the wrong logs/time period? | ||||||||||||||||||||||||||||||
| Comment by Daniel Gottlieb (Inactive) [ 02/Jun/22 ] | ||||||||||||||||||||||||||||||
Fundamentally the problem arises because we don't have perfect detail on historical information on collection state. Reading at a timestamp (read: "in the past") may discover, for example, that we don't know which indexes existed at that time and were fully built/consistent. This barrier also applies to collection creation.
A quick thing to try is turning on causal consistency. Assuming that still meets the goal of whats being tested. I'm assuming the operation time in the response from the "create collection" request matches the timestamp value in Collection minimum is Timestamp(...). Using causal consistency/afterClusterTime will have the server internally wait, avoiding the SnapshotUnavailable error. I'm surprised this would have started recently. Which might be something worth server understanding and making sure we're behaving as expected. | ||||||||||||||||||||||||||||||
| Comment by Jeffrey Yemin [ 02/Jun/22 ] | ||||||||||||||||||||||||||||||
|
Also, if you follow the links from the JAVA BF ticket, you'll find the full server logs attached to the task. | ||||||||||||||||||||||||||||||
| Comment by Jeffrey Yemin [ 02/Jun/22 ] | ||||||||||||||||||||||||||||||
|
Here are (roughly) the raw commands (obtained from driver logs):
There are no concurrent tests running. | ||||||||||||||||||||||||||||||
| Comment by Daniel Gottlieb (Inactive) [ 02/Jun/22 ] | ||||||||||||||||||||||||||||||
|
Hey Jeff, mind adding details around using causal consistency and/or any write concern/read concerns send along with each of the requests? Though I think RC is not technically relevant. Also can you confirm if there are other concurrent test threads performing writes against the mongod? |