[SERVER-41532] Mongos can fail with "a non-retryable snapshot error" and "Unable to read from a snapshot due to pending collection catalog changes" Created: 05/Jun/19 Updated: 22/Feb/22 Resolved: 16/Apr/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Robert Stam | Assignee: | Randolph Tan |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | sharding-wfbf-day | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||||||||||
| Steps To Reproduce: | I think the repro is:
Note: part of the scenario might be creating the collection on one mongos and executing the transaction on a different mongos, but I'm not sure. |
||||||||||||||||||||||||||||||||
| Sprint: | Execution Team 2020-01-27, Execution Team 2020-02-10, Execution Team 2020-03-09, Sharding 2020-04-20 | ||||||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||||||
| Description |
|
The full error message is:
The error message suggests retrying the operation, and in fact the operation DOES succeed after a retry. However, this seems like an unnecessary error to return to the application. The server should just block until the pending collection catalog changes have been completed and then execute the operation without error. |
| Comments |
| Comment by Tiansheng Ren [ 27/Mar/21 ] |
|
I also have this problem, I first create a table, then insert a data, sleep 1s, open the transaction transaction insert data is still wrong
mongd single node 4.2.6 replica-set |
| Comment by Dianna Hohensee (Inactive) [ 26/Feb/20 ] |
|
Looking at the v4.3 and v4.2 code, we consistently return a SnapshotUnavailable error on the mongod whenever this happens. The mongod cannot retry on its own with a newer timestamp because it's a snapshot readConcern at a time selected by the mongos for all shards. Part of the msg included in the ticket description says "non-retryable snapshot error". I'm unsure whether this is the current state of the code, or old state of the code. Oleg's most recent comment with a log message does not have "non-retryable snapshot error". Options: The in-memory catalog is not multi-versioned, so we cannot do anything right now to stop the error from happening when a read comes in at a time earlier than the most modification of that collection. Since the recent modification of the collection was creating it, even in a multi-version in-memory catalog world, the insert is not going to find a collection at that point-in-time, and I doubt we'll create the collection at that time since the collection already exists (hypothetical hand-wavy...). |
| Comment by Dianna Hohensee (Inactive) [ 24/Feb/20 ] |
|
oleg.pudeyev My first thought is that the cluster time of the mongos used to route the insert request is stale relative to the mongod targeted for the insert. I believe 'snapshot' read concern means that the mongos picks a time at which to read, which is the mongos' cluster time, and then the request forwarded to the mongod will be to open a snapshot at that selected time. So I can see the test failing if the create goes through mongos1, pushing the cluster time forward for mongos1 and mongod1; and then the insert goes through mongos2, getting an old cluster time, and reaching mongod1 with a too-old timestamp. |
| Comment by Githook User [ 22/Jul/19 ] |
|
Author: {'name': 'Oleg Pudeyev', 'email': '39304720+p-mongo@users.noreply.github.com', 'username': 'p-mongo'}Message:
|
| Comment by Githook User [ 22/Jul/19 ] |
|
Author: {'name': 'Oleg Pudeyev', 'email': '39304720+p-mongo@users.noreply.github.com', 'username': 'p-mongo'}Message:
|
| Comment by Oleg Pudeyev (Inactive) [ 19/Jul/19 ] |
|
I am getting this error in about 50% of the test runs of this test on 4.2 and 4.3 sharded: https://github.com/mongodb/specifications/blob/master/source/transactions-convenient-api/tests/transaction-options.yml Note that this test's set up arranges for an empty collection (we drop and create the collection with w: majority), and the operations are inserts. On the third or fourth test of a single run, I get this: #<Mongo::Error::OperationFailure: Transaction 0560f3da-5f70-4582-927a-890c45fa2126:1 was aborted on statement 0 due to: a non-retryable snapshot error :: caused by :: Encountered error from localhost:14442 during a transaction :: caused by :: Unable to read from a snapshot due to pending collection catalog changes; please retry the operation. Snapshot timestamp is Timestamp(1563576172, 39). Collection minimum is Timestamp(1563576172, 47) (246)> In about half the cases, this error doesn't happen and the entire run succeeds. I tried, as a workaround, to insert and delete some data using the transaction api helper (which does retry), thus forcing the error to happen during the setup phase where the operation can be retried, but I have not succeeded - the inserts in the setup never fail but the actual insert in the test still fails (half the time). I also tried to perform the insert & delete on each mongos separately. This did not appear to change the situation. I then tried to perform 5 inserts & deletes on each mongos prior to running the tests. This seems to bring the failure rate close to 0% locally but doesn't affect Evergreen's failure rate significantly. Note also that 14442 is a mongod node behind one of the mongoses that the driver communicates with (this is a sharded cluster of replica sets). Example output with the failure: https://gist.github.com/p-mongo/a2f09d862ef82e6e538ee8fe4d1db5e5 Locally the entire test file succeeds about half the time, but it looks like in Evergreen the failure rate is close to 100%. |
| Comment by Eric Milkie [ 05/Jul/19 ] |
|
That link points to a line that says "Encountered error from localhost:27217 during a transaction :: caused by :: Read timestamp Timestamp(1562321126, 31) is older than the oldest available timestamp..". That message is not the error that this ticket is concerned with. In order for the problem to be the same as this ticket, the error text would need to say "Encountered error from localhost:27019 during a transaction :: caused by :: Unable to read from a snapshot due to pending collection catalog changes; please retry the operation.". |
| Comment by Dmitry Lukyanov (Inactive) [ 05/Jul/19 ] |
|
milkie, I've also faced exactly the same error: https://evergreen.mongodb.com/task_log_raw/dot_net_driver_unsecure_tests__version~latest_os~windows_64_topology~sharded_cluster_auth~noauth_ssl~nossl_test_patch_385bb3b0be82ba8c053580acf86d327738a13a37_5d1e3063d6d80a40ea74adcf_19_07_04_16_59_16/4?type=T#L2388 We think that these are the same issues which are just thrown from a slightly different place in the test. |
| Comment by Eric Milkie [ 05/Jul/19 ] |
|
Reclosing this ticket. Dimtry, the error message I see appearing in your patch build does not match the message for this ticket. It appears that |
| Comment by Dmitry Lukyanov (Inactive) [ 04/Jul/19 ] |
|
Hello milkie, probaby I have the same situation: https://evergreen.mongodb.com/task_log_raw/dot_net_driver_secure_tests__version~latest_os~windows_64_topology~sharded_cluster_auth~auth_ssl~ssl_test_patch_385bb3b0be82ba8c053580acf86d327738a13a37_5d1d06492fbabe1ad8054e0c_19_07_03_19_47_21/1?type=T#L2304 Even though rstam's workaround has been applied. I'm continuing an investigation of it. Probably I will reopen this ticket. |
| Comment by Eric Milkie [ 03/Jul/19 ] |
|
Ok – please reopen this if it starts becoming a problem again. |
| Comment by Robert Stam [ 03/Jul/19 ] |
|
It hasn't failed for me recently. This was happening to me on a branch I was working on, and I think since I rebased on master it stopped happening, but I'm not sure what exactly from all the changes I picked up from master might have changed things. |
| Comment by Eric Milkie [ 28/Jun/19 ] |
|
Is the test suite still failing? There are various things we can try to avoid having this particular snapshot error occur for driver tests. |
| Comment by Robert Stam [ 06/Jun/19 ] |
|
Currently the C# driver test suite does not pass due to this server issue. Is there any workaround suggested? |