[SERVER-43889] Distinguish between a retryable write and a transaction when failing a command Created: 03/Oct/19  Updated: 29/Oct/23  Resolved: 18/Feb/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.2.6, 4.3.4, 4.0.19

Type: Improvement Priority: Major - P3
Reporter: Marek Kresnicki Assignee: Ali Mir
Resolution: Fixed Votes: 0
Labels: former-quick-wins
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
is related to SERVER-35411 Add abortCause argument to abortActiv... Backlog
is related to DRIVERS-699 Raise an actionable error message whe... Implementing
Backwards Compatibility: Minor Change
Backport Requested:
v4.2, v4.0
Sprint: Repl 2020-02-10, Repl 2020-02-24
Participants:

 Description   

Currently, most txnNumber errors refer only to transactions, even if the error pertains to a retryable write. This can create a poor user experience for someone who is only using retryable writes.

We should audit these types of errors on the server and determine whether the command was a part of a retryable write or a transaction. Depending on the answer, we can return one of the following errors:

"Cannot start transaction X on session <UUID> because a newer transaction or retryable write with txnNumber Y has already started on this session."

"Retryable write with txnNumber X is prohibited on session <UUID> because a newer transaction or retryable write with txnNumber Y has already started on this session."
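As a rough illustration of the proposal, the choice between the two messages could look like the sketch below. It is not taken from the server source: `OperationKind`, `TxnNumberErrors`, and `StaleTxnNumberMessage` are hypothetical names, and the sketch assumes the server already knows whether the rejected command belongs to a transaction or to a retryable write.

```csharp
using System;

// Hypothetical model: these names are illustrative and do not come from the server code.
public enum OperationKind { RetryableWrite, Transaction }

public static class TxnNumberErrors
{
    // Builds the message for a command whose txnNumber is older than the
    // newest txnNumber already seen on the session.
    public static string StaleTxnNumberMessage(
        OperationKind kind, long txnNumber, Guid sessionId, long activeTxnNumber)
    {
        // Shared tail of both messages quoted in the description.
        string reason =
            $"because a newer transaction or retryable write with txnNumber {activeTxnNumber} " +
            "has already started on this session.";

        return kind == OperationKind.Transaction
            ? $"Cannot start transaction {txnNumber} on session {sessionId} {reason}"
            : $"Retryable write with txnNumber {txnNumber} is prohibited on session {sessionId} {reason}";
    }
}
```

With this shape, a user who only issues retryable writes would see the second message and never a reference to starting a transaction.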

ORIGINAL POST:

Code that reproduces the issue:

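        // Repro: 100 upserts run concurrently while sharing one explicit session.
        // Requires: using System.Linq; using System.Threading.Tasks;
        //           using MongoDB.Bson; using MongoDB.Driver;
        private static async Task TestSingleSession(MongoClient client)
        {
            var col = client.GetDatabase("_test").GetCollection<BsonDocument>("concurrent");
            using (var session = await client.StartSessionAsync())
            {
                // ToArray() forces the lazy Select, so all 100 upserts are started
                // before Task.WhenAll awaits them.
                var ops = Enumerable.Range(0, 100)
                    .Select(async e =>
                    {
                        await col.ReplaceOneAsync(session, Builders<BsonDocument>.Filter.Eq("_id", e),
                            new BsonDocument
                            {
                                { "_id", e },
                                { "data", $"someData_{e}" },
                            },
                            new UpdateOptions { IsUpsert = true });
                    })
                    .ToArray();
                await Task.WhenAll(ops);
            }
        }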

 

Running the code above throws this exception:

MongoCommandException: Command update failed: Cannot start transaction 1 on session c4fd4081-e0f0-40d6-84fc-cdfeccc74e3b - hf9LJrhq2lfp667gnURrBVE7MCUS1NJZYDmWlqfKWl0= because a newer transaction 3 has already started..

This is misleading as there's no explicit transaction in progress.

I know now that we should either

  • use the ReplaceOneAsync overload without a session, or
  • create a new session per call to the ReplaceOneAsync overload that takes a session
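A minimal sketch of the second option, assuming the same collection and documents as the repro above. `SessionPerCallExample` and `UpsertAllAsync` are hypothetical names, and `UpdateOptions` is kept only because the repro uses it; newer driver releases may expect `ReplaceOptions` instead.

```csharp
using System.Linq;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

// Hypothetical helper: the class and method names are illustrative only.
public static class SessionPerCallExample
{
    public static async Task UpsertAllAsync(IMongoClient client)
    {
        var col = client.GetDatabase("_test").GetCollection<BsonDocument>("concurrent");

        var ops = Enumerable.Range(0, 100)
            .Select(async e =>
            {
                // Each concurrent upsert gets its own session, so no two
                // in-flight writes share a session.
                using (var session = await client.StartSessionAsync())
                {
                    await col.ReplaceOneAsync(session, Builders<BsonDocument>.Filter.Eq("_id", e),
                        new BsonDocument
                        {
                            { "_id", e },
                            { "data", $"someData_{e}" },
                        },
                        new UpdateOptions { IsUpsert = true });
                }
            })
            .ToArray();

        await Task.WhenAll(ops);
    }
}
```

The only change from the repro is that `StartSessionAsync` moves inside the lambda, trading one shared session for 100 short-lived ones.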

 

Background: I encountered this issue in production code where one method is used both with and without a transaction. Our solution was simply to create an intermediate overloaded method that creates a session for you and passes it into the method that accepts the session.



 Comments   
Comment by Githook User [ 07/Apr/20 ]

Author: Ali Mir (ali-mir) <ali.mir@mongodb.com>

Message: SERVER-43889 Distinguish between retryable write and transaction when failing a command

(cherry picked from commit eb284b042c71edf0eac445d3ceb79f7fdeabc5d1)
Branch: v4.0
https://github.com/mongodb/mongo/commit/95626d0eca8325488d354221b1ec5b1019743fa1

Comment by Githook User [ 31/Mar/20 ]

Author: Ali Mir (ali-mir) <ali.mir@mongodb.com>

Message: SERVER-43889 Distinguish between retryable write and transaction when failing a command

(cherry picked from commit eb284b042c71edf0eac445d3ceb79f7fdeabc5d1)
Branch: v4.2
https://github.com/mongodb/mongo/commit/89a34f9f823a1efccef072cea0cc1c12fbfcfeea

Comment by Githook User [ 18/Feb/20 ]

Author: Ali Mir (ali-mir) <ali.mir@mongodb.com>

Message: SERVER-43889 Distinguish between retryable write and transaction when failing a command
Branch: master
https://github.com/mongodb/mongo/commit/eb284b042c71edf0eac445d3ceb79f7fdeabc5d1

Comment by Pavithra Vetriselvan [ 09/Dec/19 ]

Hi eric.sedor, thank you for fixing the comments!

I do agree that we can provide a better user experience here but think it will require a joint effort between the Server and Drivers teams. To aid discussion, I am moving this conversation to email.

Comment by Eric Sedor [ 20/Nov/19 ]

Hi pavithra.vetriselvan,

Unfortunately I don't think that will help with the confusion this message causes... Is there any way to indicate that the error is happening because a write is being retried? Ideally we could distinguish between

  • the server transparently retrying a write in an unsuccessful/non-recoverable way
    and
  • a user making a mistake when explicitly using transactions in a way they should not

Comment by Pavithra Vetriselvan [ 19/Nov/19 ]

eric.sedor, do you think changing the error message to what I suggested above is an adequate fix here?

Comment by Pavithra Vetriselvan [ 11/Nov/19 ]

Hi eric.sedor, I think SERVER-35411 focuses more on remembering why a transaction aborts so that we can more reliably deliver a ‘TransientTransactionError’ label to drivers.

We throw with the error mentioned in this ticket because we are trying to run an operation with a lower txnNumber than the current txnNumber. When running retryable writes concurrently on the same session, we can imagine a scenario where write 1 (txnNumber = 1) fails, write 2 (txnNumber = 2) starts and overwrites the txnNumber for the session, and write 1 (txnNumber = 1) retries.
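The scenario can be replayed with a small model. This is a simplified sketch, not the server implementation: `SessionModel`, `Accept`, and `InterleavingDemo` are hypothetical names, and the model assumes the only rule is that a command with a txnNumber lower than the session's current one is rejected.

```csharp
using System;

// Hypothetical, simplified model of per-session txnNumber tracking.
public sealed class SessionModel
{
    // Newest txnNumber seen on this session; -1 means none yet.
    public long ActiveTxnNumber { get; private set; } = -1;

    // Accepts a command unless its txnNumber is older than the newest one seen.
    public void Accept(long txnNumber)
    {
        if (txnNumber < ActiveTxnNumber)
        {
            throw new InvalidOperationException(
                $"txnNumber {txnNumber} is older than active txnNumber {ActiveTxnNumber}");
        }
        ActiveTxnNumber = txnNumber;
    }
}

public static class InterleavingDemo
{
    // Returns true when the retry of write 1 is rejected.
    public static bool RetryOfWrite1IsRejected()
    {
        var session = new SessionModel();
        session.Accept(1); // write 1, first attempt (assumed to fail for an unrelated reason)
        session.Accept(2); // write 2 starts and advances the session's txnNumber
        try
        {
            session.Accept(1); // write 1 retries with its original txnNumber
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }
}
```

No transaction appears anywhere in this sequence, which is why an error message that mentions only transactions is confusing.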

Since we're explicitly checking the txnNumber, which is used for retryable writes and transactions, one solution could be changing the error message to:
"Cannot use txnNumber 1 on session <sessionID> because a newer txnNumber 3 is currently being used."

Comment by Eric Sedor [ 25/Oct/19 ]

Does SERVER-35411 help us be more specific here?

Comment by Eric Sedor [ 10/Oct/19 ]

Hi marek.kresnicki@gmail.com,

We wanted to let you know we are looking into this. Thanks in advance for your patience.

Comment by Marek Kresnicki [ 03/Oct/19 ]

Forgot to mention that this has been working fine in version 2.8.1
