[CSHARP-3301] Multi-thread Transaction Failure for Sharded Cluster Created: 07/Jan/21  Updated: 27/Oct/23  Resolved: 11/Jan/21

Status: Closed
Project: C# Driver
Component/s: Transactions
Affects Version/s: 2.11.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: fini sky Assignee: Dmitry Lukyanov (Inactive)
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

MongoDB Sharded Cluster version: 4.2.2-ent
OS: Ubuntu 16.04



 Description   

Running transactions on 5 threads in parallel produces many 251 errors:

MongoCommandException, 251, "NoSuchTransaction", "Command insert failed: cannot continue txnId 35 for session 618e6cd1-4db1-40ea-8b22-6386e204c36b - xxx with txnId 36"

 

Code to reproduce:

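using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public class TransactionTest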
{
    private const string DatabaseName = "PressureTest";
    private const string CollectionName = "Test";
    public const string ConnectionString = "";
    public MongoClient GetMongoClient(int timeout = 5)
    {
        var clientSettings = MongoClientSettings.FromConnectionString(ConnectionString);
        clientSettings.ConnectTimeout = TimeSpan.FromSeconds(5);
        clientSettings.ServerSelectionTimeout = TimeSpan.FromSeconds(timeout);
        clientSettings.AllowInsecureTls = true;
        var mongoClient = new MongoClient(clientSettings);
        return mongoClient;
    }
 
    public async Task TestTransactionAsync()
    {
        var client = GetMongoClient();
        var tasks = new List<Task>();
        for (int i = 0; i < 5; ++i)
        {
            //var client = GetMongoClient(i + 5);
            tasks.Add(DoAsync(client));
        }
        await Task.WhenAll(tasks);
    }
 
    private async Task DoAsync(IMongoClient mongoClient)
    {
        Console.WriteLine("Client hashcode: " + mongoClient.GetHashCode());
        var collection = mongoClient.GetDatabase(DatabaseName).GetCollection<BsonDocument>(CollectionName);
 
        while (true)
        {
            var uuid1 = Guid.NewGuid().ToString("N").Substring(24);
            var uuid2 = Guid.NewGuid().ToString("N").Substring(24);
            try
            {
                using (var session = await mongoClient.StartSessionAsync())
                {
                    session.StartTransaction();
                    await collection.InsertOneAsync(session, new BsonDocument("Uuid", uuid1));
                    await collection.InsertOneAsync(session, new BsonDocument("Uuid", uuid2));
 
                    await session.CommitTransactionAsync();
                }
                Console.WriteLine($"[{uuid1}] [{uuid2}]");
            }
            catch (Exception e)
            {
                Console.WriteLine("$$$ " + e.Message);
            }
        }
    }
}

 

If the thread count is changed to 1, no errors occur.

 

If the mongoClient is not reused, that is, if TestTransactionAsync() is changed to create a dedicated mongoClient for each thread, no errors occur:

public async Task TestTransactionAsync()
{
    var tasks = new List<Task>();
    for (int i = 0; i < 5; ++i)
    {
        var client = GetMongoClient(i + 5);
        tasks.Add(DoAsync(client));
    }
    await Task.WhenAll(tasks);
}

The modification above intentionally passes a different ServerSelectionTimeout value to each client so that the MongoClient instances do not share connection pools. See: https://mongodb.github.io/mongo-csharp-driver/2.11/reference/driver/connecting/#mongo-client

"multiple MongoClient instances created with the same settings will utilize the same connection pools underneath."

The documentation suggests reusing the MongoClient by storing it in a global place. However, a singleton MongoClient leads to parallel transaction failures.



 Comments   
Comment by Dmitry Lukyanov (Inactive) [ 11/Jan/21 ]

Thanks finiskygarden@gmail.com for your report!

Comment by fini sky [ 11/Jan/21 ]

I found the root cause: the load balancer in front of mongos. Since there are 2 mongos instances behind the same stateless Kubernetes service, a transaction may not be executed on the same mongos throughout its lifetime.

 

I'll expose every mongos instance separately and change the connection string.
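
For illustration, a connection string along the following lines lists each mongos directly instead of a single load-balanced address. This is only a sketch: the host names mongos-0.example.net and mongos-1.example.net and the class name MongosSeedList are hypothetical, and it assumes each address reaches exactly one mongos. With every mongos in the seed list, the driver is expected to keep a transaction on the mongos it started on, which a shared service address cannot guarantee.

```csharp
using MongoDB.Driver;

public static class MongosSeedList
{
    // Hypothetical host names: one entry per mongos, each reachable directly
    // rather than through a shared load-balanced Kubernetes service.
    public const string ConnectionString =
        "mongodb://username:password@mongos-0.example.net:27017,mongos-1.example.net:27017/?authSource=admin&ssl=true";

    // Builds a single client that can be shared by all threads.
    public static MongoClient Create()
    {
        var settings = MongoClientSettings.FromConnectionString(ConnectionString);
        return new MongoClient(settings);
    }
}
```

The client returned by Create() could then be passed to DoAsync in place of the one built by GetMongoClient().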

 

Many thanks to Dmitry Lukyanov for the kind help!

Comment by fini sky [ 10/Jan/21 ]

Console app environment: .net core 2.2 (2.2.8)

Comment by fini sky [ 09/Jan/21 ]

Thanks Dmitry Lukyanov for your reply!

  1. Connection string: "mongodb://username:password@ip:27017/?authSource=admin&ssl=true"
  2. Yes, driver 2.11.5. I also tried 2.11.4 and saw the same problem, although 2.11.5 seems to be slightly better than 2.11.4
  3. I have now updated the cluster to 4.4.3-ent and the problem is still there. I deployed the cluster using MongoDB Ops Manager/Kubernetes Operator: 2 shards (each with 3 mongods), 3 config servers, 2 mongos (exposed publicly through a Kubernetes service instead of a NodePort)
  4. Because the cluster is deployed by Ops Manager, it's not that easy to try Windows
  5. So far, both Int and Prod environments have this issue
  6. Almost constantly, and it happens immediately. Below is a log from a recent run
  7. To reproduce the issue, try increasing the thread count (e.g. 10, 20 or more)

 

Log with 10 threads (driver 2.11.5, server 4.4.3-ent):

Client hashcode: 33675143
Client hashcode: 33675143
Client hashcode: 33675143
Client hashcode: 33675143
Client hashcode: 33675143
Client hashcode: 33675143
Client hashcode: 33675143
Client hashcode: 33675143
Client hashcode: 33675143
Client hashcode: 33675143
$$$ Command insert failed: cannot continue txnId -1 for session 97c4ac3a-ad7d-4bbe-946b-22fee5860f88 - 9wor+x5Uxq/hr+= with txnId 1.
$$$ Command insert failed: cannot continue txnId -1 for session 24eb8e57-e1dd-4a34-aea5-70894e0b2cb5 - 9wor+x5Uxq/hr+= with txnId 1.
[1ade59fc] [cc112565]
[2ae07e6e] [71a6f009]
[d4d2eda8] [0608dd32]
[e0dd47e7] [06fbaa8d]
$$$ Command commitTransaction failed: Recovering the transaction's outcome found the transaction aborted.
$$$ Command commitTransaction failed: Recovering the transaction's outcome found the transaction aborted.
$$$ Command commitTransaction failed: Recovering the transaction's outcome found the transaction aborted.
$$$ Command insert failed: cannot continue txnId 1 for session 97c4ac3a-ad7d-4bbe-946b-22fee5860f88 - 9wor+x5Uxq/hr+= with txnId 2.
$$$ Command commitTransaction failed: Recovering the transaction's outcome found the transaction aborted.
$$$ Command insert failed: cannot continue txnId 1 for session 24eb8e57-e1dd-4a34-aea5-70894e0b2cb5 - 9wor+x5Uxq/hr+= with txnId 2.
[ab173381] [519df04d]
[b193a4fc] [ad94d234]
$$$ Command insert failed: cannot continue txnId 2 for session 97c4ac3a-ad7d-4bbe-946b-22fee5860f88 - 9wor+x5Uxq/hr+= with txnId 3.
$$$ Command insert failed: cannot continue txnId -1 for session e11b391f-e0e9-468d-85f1-1df24b9a0c07 - 9wor+x5Uxq/hr+= with txnId 2.
[752ba630] [59dd4708]
[52c34c5b] [f1c3e500]
$$$ Command commitTransaction failed: Recovering the transaction's outcome found the transaction aborted.
[fe25414c] [29dd2c5e]
[d6389f18] [1c39a7d4]
[9b3cf008] [6adc3e55]
[0531d97e] [83905aa9]
[ba7c4436] [7abe3365]
[a0d515f9] [c5bfe751]
$$$ Command commitTransaction failed: Recovering the transaction's outcome found the transaction aborted.
$$$ Command insert failed: cannot continue txnId -1 for session 59d537c4-25c1-4aa4-aa00-979b13bc862d - 9wor+x5Uxq/hr+= with txnId 2.
[f09fe3b5] [968407cb]
[2be49268] [60e18e66]
[570c8478] [f1a7b2b8]
[63bb494d] [90ae38a5]
[aa70ccf9] [35ea377c]
$$$ Command insert failed: cannot continue txnId 3 for session 2c82734a-d658-4f09-9c5d-d8912090b538 - 9wor+x5Uxq/hr+= with txnId 4.
$$$ Command insert failed: cannot continue txnId 2 for session 97c4ac3a-ad7d-4bbe-946b-22fee5860f88 - 9wor+x5Uxq/hr+= with txnId 5.
[2283ab6e] [6c34e3d0]
$$$ Command insert failed: cannot continue txnId 4 for session 24eb8e57-e1dd-4a34-aea5-70894e0b2cb5 - 9wor+x5Uxq/hr+= with txnId 5.
[5cc3c914] [02d4ffe1]
[7e7d0992] [1b078e08]
[e732104f] [dfa812ba]
[5c6b83f0] [49cecedd]
[caa0d28c] [97d72034]
[eb58e248] [48164d81]
[4c9d22a6] [9bd38f9c]
$$$ Command commitTransaction failed: Recovering the transaction's outcome found the transaction aborted.
[3d1b756b] [4c38342b]
$$$ Command insert failed: cannot continue txnId 2 for session fce79585-fc3a-42c6-bdf3-50db0b6ec66e - 9wor+x5Uxq/hr+= with txnId 4.
$$$ Command commitTransaction failed: Recovering the transaction's outcome found the transaction aborted.
$$$ Command insert failed: cannot continue txnId 2 for session 69a8ab4f-0353-4b9a-ba46-c67cf52e578e - 9wor+x5Uxq/hr+= with txnId 4.
[ba883283] [afb36fb1]
[3c4d9b2c] [f7fcd3ca]
[bd32440d] [ed75258a]
[81c0b002] [8df58e78]
$$$ Command insert failed: cannot continue txnId 4 for session 24eb8e57-e1dd-4a34-aea5-70894e0b2cb5 - 9wor+x5Uxq/hr+= with txnId 7.
$$$ Command commitTransaction failed: Recovering the transaction's outcome found the transaction aborted.
[618e7ec0] [791900b3]
[6e0decd9] [aea1d3a5]
[b46d248f] [b8542be9]
[bed990b6] [ab7d267b]
[37926f7c] [e3dfb8f0]
$$$ Command commitTransaction failed: Recovering the transaction's outcome found the transaction aborted.
$$$ Command insert failed: cannot continue txnId 6 for session 97c4ac3a-ad7d-4bbe-946b-22fee5860f88 - 9wor+x5Uxq/hr+= with txnId 8.
[924ee32a] [eb47183f]
[7b09a476] [405e017b]
[2c681de7] [ed59df5e]
[f00300cc] [43dff5e3]
[a133cd47] [11f7e692]
$$$ Command insert failed: cannot continue txnId 2 for session 59d537c4-25c1-4aa4-aa00-979b13bc862d - 9wor+x5Uxq/hr+= with txnId 6.
$$$ Command insert failed: cannot continue txnId 4 for session 24eb8e57-e1dd-4a34-aea5-70894e0b2cb5 - 9wor+x5Uxq/hr+= with txnId 8.
[08723d2c] [97a1158e]
[49594714] [b58e0344]
[fc727dd0] [d5cf3ef5]
[29a31282] [878c6b9c]
[2aab8797] [422ff905]
[1ccd10ea] [47d24195]
$$$ Command insert failed: cannot continue txnId 7 for session e11b391f-e0e9-468d-85f1-1df24b9a0c07 - 9wor+x5Uxq/hr+= with txnId 9.

Comment by Dmitry Lukyanov (Inactive) [ 08/Jan/21 ]

Hello finiskygarden@gmail.com. I tried to reproduce the issue from your description, but with no luck. Can you please specify or confirm as many of the following details as you can:

  1. Full connection string (with hidden credentials)
  2. You use the driver 2.11.5
  3. Sharded cluster with server 4.2. Any other cluster configuration details.
  4. You run it on Ubuntu 16.04. Can you check it on another OS (for example Windows)?
  5. Does it happen in some environments but not others? Locally/Staging/Production?
  6. Does it happen constantly? Does it happen immediately or after a while?
  7. Any other setup configuration that may be useful

Please let me know if you have any questions.

Comment by Dmitry Lukyanov (Inactive) [ 07/Jan/21 ]

Thanks finiskygarden@gmail.com for your report. We will investigate this and let you know the results.
