[SERVER-69480] TransientTransactionError label potentially applied incorrectly Created: 06/Sep/22  Updated: 22/Nov/22  Resolved: 22/Nov/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Alex Bevilacqua Assignee: Yujin Kang Park
Resolution: Done Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File screenshot-1.png     File server-69480.js    
Issue Links:
Depends
depends on SERVER-61909 Hang inserting or deleting document w... Closed
Related
Operating System: ALL
Steps To Reproduce:

1. Set up the environment (single-node replica set, MongoDB 6.0.1)

# MongoDB 6.0.1-ent
mongod --dbpath data --replSet rs0 --wiredTigerCacheSizeGB 1
mongosh --quiet --eval 'rs.initiate()'

2. Test with C# Driver (2.17.1 / .NET 6.0)

using MongoDB.Bson;
using MongoDB.Driver;
 
Random random = new();
void Log(string value)
{
    Console.WriteLine($"{DateTime.Now}: {value}");
}
string RandomString(int length)
{
    const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    return new string(Enumerable.Repeat(chars, length)
        .Select(s => s[random.Next(s.Length)]).ToArray());
}
 
Log("Starting");
var junk = RandomString(1024 * 1024 * 5);
 
var client = new MongoClient("mongodb://localhost:27017/test?retryWrites=true");
var db = client.GetDatabase("test");
 
var cacheSize = db.RunCommand<BsonDocument>(new BsonDocument("serverStatus", 1))["wiredTiger"]["cache"]["maximum bytes configured"].ToInt64(); // reported as a 64-bit value, so AsInt32 can throw
Log($"WiredTiger Cache = {cacheSize}");
var collection = db.GetCollection<BsonDocument>("test");
 
string? result = null;
 
using (var session = client.StartSession())
{
    var transactionOptions = new TransactionOptions(
        readPreference: ReadPreference.Primary,
        readConcern: ReadConcern.Local,
        writeConcern: WriteConcern.WMajority);
 
    var cancellationToken = CancellationToken.None; // normally a real token would be used
    result = session.WithTransaction(
        (s, ct) =>
        {
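            // Insert ~250 MiB (50 x 5 MiB documents) in a single multi-document
            // transaction, enough to overwhelm the 1 GiB WiredTiger cache configured above.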
            for (var i = 0; i < 50; i++)
            {
                collection.InsertOne(s, new BsonDocument { { "junk", junk } }, cancellationToken: ct);
            }            
            return "Success!";
        },
        transactionOptions,
        cancellationToken);
}
 
Log("Done");

Sprint: Execution Team 2022-10-31, Execution Team 2022-11-28
Participants:
Case:

 Description   

The log message for MongoDB 6.0 is:

{"t":{"$date":"2022-09-06T15:19:36.865-04:00"},"s":"I","c":"COMMAND","id":51803,"ctx":"conn11","msg":"Slow query","attr":{"type":"command","ns":"test.test","command":{"insert":"test","ordered":true,"$db":"test","lsid":{"id":{"$uuid":"b8d52586-f9d3-4833-aa4a-a2d9b1aa88c1"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1662491969,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"txnNumber":1,"autocommit":false},"ninserted":0,"numYields":0,"ok":0,"errMsg":"WiredTigerIdIndex::_insert: index: _id_; uri: table:index-39-129709413832108471 :: caused by :: WriteConflict error: this operation conflicted with another operation. Please retry your operation or multi-document transaction.","errName":"WriteConflict","errCode":112,"reslen":490,"locks":{"FeatureCompatibilityVersion":{"acquireCount":{"r":1,"w":3}},"ReplicationStateTransition":{"acquireCount":{"w":6}},"Global":{"acquireCount":{"r":1,"w":3}},"Database":{"acquireCount":{"w":3}},"Collection":{"acquireCount":{"r":1,"w":3}},"Mutex":{"acquireCount":{"r":39}}},"flowControl":{"acquireCount":1,"timeAcquiringMicros":1},"readConcern":{"level":"local","provenance":"clientSupplied"},"storage":{"timeWaitingMicros":{"cache":139271}},"remote":"127.0.0.1:62025","protocol":"op_msg","durationMillis":185}}

Previously (MongoDB 5.0) this was logged as follows (likely due to WT-8290 not being backported):

{"t":{"$date":"2022-09-06T14:41:54.550-04:00"},"s":"I","c":"STORAGE","id":22430,"ctx":"conn5","msg":"WiredTiger message","attr":{"message":"oldest pinned transaction ID rolled back for eviction"}}
{"t":{"$date":"2022-09-06T14:41:54.550-04:00"},"s":"I","c":"TXN","id":51802,"ctx":"conn5","msg":"transaction","attr":{"parameters":{"lsid":{"id":{"$uuid":"a5596ab7-9c31-4198-9548-7fc30bd8e61a"},"uid":{"$binary":{"base64":"47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=","subType":"0"}}},"txnNumber":1,"autocommit":false,"readConcern":{"level":"local","provenance":"clientSupplied"}},"readTimestamp":"Timestamp(0, 0)","ninserted":29,"keysInserted":29,"terminationCause":"aborted","timeActiveMicros":1240154,"timeInactiveMicros":321842,"numYields":0,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":5}},"Global":{"acquireCount":{"r":1,"w":2}},"Database":{"acquireCount":{"w":2}},"Collection":{"acquireCount":{"w":2}},"Mutex":{"acquireCount":{"r":31}}},"storage":{"data":{"bytesRead":13101},"timeWaitingMicros":{"cache":739459}},"wasPrepared":false,"durationMillis":1561}}

The transactions specification (Interaction with Retryable Writes) indicates that:

In MongoDB 4.0 the only supported retryable write commands within a transaction are `commitTransaction` and `abortTransaction`. Therefore drivers MUST NOT retry write commands within transactions even when `retryWrites` has been enabled on the MongoClient. In addition, drivers MUST NOT add the `RetryableWriteError` label to any error that occurs during a write command within a transaction (excepting `commitTransaction` and `abortTransaction`), even when `retryWrites` has been enabled on the MongoClient.

It appears here that the failure isn't with the write itself but with updating the indexes. Should this still be labelled as a TransientTransactionError?
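To illustrate why the label matters, here is a minimal sketch (not part of the original report) that drives the same kind of transaction manually instead of through WithTransaction, so the error labels on the thrown MongoException are visible; WithTransaction keys its retry behaviour on the TransientTransactionError label:

using MongoDB.Bson;
using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost:27017/test");
var collection = client.GetDatabase("test").GetCollection<BsonDocument>("test");

using var session = client.StartSession();
session.StartTransaction();
try
{
    // A single large-ish insert; the repro above does 50 x 5 MiB of these.
    collection.InsertOne(session, new BsonDocument("junk", new string('x', 1024 * 1024)));
    session.CommitTransaction();
}
catch (MongoException ex)
{
    // HasErrorLabel exposes the labels the server attached to the error.
    Console.WriteLine($"TransientTransactionError label present: {ex.HasErrorLabel("TransientTransactionError")}");
    Console.WriteLine($"Error: {ex.Message}");
    session.AbortTransaction();
}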

A screenshot of the reproduction running in Visual Studio 2022 is attached as screenshot-1.png.



 Comments   
Comment by Yujin Kang Park [ 22/Nov/22 ]

SERVER-61909 has been completed, and a non-retryable TransactionTooLargeForCache error has been introduced. As louis.williams@mongodb.com mentioned in the previous comments, this error is thrown when a transaction rolls back and is found to be using a large portion (75% by default) of the total dirty cache. Since the dirty cache limit is 20% of the total cache, this means that the largest transaction may only occupy about 15% of the total size of the storage engine cache.
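For the 1 GiB cache used in the reproduction above, that works out roughly as follows (a back-of-the-envelope sketch using the default percentages quoted in this comment; the constants are illustrative, not server configuration):

const long totalCacheBytes = 1L << 30;                   // --wiredTigerCacheSizeGB 1
long dirtyCacheBytes = (long)(totalCacheBytes * 0.20);   // dirty cache target: ~205 MiB
long maxTxnDirtyBytes = (long)(dirtyCacheBytes * 0.75);  // per-transaction limit: ~154 MiB (~15% of total)
long reproPayloadBytes = 50L * 5 * 1024 * 1024;          // repro inserts 50 x 5 MiB documents: ~250 MiB
Console.WriteLine(reproPayloadBytes > maxTxnDirtyBytes); // True, so the repro now trips TransactionTooLargeForCache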

We verified with the reproducer (converted to JS) server-69480.js that this new TransactionTooLargeForCache error is received.

Some more details can be found here:
https://github.com/10gen/mongo/blob/master/src/mongo/db/catalog/README.md#transactiontoolargeforcacheexception

Comment by Louis Williams [ 31/Oct/22 ]

For those watching, we are implementing this work in an existing ticket, SERVER-61909. The expected outcome is that we will return a non-retryable error when a transaction rolls back and we detect that it has been consuming large portions of cache space.

Comment by Louis Williams [ 20/Sep/22 ]

For watchers, there's a lot of context and history related to this in SERVER-60839 and SERVER-61909. I'll try to summarize here:

WiredTiger can roll back a transaction for multiple reasons. Usually, it's due to a write-write conflict with another transaction. In the case of this application, they're creating a very large transaction that is too big for the cache. WiredTiger, in an attempt to free up space, starts rolling back transactions, starting from the oldest. Eventually, it gets to the transaction causing problems and rolls it back. In MongoDB we don't know why WiredTiger rolled back the transaction, which is why we implemented WT-8290 to give us that information. However, even with that new information, we still don't know whether the rolled-back transaction was the cause of the problem or a victim. For that reason, we continue to return a retryable WriteConflict error when we hit this inside multi-document transactions.

As far as solutions go, we could return a non-retryable TemporarilyUnavailable error code in this circumstance. We already return this for writes outside of multi-document transactions as of 6.1 (SERVER-67984). The risk of doing this for multi-document transactions is that any transaction can fail with this error, even if it is not the transaction causing the problem. We felt that this was too risky a change that could cause unexpected errors in applications using transactions, so we decided not to change any behavior. For non-transactional writes, we retry a fixed number of times internally before eventually letting the error escape; this is done in an attempt to mitigate the application impact on victim operations. To do the same for multi-document transactions, we would need driver support to handle TemporarilyUnavailable errors in a similar way to avoid delivering spurious application errors. If we think the risk of spurious application errors is overstated, we can evaluate returning a TemporarilyUnavailable error for multi-document writes in 6.1 (SERVER-67984). We should also consider scheduling follow-on driver work to deal with spurious TU errors.
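As a rough illustration of what that driver/application-side handling could look like for a plain (non-transactional) write, here is a sketch assuming the .NET driver; the numeric error code (365 for TemporarilyUnavailable), the exception surface, the retry count, and the backoff are assumptions for illustration, not an agreed design:

using MongoDB.Bson;
using MongoDB.Driver;

const int TemporarilyUnavailableCode = 365; // assumed numeric code for TemporarilyUnavailable

var collection = new MongoClient("mongodb://localhost:27017")
    .GetDatabase("test").GetCollection<BsonDocument>("test");

for (var attempt = 1; ; attempt++)
{
    try
    {
        collection.InsertOne(new BsonDocument("junk", new string('x', 1024 * 1024)));
        break;
    }
    catch (MongoWriteException ex) when (ex.WriteError?.Code == TemporarilyUnavailableCode && attempt < 5)
    {
        // Back off before retrying, mirroring the fixed number of internal retries described above.
        Thread.Sleep(TimeSpan.FromMilliseconds(100 * attempt));
    }
}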

We are also considering another solution, SERVER-61909, which leverages WT-8848 to limit transactions to a specific amount of dirty content. If we detect a transaction has exceeded that limit, then we will return a non-retryable error.

Comment by Alex Bevilacqua [ 20/Sep/22 ]

louis.williams@mongodb.com 5.0 and 6.0 are functionally (and I believe behaviorally) the same - only what is logged is different.

Comment by Louis Williams [ 20/Sep/22 ]

alex.bevilacqua@mongodb.com can you clarify whether or not there was a behavioral change between 5.0 and 6.0? I see that you posted the log messages from both versions, but is there a behavioral difference?
