[SERVER-65633] withTxnAndAutoRetry() retries entire transaction when commitTransaction fails with FailedToSatisfyReadPreference Created: 13/Apr/22  Updated: 06/May/22  Resolved: 02/May/22

Status: Closed
Project: Core Server
Component/s: Sharding, Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Andrew Witten (Inactive)
Resolution: Done Votes: 0
Labels: sharding-nyc-subteam2
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Problem/Incident
is caused by SERVER-62377 withTxnAndAutoRetry() must retry comm... Closed
Operating System: ALL
Sprint: Sharding 2022-05-02
Participants:
Story Points: 3

 Description   

The changes from 50caf49 as part of SERVER-62377 added the following logic to the withTxnAndAutoRetry() helper function. Setting hasTransientError causes a new transaction to be started. However, FailedToSatisfyReadPreference doesn't guarantee the commitTransaction command wasn't sent because mongos may have retried the commitTransaction command internally due a retryable error and then encountered the FailedToSatisfyReadPreference error.

// FailedToSatisfyReadPreference errors are not retryable.
// However, they should be because if there is no primary, there should be one soon.
// TODO SERVER-60706: Make FailedToSatisfyReadPreference a transient error
if (e.code == ErrorCodes.FailedToSatisfyReadPreference) {
    print("Retrying transaction due to a FailedToSatisfyReadPreference error.");
    hasTransientError = true;
    continue;
}

The withTxnAndAutoRetry() helper function must retry only the commitTransaction command to learn the definitive result of the transaction. Upon seeing an error labeled with TransientTransactionError may the withTxnAndAutoRetry() helper function retry the entire transaction. Retrying the commitTransaction command on FailedToSatisfyReadPreference is perhaps best done inside commitTransactionWithKilledSessionRetries().


This has been observed to cause a spurious duplicate key error due to the mongo shell when running the create_and_drop_collection.js FSM workload in the concurrency_sharded_multi_stmt_txn_with_stepdowns Evergreen task.

[j0:s0] | 2022-03-25T16:07:09.162+00:00 D3 EXECUTOR 22607   [conn1685] "Scheduling remote command request","attr":{"request":"RemoteCommand 25962 -- target:[localhost:20003] db:test18_fsmdb0 cmd:{ insert: \"fsmcoll0\", bypassDocumentValidation: false, ordered: true, documents: [ { _id: 14.0 } ], shardVersion: [ Timestamp(0, 0), ObjectId('000000000000000000000000'), Timestamp(0, 0) ], databaseVersion: { uuid: UUID(\"f2af1ae1-5f1a-41eb-b2ab-40f6eb21b128\"), timestamp: Timestamp(1648224429, 9), lastMod: 1 }, lsid: { id: UUID(\"f5ff2e85-d579-4529-919c-675912543218\"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 33, readConcern: { level: \"snapshot\", atClusterTime: Timestamp(1648224429, 9) }, startTransaction: true, coordinator: true, autocommit: false }"}
[j0:s0] | 2022-03-25T16:07:09.181+00:00 D3 EXECUTOR 22607   [conn1685] "Scheduling remote command request","attr":{"request":"RemoteCommand 25964 -- target:[localhost:20003] db:admin cmd:{ commitTransaction: 1, writeConcern: { w: \"majority\", wtimeout: 300321, provenance: \"clientSupplied\" }, coordinator: true, autocommit: false, txnNumber: 33, lsid: { id: UUID(\"f5ff2e85-d579-4529-919c-675912543218\"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) } }"}
...
[j0:s0] | 2022-03-25T16:07:09.194+00:00 D2 ASIO     22597   [conn1685] "Request finished with response","attr":{"requestId":25964,"isOK":true,"response":"{ readOnly: false, writeConcernError: { code: 11602, codeName: \"InterruptedDueToReplStateChange\", errmsg: \"operation was interrupted\", errInfo: { writeConcern: { w: \"majority\", wtimeout: 300321, provenance: \"clientSupplied\" } } }, ok: 1.0, topologyVersion: { processId: ObjectId('623de6b9393bc43269e4007c'), counter: 95 }, lastCommittedOpTime: Timestamp(1648224429, 11), $clusterTime: { clusterTime: Timestamp(1648224429, 13), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, $configTime: Timestamp(1648224429, 10), $topologyTime: Timestamp(1648223955, 17), operationTime: Timestamp(1648224429, 12) }"}
...
[j0:s0] | 2022-03-25T16:07:24.195+00:00 D1 ASSERT   23074   [conn1685] "User assertion","attr":{"error":"FailedToSatisfyReadPreference: Could not find host matching read preference { mode: \"primary\" } for set shard-rs0","file":"src/mongo/s/transaction_router.cpp","line":207}
[j0:s0] | 2022-03-25T16:07:24.195+00:00 D3 TXN      22897   [conn1685] "Implicitly aborting transaction on all participant shards","attr":{"sessionId":{"id":{"$uuid":"f5ff2e85-d579-4529-919c-675912543218"},"uid":{"$binary":{"base64":"47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=","subType":"0"}}},"txnNumber":33,"txnRetryCounter":0,"numParticipantShards":1,"error":"FailedToSatisfyReadPreference: Could not find host matching read preference { mode: \"primary\" } for set shard-rs0"}
...
[j0:s0] | 2022-03-25T16:07:31.747+00:00 D3 EXECUTOR 22607   [conn1685] "Scheduling remote command request","attr":{"request":"RemoteCommand 26007 -- target:[localhost:20004] db:test18_fsmdb0 cmd:{ insert: \"fsmcoll0\", bypassDocumentValidation: false, ordered: true, documents: [ { _id: 14.0 } ], shardVersion: [ Timestamp(0, 0), ObjectId('000000000000000000000000'), Timestamp(0, 0) ], databaseVersion: { uuid: UUID(\"f2af1ae1-5f1a-41eb-b2ab-40f6eb21b128\"), timestamp: Timestamp(1648224429, 9), lastMod: 1 }, lsid: { id: UUID(\"f5ff2e85-d579-4529-919c-675912543218\"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 34, readConcern: { level: \"snapshot\", atClusterTime: Timestamp(1648224451, 2) }, startTransaction: true, coordinator: true, autocommit: false }"}
[fsm_workload_test:create_and_drop_collection] [tid:0] Retrying transaction due to a FailedToSatisfyReadPreference error.
...
[fsm_workload_test:create_and_drop_collection] uncaught exception: Error: 1 thread with tids [0] threw
[fsm_workload_test:create_and_drop_collection]         	"errmsg" : "E11000 duplicate key error collection: test18_fsmdb0.fsmcoll0 index: _id_ dup key: { _id: 14.0 } found value: RecordId(1)",
[fsm_workload_test:create_and_drop_collection]         	"op" : {
[fsm_workload_test:create_and_drop_collection]         		"_id" : 14
[fsm_workload_test:create_and_drop_collection]         	}

https://logkeeper.mongodb.org/lobster/build/bdd28368e04d3a19371ebfc49c1dbc1d/test/623de8a2be07c479035d686c#bookmarks=0%2C8060%2C8124%2C8167%2C10092%2C10093%2C11685%2C11696%2C15388%2C15394%2C15395%2C15396%2C15397%2C16446&f~=100~Scheduling%20remote%20command%20request.%2A%28insert%7CcommitTransaction%29&l=1



 Comments   
Comment by Githook User [ 06/May/22 ]

Author:

{'name': 'Andrew Witten', 'email': 'andrew.witten@mongodb.com', 'username': 'awitten1'}

Message: SERVER-65633 retry commitTransaction on FailedToSatisfyReadPreference

(cherry picked from commit e90e25cc5ad352d0c8b074f0ad49945b483d7131)
Branch: v5.0
https://github.com/mongodb/mongo/commit/4a919ea37a9bccd2941f9eb0c7ac7a72ae178833

Comment by Githook User [ 05/May/22 ]

Author:

{'name': 'Andrew Witten', 'email': 'andrew.witten@mongodb.com', 'username': 'awitten1'}

Message: SERVER-65633 retry commitTransaction on FailedToSatisfyReadPreference

(cherry picked from commit e90e25cc5ad352d0c8b074f0ad49945b483d7131)
Branch: v5.3
https://github.com/mongodb/mongo/commit/bc1ac35debfc926c3d12f11f5ca7230010aacd55

Comment by Githook User [ 25/Apr/22 ]

Author:

{'name': 'Andrew Witten', 'email': 'andrew.witten@mongodb.com', 'username': 'awitten1'}

Message: SERVER-65633 retry commitTransaction on FailedToSatisfyReadPreference
Branch: master
https://github.com/mongodb/mongo/commit/e90e25cc5ad352d0c8b074f0ad49945b483d7131

Generated at Thu Feb 08 06:03:11 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.