[SERVER-69234] Make txn_recover_decision_using_recovery_router.js robust to spurious shard version refreshes Created: 29/Aug/22  Updated: 12/Dec/23

Status: Backlog
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Backlog - Cluster Scalability
Resolution: Unresolved Votes: 0
Labels: cs-subteam1, sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-60758 Prevent dbVersion refreshes from fail... Closed
Assigned Teams:
Cluster Scalability
Participants:
Case:
Linked BF Score: 33
Story Points: 2

 Description   

The txn_recover_decision_using_recovery_router.js test runs transactions without any transient error retry logic. It is possible for a shard version refresh to be triggered by a secondary running _flushRoutingTableCacheUpdates against the primary of the replica set shard. The shard version refresh will take CollectionShardingRuntime::_stateChangeMutex (aka CSRLock) in MODE_X and potentially cause the transaction to time out with a LockTimeout.

[js_test:txn_recover_decision_using_recovery_router] s23547| 2022-08-25T09:04:15.033+00:00 D3 EXECUTOR 22607   [conn12] "Scheduling remote command request","attr":{"request":"RemoteCommand 115 -- target:[ip-10-128-58-199:23542] db:test cmd:{ find: \"user\", filter: { x: 1.0 }, shardVersion: { t: Timestamp(1661418244, 76), e: ObjectId('63073b04ea5d5cbf63d64294'), v: Timestamp(2, 0) }, txnNumber: 15, clientOperationKey: UUID(\"3f4bbcec-212d-4c25-a292-2f0aa0312afe\"), readConcern: {}, startTransaction: true, autocommit: false, lsid: { id: UUID(\"c66be7c9-c7f6-450e-9846-ec23ef55a7fe\"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) } }"}
...
[js_test:txn_recover_decision_using_recovery_router] d23542| 2022-08-25T09:04:15.034+00:00 D4 TXN      23984   [conn50] "New transaction started","attr":{"txnNumberAndRetryCounter":{"txnNumber":15,"txnRetryCounter":0},"lsid":{"id":{"$uuid":"c66be7c9-c7f6-450e-9846-ec23ef55a7fe"},"uid":{"$binary":{"base64":"47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=","subType":"0"}}},"apiParameters":{}}
[js_test:txn_recover_decision_using_recovery_router] d23542| | 2022-08-25T09:04:15.039+00:00 I  SH_REFR  4619901 [CatalogCache-1] "Refreshed cached collection","attr":{"namespace":"test.user","lookupSinceVersion":"2|1||63073b04ea5d5cbf63d64294||Timestamp(1661418244, 76)","newVersion":"{ chunkVersion: { t: Timestamp(1661418244, 76), e: ObjectId('63073b04ea5d5cbf63d64294'), v: Timestamp(2, 1) }, forcedRefreshSequenceNum: 11, epochDisambiguatingSequenceNum: 10 }","timeInStore":"{ chunkVersion: \"None\", forcedRefreshSequenceNum: 10, epochDisambiguatingSequenceNum: 9 }","durationMillis":6}
[js_test:txn_recover_decision_using_recovery_router] d23542| 2022-08-25T09:04:15.039+00:00 I  TXN      51802   [conn50] "transaction","attr":{"parameters":{"lsid":{"id":{"$uuid":"c66be7c9-c7f6-450e-9846-ec23ef55a7fe"},"uid":{"$binary":{"base64":"47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=","subType":"0"}}},"txnNumber":15,"txnRetryCounter":0,"autocommit":false,"readConcern":{"provenance":"implicitDefault"}},"readTimestamp":"Timestamp(0, 0)","terminationCause":"aborted","timeActiveMicros":5334,"timeInactiveMicros":24,"numYields":0,"locks":{"FeatureCompatibilityVersion":{"acquireCount":{"w":1}},"ReplicationStateTransition":{"acquireCount":{"w":3}},"Global":{"acquireCount":{"w":1}},"Database":{"acquireCount":{"w":1}},"Collection":{"acquireCount":{"w":1}},"Mutex":{"acquireCount":{"r":2},"acquireWaitCount":{"r":1},"timeAcquiringMicros":{"r":5127}}},"storage":{},"wasPrepared":false,"durationMillis":5}
...
[js_test:txn_recover_decision_using_recovery_router] s23547| 2022-08-25T09:04:15.040+00:00 D2 ASIO     22597   [conn12] "Request finished with response","attr":{"requestId":115,"isOK":true,"response":"{ errorLabels: [ \"TransientTransactionError\" ], ok: 0.0, errmsg: \"Unable to acquire IS lock on '{11529215046068469802: Mutex, 42, test.user}' within 5ms.\", code: 24, codeName: \"LockTimeout\", lastCommittedOpTime: Timestamp(1661418253, 1), $clusterTime: { clusterTime: Timestamp(1661418255, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, $configTime: Timestamp(1661418253, 2), $topologyTime: Timestamp(1661418244, 11), operationTime: Timestamp(1661418255, 1) }"}

https://evergreen.mongodb.com/lobster/build/020fd412154bbc286df8a350e072a348/test/170e8b1523083360427454b935aa341e#bookmarks=0%2C9437%2C9441%2C9442%2C9443%2C9446%2C9473%2C19666&f~=000~23542%5C%7C&shareLine=9442

The txn_recover_decision_using_recovery_router.js test should be updated to either (a) use the retryOnceOnTransientOnMongos() helper for each of its test cases or (b) consider raising the value of the maxTransactionLockRequestTimeoutMillis server parameter to avoid the transaction timing out due to a LockTimeout.



 Comments   
Comment by Jordi Serra Torrens [ 18/Nov/22 ]

BF-26889 corresponds to this same situation on retryable_writes_retry_without_internal_txns.js. I was able to reproduce the failure adding the sleep suggested in the comment above and running on the shard_max_mirroring suite. I'm linking BF-26889 to this ticket as the solution should be the same.

Comment by Max Hirschhorn [ 26/Sep/22 ]

Acceptance criteria:

Generated at Thu Feb 08 06:12:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.