Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-69234

Make txn_recover_decision_using_recovery_router.js robust to spurious shard version refreshes

    • Cluster Scalability
    • 33
    • 2

      The txn_recover_decision_using_recovery_router.js test runs transactions without any transient error retry logic. It is possible for a shard version refresh to be triggered by a secondary running _flushRoutingTableCacheUpdates against the primary of the replica set shard. The shard version refresh will take CollectionShardingRuntime::_stateChangeMutex (aka CSRLock) in MODE_X and potentially cause the transaction to time out with a LockTimeout.

      [js_test:txn_recover_decision_using_recovery_router] s23547| 2022-08-25T09:04:15.033+00:00 D3 EXECUTOR 22607   [conn12] "Scheduling remote command request","attr":{"request":"RemoteCommand 115 -- target:[ip-10-128-58-199:23542] db:test cmd:{ find: \"user\", filter: { x: 1.0 }, shardVersion: { t: Timestamp(1661418244, 76), e: ObjectId('63073b04ea5d5cbf63d64294'), v: Timestamp(2, 0) }, txnNumber: 15, clientOperationKey: UUID(\"3f4bbcec-212d-4c25-a292-2f0aa0312afe\"), readConcern: {}, startTransaction: true, autocommit: false, lsid: { id: UUID(\"c66be7c9-c7f6-450e-9846-ec23ef55a7fe\"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) } }"}
      ...
      [js_test:txn_recover_decision_using_recovery_router] d23542| 2022-08-25T09:04:15.034+00:00 D4 TXN      23984   [conn50] "New transaction started","attr":{"txnNumberAndRetryCounter":{"txnNumber":15,"txnRetryCounter":0},"lsid":{"id":{"$uuid":"c66be7c9-c7f6-450e-9846-ec23ef55a7fe"},"uid":{"$binary":{"base64":"47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=","subType":"0"}}},"apiParameters":{}}
      [js_test:txn_recover_decision_using_recovery_router] d23542| | 2022-08-25T09:04:15.039+00:00 I  SH_REFR  4619901 [CatalogCache-1] "Refreshed cached collection","attr":{"namespace":"test.user","lookupSinceVersion":"2|1||63073b04ea5d5cbf63d64294||Timestamp(1661418244, 76)","newVersion":"{ chunkVersion: { t: Timestamp(1661418244, 76), e: ObjectId('63073b04ea5d5cbf63d64294'), v: Timestamp(2, 1) }, forcedRefreshSequenceNum: 11, epochDisambiguatingSequenceNum: 10 }","timeInStore":"{ chunkVersion: \"None\", forcedRefreshSequenceNum: 10, epochDisambiguatingSequenceNum: 9 }","durationMillis":6}
      [js_test:txn_recover_decision_using_recovery_router] d23542| 2022-08-25T09:04:15.039+00:00 I  TXN      51802   [conn50] "transaction","attr":{"parameters":{"lsid":{"id":{"$uuid":"c66be7c9-c7f6-450e-9846-ec23ef55a7fe"},"uid":{"$binary":{"base64":"47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=","subType":"0"}}},"txnNumber":15,"txnRetryCounter":0,"autocommit":false,"readConcern":{"provenance":"implicitDefault"}},"readTimestamp":"Timestamp(0, 0)","terminationCause":"aborted","timeActiveMicros":5334,"timeInactiveMicros":24,"numYields":0,"locks":{"FeatureCompatibilityVersion":{"acquireCount":{"w":1}},"ReplicationStateTransition":{"acquireCount":{"w":3}},"Global":{"acquireCount":{"w":1}},"Database":{"acquireCount":{"w":1}},"Collection":{"acquireCount":{"w":1}},"Mutex":{"acquireCount":{"r":2},"acquireWaitCount":{"r":1},"timeAcquiringMicros":{"r":5127}}},"storage":{},"wasPrepared":false,"durationMillis":5}
      ...
      [js_test:txn_recover_decision_using_recovery_router] s23547| 2022-08-25T09:04:15.040+00:00 D2 ASIO     22597   [conn12] "Request finished with response","attr":{"requestId":115,"isOK":true,"response":"{ errorLabels: [ \"TransientTransactionError\" ], ok: 0.0, errmsg: \"Unable to acquire IS lock on '{11529215046068469802: Mutex, 42, test.user}' within 5ms.\", code: 24, codeName: \"LockTimeout\", lastCommittedOpTime: Timestamp(1661418253, 1), $clusterTime: { clusterTime: Timestamp(1661418255, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, $configTime: Timestamp(1661418253, 2), $topologyTime: Timestamp(1661418244, 11), operationTime: Timestamp(1661418255, 1) }"}
      

      https://evergreen.mongodb.com/lobster/build/020fd412154bbc286df8a350e072a348/test/170e8b1523083360427454b935aa341e#bookmarks=0%2C9437%2C9441%2C9442%2C9443%2C9446%2C9473%2C19666&f~=000~23542%5C%7C&shareLine=9442

      The txn_recover_decision_using_recovery_router.js test should be updated to either (a) use the retryOnceOnTransientOnMongos() helper for each of its test cases or (b) consider raising the value of the maxTransactionLockRequestTimeoutMillis server parameter to avoid the transaction timing out due to a LockTimeout.

            Assignee:
            backlog-server-cluster-scalability [DO NOT USE] Backlog - Cluster Scalability
            Reporter:
            max.hirschhorn@mongodb.com Max Hirschhorn
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated: