Core Server / SERVER-91413

Executing an aggregation with transaction sub-stages that makes cursors on another shard can return NotARetryableWriteCommand error

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 8.1.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • None
    • Query Execution
    • Fully Compatible
    • ALL
    • v8.0
    • QE 2024-07-08, QE 2024-07-22, QE 2024-08-05, QE 2024-08-19
    • 200

      This can happen when the operation hits an error during unyield.

      Here's an example sequence:

      1. A sub-pipeline on the shard tries to perform a lookup from viewA (a view on collectionX) on collectionX.
      2. The sub-pipeline executes the lookup using CursorEstablisher, which uses ARS (AsyncRequestsSender) under the hood.
      3. It sends requests to two shards: shardA and shardB.
      4. ARS yields and waits for a response from one of the shards.
      5. shardA returns a stale config exception, and CursorEstablisher stores the error in _maybeFailure.
      6. ARS::next is called again because it is not yet done (!done); it waits for the response from shardB and yields again.
      7. The response comes back and unyield is called, but it fails to unstash because it could not acquire a ticket. Note: TransactionParticipantResourceYielder::yield only stashes locks. The locks are still held while yielded, but the tickets are released, so reacquiring the tickets can fail when unstashing.
      8. The resulting early return causes the session to be checked back in.
      9. CursorEstablisher sees the new error, but because _maybeFailure was already set, the new error is effectively ignored: it does not have priority (currently, only UUID errors can override an existing _maybeFailure).
      10. The router role sees the stale config error from establishCursors and retries the operation. But because the session was already checked back in, transaction_request_sender_details::attachTxnDetails ends up being a no-op.
      11. The shard rejects the request with NotARetryableWriteCommand because the retried request is missing the relevant transaction fields.
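
      The sequence above can be approximated with a small toy model. This is only an illustrative sketch in Python, not the server's C++ code: the classes, function names, and the error names "TicketAcquisitionFailed" and "UUIDError" are hypothetical stand-ins, and the shard's check is deliberately simplified. It shows how keeping the first failure hides the unyield error, so the retry goes out without transaction fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    """Toy stand-in for a checked-out transaction session (hypothetical)."""
    txn_number: int
    checked_out: bool = True


@dataclass
class ToyCursorEstablisher:
    """Keeps the first failure; later ones are dropped unless they have priority."""
    maybe_failure: Optional[str] = None

    def record_failure(self, error: str) -> None:
        # Per the ticket, only UUID errors may override an existing failure.
        # "UUIDError" is a placeholder name, not a real error code.
        if self.maybe_failure is None or error == "UUIDError":
            self.maybe_failure = error


def unyield(session: Session, ticket_available: bool) -> Optional[str]:
    """Return an error name if the ticket cannot be reacquired, else None."""
    if not ticket_available:
        # Early return: the session ends up checked back in (step 8).
        session.checked_out = False
        return "TicketAcquisitionFailed"  # hypothetical error name
    return None


def attach_txn_details(session: Session, request: dict) -> dict:
    """Attach transaction fields only while the session is checked out."""
    if not session.checked_out:
        return request  # no-op (step 10)
    return {**request, "txnNumber": session.txn_number, "autocommit": False}


def shard_handle(request: dict) -> str:
    """Simplified shard check: a retry without transaction fields is rejected."""
    return "ok" if "txnNumber" in request else "NotARetryableWriteCommand"


def run_scenario() -> tuple:
    session = Session(txn_number=7)
    establisher = ToyCursorEstablisher()

    establisher.record_failure("StaleConfig")        # step 5: shardA fails
    err = unyield(session, ticket_available=False)   # step 7: unstash fails
    if err is not None:
        establisher.record_failure(err)              # step 9: dropped

    # Step 10: the router only sees StaleConfig, so it retries.
    retried = attach_txn_details(session, {"aggregate": "collectionX"})
    return establisher.maybe_failure, shard_handle(retried)  # step 11
```

      Calling run_scenario() in this model returns ("StaleConfig", "NotARetryableWriteCommand"): the surfaced failure is the retryable stale config error, while the unyield failure that invalidated the session is never reported.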

      Also attached a diff to demonstrate the establishCursor behavior.

      This ticket should also revert SERVER-91414

            Assignee:
            mickey.winters@mongodb.com Mickey Winters
            Reporter:
            randolph@mongodb.com Randolph Tan
            Votes:
            0
            Watchers:
            4