- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Replication
- ALL
- Repl 2025-12-08, Repl 2025-12-22
- 200
BF-40166 saw the following sequence of events:
- rs0:n0 is acting as a router. It receives a start txn request and keeps retrying it, adding rs1 as a participant each time with snapshot readConcern set at cluster time {"t":1761295748,"i":107}
- continuousStepdown hook force kills the primary on rs0 (n0), and n1 is elected as the new primary.
- When rs0:n0 is force killed, it does not populate the full error response that would append the transaction participants in the response to mongos, so rs1 is not cleared/aborted and its snapshot read timestamp remains {"t":1761295748,"i":107}
- mongos0 now retries the txn, this time on rs0:n1, and hits this retry function because of a stale config error (we retry on transient config errors). We are no longer aware of rs1 as a pending participant, so the retry does not clear rs1 and resets the snapshot read time to "$timestamp":{"t":1761295748,"i":195}
- It then tries to add rs1 as a participant again and attaches the new "$timestamp":{"t":1761295748,"i":195} value. This conflicts with the stashed txnResource snapshot read time on rs1, which is still {"t":1761295748,"i":107}. Because this is a non-retryable error, we abort the txn and fail the test.
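The sequence above can be modelled with a small sketch. This is a toy simulation, not server code: `Participant`, `Router`, `SnapshotConflict`, and `reproduce` are hypothetical names invented for illustration, and the only behaviour assumed is what the bullets describe (a participant keeps its stashed snapshot read timestamp until it is aborted, and rejects a different one).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Timestamp = Tuple[int, int]  # (t, i), mirroring {"t": ..., "i": ...}


class SnapshotConflict(Exception):
    """Hypothetical stand-in for the non-retryable error raised on rs1."""


@dataclass
class Participant:
    name: str
    stashed_read_ts: Optional[Timestamp] = None  # models the stashed txnResource

    def start_or_continue(self, read_ts: Timestamp) -> None:
        if self.stashed_read_ts is None:
            self.stashed_read_ts = read_ts
        elif self.stashed_read_ts != read_ts:
            raise SnapshotConflict(
                f"{self.name}: stashed {self.stashed_read_ts}, got {read_ts}"
            )

    def abort(self) -> None:
        self.stashed_read_ts = None


@dataclass
class Router:
    pending: Dict[str, Participant] = field(default_factory=dict)

    def add_participant(self, p: Participant, read_ts: Timestamp) -> None:
        p.start_or_continue(read_ts)
        self.pending[p.name] = p

    def clear_pending(self) -> None:
        # Only participants the router knows about can be aborted.
        for p in self.pending.values():
            p.abort()
        self.pending.clear()


def reproduce() -> str:
    rs1 = Participant("rs1")

    # rs0:n0 adds rs1 with the original snapshot read timestamp.
    Router().add_participant(rs1, (1761295748, 107))

    # rs0:n0 is force killed; the participant list never reaches the retrying
    # side, so the retry starts with no pending participants.
    retry = Router()
    retry.clear_pending()            # nothing to clear, rs1 keeps its stash
    new_ts = (1761295748, 195)       # snapshot read time is reset

    try:
        retry.add_participant(rs1, new_ts)
    except SnapshotConflict as exc:
        return f"conflict: {exc}"
    return "ok"


if __name__ == "__main__":
    print(reproduce())
```

In this model the conflict disappears if the retrying router learns about rs1 before calling `clear_pending`, which is the gap the ticket describes.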
We never ran into this previously because we did not retry at the mongos level when the request was a startTransaction. That check was removed in SERVER-88289, so we now need to account for this scenario. SERVER-88289 removed the check because session/txn invalidation had been improved so that a txn request could be retried on a new primary, but this case evades that improvement because the force kill follows a different error-handling code path.
One possible solution is to append the pending participants to the error response sent to mongos on step down. I'm trying to repro this part of the failure so we can compare the step-down error responses with the errors that go through CheckoutSessionAndInvokeCommand::_tapError
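A rough sketch of what that proposal could look like, under stated assumptions: the field name `pendingParticipants` and the helpers `build_error_response` and `participants_to_abort` are hypothetical and are not the server's actual response format or API. The idea is only that an error response carrying the participant list lets the retrying side abort those participants before it resets the snapshot read time.

```python
from typing import Dict, List, Optional


def build_error_response(
    code_name: str, pending_participants: Optional[List[str]] = None
) -> Dict[str, object]:
    """Build an error reply; attach participants when they are known.

    "pendingParticipants" is a hypothetical field name used for illustration.
    """
    response: Dict[str, object] = {"ok": 0, "codeName": code_name}
    if pending_participants is not None:
        response["pendingParticipants"] = list(pending_participants)
    return response


def participants_to_abort(response: Dict[str, object]) -> List[str]:
    """Return the shards the retrying side should abort before retrying."""
    participants = response.get("pendingParticipants", [])
    return list(participants) if isinstance(participants, list) else []


# Error path that appends participants: rs1 can be aborted before the retry.
with_participants = build_error_response("HypotheticalStepDownError", ["rs1"])

# Error path seen in this failure: the list is missing, so rs1 is left as-is.
without_participants = build_error_response("HypotheticalStepDownError")
```

Comparing the two responses mirrors the comparison the ticket proposes between the step-down error path and the path that goes through `CheckoutSessionAndInvokeCommand::_tapError`.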
- is related to: SERVER-88289 Remove manual check that skips retrying requests with startTransaction in ARS (Closed)