- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Replication
- ALL
- 200
BF-40166 saw the following sequence of events:
- rs0:n0 is acting as a router. It receives a start txn request and keeps retrying it, adding rs1 as a participant each time with snapshot readConcern set at cluster time {{ {"t":1761295748,"i":107} }}.
- The continuousStepdown hook force kills the primary on rs0 (n0), and n1 is elected as the new primary.
- When rs0:n0 is force killed, it does not populate the full error response here, which would have appended the transaction participants to the response sent to mongos. As a result, rs1 is not cleared/aborted and its snapshot read timestamp is still {{ {"t":1761295748,"i":107} }}.
- mongos0 retries the txn, now on rs0:n1, and hits this retry function due to a stale config error (we retry on transient config errors). It is no longer aware of rs1 as a pending participant, so it does not clear rs1, and it resets the snapshot read time to {{ {"$timestamp":{"t":1761295748,"i":195}} }}.
- It then tries to add rs1 as a participant again and attaches the new {{ {"$timestamp":{"t":1761295748,"i":195}} }} value. This conflicts with the stashed txnResource snapshot read time on rs1, which is still {{ {"t":1761295748,"i":107} }}. This is a non-retryable error, so we abort the txn and fail the test.
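The sequence above can be replayed with a small toy model. This is only a sketch and not server code: the `Shard` class, `SnapshotConflict`, and `replay` are hypothetical names, and the model assumes the retry reuses the same transaction so that the stashed read timestamp on rs1 is compared against the incoming one.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple

Timestamp = Tuple[int, int]  # (t, i) pair, as in {"t": ..., "i": ...}


class SnapshotConflict(Exception):
    """Hypothetical stand-in for the non-retryable error returned by rs1."""


@dataclass
class Shard:
    # Snapshot read timestamp stashed with the open transaction, if any.
    stashed_read_ts: Optional[Timestamp] = None

    def add_as_participant(self, read_ts: Timestamp) -> None:
        if self.stashed_read_ts is not None and self.stashed_read_ts != read_ts:
            raise SnapshotConflict(
                f"stashed {self.stashed_read_ts}, request carried {read_ts}"
            )
        self.stashed_read_ts = read_ts

    def abort(self) -> None:
        self.stashed_read_ts = None


def replay() -> str:
    shards: Dict[str, Shard] = {"rs1": Shard()}
    old_ts, new_ts = (1761295748, 107), (1761295748, 195)

    # rs0:n0 (as router) adds rs1 at the old snapshot time; retries reuse it.
    shards["rs1"].add_as_participant(old_ts)
    shards["rs1"].add_as_participant(old_ts)

    # rs0:n0 is force killed; its error response lists no participants.
    participants_known_to_mongos: Set[str] = set()

    # mongos0 retries: it aborts only the participants it knows about ...
    for name in participants_known_to_mongos:
        shards[name].abort()

    # ... and the retry on rs0:n1 attaches the new snapshot time.
    try:
        shards["rs1"].add_as_participant(new_ts)
    except SnapshotConflict as exc:
        return f"non-retryable: {exc}"
    return "ok"
```

Calling `replay()` returns the non-retryable outcome because rs1 was never aborted, so its stashed timestamp still has `i` = 107 when the request with `i` = 195 arrives.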
We never ran into this previously because we weren't retrying at the mongos level when the request was a startTransaction, but that check was removed in SERVER-88289, so we need to account for this scenario. That ticket removed the check because we had improved session/txn invalidation so that a txn request could be retried on a new primary, but this case evades that because the force kill follows a different error-handling code path.
One possible solution is to append the pending participants to the error response sent to mongos on step down. I'm trying to repro this part of the failure so we can compare the error responses for step down with the errors that go through CheckoutSessionAndInvokeCommand::_tapError.
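A rough sketch of that idea, under assumptions: the `pendingParticipants` field name, the helper functions, and the dictionary-based shard state are all hypothetical and do not reflect the actual response format or router code. The `codeName` value is only an illustrative string. The point is that if the error response names the pending participants, the router can abort them before retrying with a new snapshot time.

```python
from typing import Dict, Iterable, List, Set, Tuple

Timestamp = Tuple[int, int]  # (t, i)


def build_error_response(code_name: str, pending: List[str]) -> dict:
    # Hypothetical shape: the step-down error also lists pending participants.
    return {"ok": 0, "codeName": code_name, "pendingParticipants": sorted(pending)}


def participants_to_abort(known: Set[str], error_response: dict) -> Set[str]:
    # Union of what the router already tracks and what the error reported.
    return known | set(error_response.get("pendingParticipants", []))


def retry_with_cleanup(
    stashed: Dict[str, Timestamp],
    known: Set[str],
    error_response: dict,
    new_ts: Timestamp,
    targets: Iterable[str],
) -> bool:
    """Return True if the retry succeeds, False on a snapshot conflict."""
    for shard in participants_to_abort(known, error_response):
        stashed.pop(shard, None)  # abort clears the stashed read timestamp
    for shard in targets:
        existing = stashed.get(shard)
        if existing is not None and existing != new_ts:
            return False
        stashed[shard] = new_ts
    return True


old_ts, new_ts = (1761295748, 107), (1761295748, 195)

# Error response without participants: rs1 keeps the old timestamp and conflicts.
without_fix = retry_with_cleanup({"rs1": old_ts}, set(), {"ok": 0}, new_ts, ["rs1"])

# Error response with participants: rs1 is aborted first, so the retry succeeds.
resp = build_error_response("InterruptedDueToReplStateChange", ["rs1"])
with_fix = retry_with_cleanup({"rs1": old_ts}, set(), resp, new_ts, ["rs1"])
```

In this model `without_fix` is `False` and `with_fix` is `True`; whether the real step-down path can populate such a field is exactly what the repro is meant to establish.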
- is related to: SERVER-88289 Remove manual check that skips retrying requests with startTransaction in ARS (Closed)