Shards acting as routers do not append transaction participants to error response when primary is force killed

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None

      BF-40166 saw the following sequence of events (a simplified code sketch follows the list):

      1. rs0:n0 is acting as a router: it gets a start txn request, retries it, and adds rs1 as a participant every time, with snapshot readConcern set at cluster time {{ {"t":1761295748,"i":107} }}.

      2. The continuousStepdown hook force kills the primary on rs0 (n0), and n1 is elected as the new primary.
      3. When rs0:n0 is force killed, it does not populate the full error response here, which would have appended the transaction participants to the response to mongos, so rs1 is not cleared/aborted and its snapshot read timestamp is still {{ {"t":1761295748,"i":107} }}.

      4. mongos0 now retries the txn, this time on rs0:n1, and it hits this retry function due to a stale config error (we retry on transient config errors). n1 is not aware of rs1 as a pending participant, so it doesn't clear it, and it resets the snapshot read time to {{ {"$timestamp":{"t":1761295748,"i":195}} }}.

      5. Now it tries to add rs1 as a participant again, and attaches the new {{ {"$timestamp":{"t":1761295748,"i":195}} }} value. This then conflicts with the stashed txnResource snapshot read time on rs1, which is still {{ {"t":1761295748,"i":107} }}, and this is a non-retryable error, so we abort the txn and fail the test.
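
      To make this concrete, the following is a minimal, self-contained C++ simulation of the sequence above. Every name in it (Shard, RouterNode, addParticipant, and so on) is a hypothetical stand-in rather than a real server type; it only models why losing the router's in-memory participant list on a force kill ends in a snapshot read time conflict.

{code:cpp}
// Hypothetical model of the BF-40166 sequence; not real server code.
#include <cstdint>
#include <iostream>
#include <map>
#include <set>
#include <stdexcept>

struct Timestamp {
    uint32_t t;
    uint32_t i;
    bool operator==(const Timestamp& o) const { return t == o.t && i == o.i; }
};

// A participant shard: stashed transaction resources pin the snapshot read
// timestamp for the life of the transaction.
struct Shard {
    std::map<int64_t, Timestamp> stashedReadTimestamp;  // txnNumber -> atClusterTime

    void addParticipant(int64_t txnNumber, Timestamp atClusterTime) {
        auto it = stashedReadTimestamp.find(txnNumber);
        if (it != stashedReadTimestamp.end() && !(it->second == atClusterTime)) {
            throw std::runtime_error(
                "snapshot read time conflicts with stashed txnResources (non-retryable)");
        }
        stashedReadTimestamp.emplace(txnNumber, atClusterTime);
    }

    void abortTxn(int64_t txnNumber) { stashedReadTimestamp.erase(txnNumber); }
};

// A shard node acting as a router: its pending participants live only in
// memory, so a force kill loses them unless the error response echoes them.
struct RouterNode {
    std::set<Shard*> pendingParticipants;
};

int main() {
    Shard rs1;
    const int64_t txnNumber = 1;

    // Step 1: rs0:n0 routes the txn, adding rs1 at {t:1761295748, i:107}.
    RouterNode n0;
    rs1.addParticipant(txnNumber, Timestamp{1761295748, 107});
    n0.pendingParticipants.insert(&rs1);

    // Steps 2-3: n0 is force killed. The error response to mongos carries no
    // participant list, so nothing calls rs1.abortTxn(txnNumber); rs1 keeps
    // its stashed read time of i:107.

    // Step 4: mongos retries on the new primary n1, which has no record of
    // rs1 and picks a fresh snapshot read time {t:1761295748, i:195}.
    RouterNode n1;

    // Step 5: n1 re-adds rs1 with the new read time, which conflicts with
    // the stashed i:107 value; the txn aborts and the test fails.
    try {
        n1.pendingParticipants.insert(&rs1);
        rs1.addParticipant(txnNumber, Timestamp{1761295748, 195});
    } catch (const std::exception& e) {
        std::cout << "txn aborted: " << e.what() << "\n";
    }
    return 0;
}
{code}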

      We never ran into this previously because we weren't retrying at the mongos level if the statement was a startTransaction, but since that check was removed in SERVER-88289 we will need to account for this scenario. SERVER-88289 removed the check because session/txn invalidation was improved to the point that a txn request could safely be retried on a new primary, but this case evades that cleanup because the force kill follows a different error handling code path. The sketch below illustrates the gate in question.
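
      This is a hypothetical illustration only; these are not the real server symbols, just the shape of the pre- and post-SERVER-88289 behavior as described above.

{code:cpp}
#include <iostream>

// Hypothetical mongos-level retry gate for transient txn errors.
bool shouldRetryTransientTxnError(bool isStartTransaction) {
    // Pre-SERVER-88289: never retry the statement that starts the txn,
    // which incidentally masked the stale-participant problem:
    // return !isStartTransaction;

    // Post-SERVER-88289: always retry, relying on session/txn invalidation
    // to have cleared old participants -- an assumption a force kill
    // violates because it takes a different error handling path.
    (void)isStartTransaction;
    return true;
}

int main() {
    // A retried startTransaction is now allowed through the gate.
    std::cout << shouldRetryTransientTxnError(true) << "\n";  // prints 1
}
{code}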

      One possible solution could be appending the pending participants to the error response to mongos on stepdown. I'm trying to repro this part of the failure so we can compare the error responses for stepdown against the errors that go through CheckoutSessionAndInvokeCommand::_tapError. A rough sketch of the shape that fix could take follows.
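
      In this sketch, all names are made up for illustration, including makeStepdownErrorResponse and the additionalParticipants field:

{code:cpp}
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical error response that echoes the router's pending participants
// back to mongos so it can abort them before retrying.
struct ErrorResponse {
    int code;
    std::string errmsg;
    std::vector<std::string> additionalParticipants;
};

ErrorResponse makeStepdownErrorResponse(std::vector<std::string> pendingParticipants) {
    ErrorResponse resp;
    resp.code = 11602;  // InterruptedDueToReplStateChange
    resp.errmsg = "operation was interrupted";
    // The proposed change: append the in-memory participant list here so it
    // survives the force kill's error path.
    resp.additionalParticipants = std::move(pendingParticipants);
    return resp;
}

int main() {
    // In the BF-40166 sequence, n0 would report rs1 here, letting mongos
    // abort it (clearing the stashed i:107 read time) before the retry
    // picks a new atClusterTime.
    ErrorResponse resp = makeStepdownErrorResponse({"rs1"});
    for (const auto& shardId : resp.additionalParticipants)
        std::cout << "abort participant: " << shardId << "\n";
}
{code}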

            Assignee:
            Unassigned
            Reporter:
            Ruchitha Rajaghatta
            Votes:
            0
            Watchers:
            2