-
Type: Bug
-
Resolution: Unresolved
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: None
-
None
-
Cluster Scalability
-
ALL
-
Cluster Scalability Priorities
-
0
The MultiUpdateCoordinator uses a session from the InternalSessionPool to perform cluster updates. To make a MultiUpdateCoordinator use the same session for its entire lifetime, after acquiring a session from the InternalSessionPool, the lsid and txnNumber for the sesion are persisted to in the coordinator state document. When the MultiUpdateCoordinator transitions to state "done", the session is released to the InternalSessionPool. This is done whether or not the operation succeeds, and whether or not the current primary is the one that the session was acquired from. This means that the session can be released twice in the following case.
Consider a primary shard with nodes: node0, node1 and node2.
- Term T, node0 is the primary.
- We run a multi-update command. The MultiUpdateCoordinator runs on node0. It acquires a Session from the InternalSessionPool on node0 and gets back a Session with some lsid and txnNumber=5. The coordinator writes down the lsid and txnNumber in its state doc. After doing that, the MultiUpdateCoordinator performs the multi-update and then starts transitioning from state "success" to state "done". It deletes the state doc (using the default writeConcern,i.e. w: 1), updates the in-memory state to "done", then it waits for majority using this untilMajorityCommitOr helper which is configured to automatically return when there is a stepdown event. Let's say while waiting for majority, node0 steps down. Then, in onCompletion(), since the state is "done" and _shouldReleaseSession doesn't care about the status of the chain, it still releases the Session back to InternalSessionPool which increments its txnNumber to 6.
- Term T+1, node1 is the primary.
- The delete gets rolled back on node0.
- The MultiUpdateCoordinator resumes the multi-update from term T on node1 from the state "success" since the state doc is still. It starts transitioning state "done" again. Then, in onCompletion(), it releases the Session again but to the InternalSessionPool on node1, which also increments the txnNumber to 6.
- We run another multi-update command. The MultiUpdateCoordinator acquires a Session from the InternalSessionPool and gets back the same Session. The MultiUpdateCoordinator runs to completion and releases the Session back to the InternalSessionPool which increments its txnNumber to 7.
So at this point this session exists on both node0 and node1 but on node0 it has txnNumber 6 and on node1 it has txnNumber 7. This is harmless to the next MultiUpdateCoordinator since it doesn't use the txnNumber for anything. However, it can cause issues for issues for users of InternalSessionPool's system sessions that use those sessions run transactions and retryable writes, namely the ShardingDDLCoordinator, ConfigsvrCoordinator, and SplitPrepareSessionManager (in transaction oplog application). Here is an example of how the issues can manifest for the ShardingDDLCoordinator:
- Term T+1 (continued from above):
- We run 99 more multi-update commands. They all end up using the same Session so the txnNumber gets incremented 99 more times.
- We run a DDL command. The ShardingDDLCoordinator runs on node1. It acquires a Session from the InternalSessionPool on node1 and gets back this very same Session (i.e. txnNumber=106). The coordinator uses this Session to run retryable write. It runs to completion successfully and releases to Session to the InternalSessionPool on node1, which then incrementing txnNumber to 107.
- Term T+2, node0 is the primary again.
- We run another DDL command. The ShardingDDLCoordinator runs on node0. It acquires a Session from the InternalSessionPool on node0 and gets back this very same Session (i.e. txnNumber=6). The coordinator uses the Session to run a retryable write. The retryable writes fails with TransactionTooOld in the TransactionParticipant since the last used txnNumber in this session is 106. We can retry again. But if there is no concurrent DDL or multi-update that would take this one session out of the InternalSessionPool, we would get back this Session and get TransactionTooOld 100 more times.
- We run another DDL command. The ShardingDDLCoordinator runs on node0. It acquires a Session from the InternalSessionPool on node0 and gets back this very same Session (i.e. txnNumber=6). The coordinator uses the Session to run a retryable write. The retryable writes fails with TransactionTooOld in the TransactionParticipant since the last used txnNumber in this session is 106. We can retry again. But if there is no concurrent DDL or multi-update that would take this one session out of the InternalSessionPool, we would get back this Session and get TransactionTooOld 100 more times.
- is related to
-
SERVER-92658 Investigate whether step downs may lead to an inconsistent state of internal sessions within the pool
- Closed