- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: None
- Cluster Scalability
- Sharding EMEA 2023-02-06, Sharding EMEA 2023-04-17, Sharding EMEA 2023-05-01, Sharding EMEA 2023-05-15, Sharding EMEA 2023-05-29, Sharding EMEA 2023-06-12
- 3
During the mongos shutdown procedure, after waiting for the quiesce period, we attempt to abort all outstanding transactions by sending abortTransaction to the relevant shards. This logic was added in SERVER-39692 and does not provide strong guarantees. In fact:
- After 15 seconds we will just give up.
- If an error occurs while forwarding the abortTransaction command, we also give up.
So I would describe this as a "best effort" approach to aborting transactions. On the other hand, we use Shard::RetryPolicy::kIdempotent. As a consequence, if some shard is unreachable (crashed or already shut down), the mongos keeps retrying the abortTransaction command for 15 seconds, slowing down the mongos shutdown procedure.
So in practice, if we don't want to hit this 15-second delay when shutting down a cluster, we must always ensure the mongos is shut down before the shards.
My proposal is to make this logic truly best effort and use runFireAndForgetCommand to send the abortTransaction command. This would guarantee that an unreachable node does not delay mongos shutdown.
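To make the difference concrete, here is a toy Python model of the two behaviours described above. It is only a sketch, not the mongos implementation: `FakeShard`, `FakeClock`, `abort_with_retries`, `abort_fire_and_forget` and the 1-second retry interval are all hypothetical names and values chosen for illustration, and time is simulated rather than actually slept.

```python
from dataclasses import dataclass, field


@dataclass
class FakeShard:
    """Hypothetical stand-in for a shard; not a real MongoDB class."""
    name: str
    reachable: bool
    received: list = field(default_factory=list)

    def send(self, command):
        if not self.reachable:
            raise ConnectionError(f"{self.name} is unreachable")
        self.received.append(command)


class FakeClock:
    """Simulated clock so the example runs instantly."""
    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


def abort_with_retries(shards, clock, deadline_secs=15.0, retry_interval_secs=1.0):
    """Models the retrying behaviour: keep resending to unreachable
    shards until the deadline. Returns simulated seconds spent."""
    start = clock.now
    pending = list(shards)
    while pending and clock.now - start < deadline_secs:
        still_pending = []
        for shard in pending:
            try:
                shard.send("abortTransaction")
            except ConnectionError:
                still_pending.append(shard)  # will be retried
        pending = still_pending
        if pending:
            clock.advance(retry_interval_secs)  # assumed retry interval
    return clock.now - start


def abort_fire_and_forget(shards, clock):
    """Models the proposal: one attempt per shard, failures ignored.
    Returns simulated seconds spent."""
    start = clock.now
    for shard in shards:
        try:
            shard.send("abortTransaction")
        except ConnectionError:
            pass  # best effort: do not retry, do not wait
    return clock.now - start
```

In this model, with one reachable and one unreachable shard, the retrying variant spends the whole 15 simulated seconds, while the fire-and-forget variant returns immediately. In both cases the reachable shard still receives the abort.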
- related to: SERVER-73415 Parallelize python test fixture teardown (Closed)