[SERVER-73338] Make aborting transactions on mongos shutdown best effort Created: 26/Jan/23 Updated: 26/Oct/23 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Tommaso Tocci | Assignee: | Backlog - Catalog and Routing |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | oldshardingemea, shardingemea-qw | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: | Catalog and Routing |
| Sprint: | Sharding EMEA 2023-02-06, Sharding EMEA 2023-04-17, Sharding EMEA 2023-05-01, Sharding EMEA 2023-05-15, Sharding EMEA 2023-05-29, Sharding EMEA 2023-06-12 | ||||||||
| Participants: | |||||||||
| Story Points: | 3 | ||||||||
| Description |
|
During the mongos shutdown procedure, after waiting for the quiesce period, we attempt to abort all outstanding transactions by sending the abortTransaction command to the relevant shards. This logic was added in
I would describe this as a "best effort" approach to aborting transactions, yet we use Shard::RetryPolicy::kIdempotent. As a consequence, if a shard is unreachable (crashed or already shut down), mongos keeps retrying the abortTransaction command for 15 seconds, slowing down its shutdown. In practice, to avoid this 15-second delay when shutting down a cluster, we must always ensure that mongos is shut down before the shards. My proposal is to make this logic truly best effort by using runFireAndForgetCommand to send the abortTransaction command. This guarantees that an unreachable node won't delay mongos shutdown. |
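The trade-off described above can be illustrated with a small simulation. This is only a sketch, written in Python rather than the server's C++, and it uses a simulated clock instead of real network calls. `FakeShard`, `abort_with_retry`, `abort_fire_and_forget` and `shutdown_delay` are hypothetical names invented for illustration and are not MongoDB APIs. The 15-second budget comes from the description; the 0.5-second retry interval and the assumption that aborts are sent to all shards concurrently are guesses.

```python
from dataclasses import dataclass


@dataclass
class FakeShard:
    """Hypothetical stand-in for a shard; not a MongoDB type."""
    name: str
    reachable: bool
    aborts_received: int = 0

    def abort_transaction(self) -> bool:
        """Return True if the simulated command reached the shard."""
        if self.reachable:
            self.aborts_received += 1
        return self.reachable


def abort_with_retry(shard, budget_secs=15.0, retry_interval_secs=0.5):
    """Retry until the shard answers or the budget is spent.

    Returns the simulated number of seconds spent on this shard.
    The retry interval is an assumption made for this sketch.
    """
    elapsed = 0.0
    while not shard.abort_transaction():
        if elapsed >= budget_secs:
            break
        elapsed += retry_interval_secs  # simulated wait, no real sleep
    return elapsed


def abort_fire_and_forget(shard):
    """Send once and ignore the outcome, so the caller never waits."""
    shard.abort_transaction()  # result intentionally discarded
    return 0.0


def shutdown_delay(shards, send):
    """Delay added to shutdown, assuming aborts are sent concurrently."""
    return max((send(shard) for shard in shards), default=0.0)


if __name__ == "__main__":
    shards = [FakeShard("shardA", True), FakeShard("shardB", False)]
    print("retrying policy:", shutdown_delay(shards, abort_with_retry))
    print("fire and forget:", shutdown_delay(shards, abort_fire_and_forget))
```

In this model a single unreachable shard makes the retrying policy consume the whole budget, while fire-and-forget adds no delay. The cost is that nothing confirms the abort was delivered to any shard.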
| Comments |
| Comment by Max Hirschhorn [ 26/Jan/23 ] |
I do see some value in mongos retrying the abortTransaction command and waiting 15 seconds at shutdown on a production server. If a transaction is left idle on the shards, the documents it modified can be unavailable for 60 seconds (transactionLifetimeLimitSeconds). Since the motivation for this change is to quickly shut down a sharded cluster used for local / Evergreen testing, I wonder whether we should approach the 15-second deadline similar to what was done in |