[SERVER-77667] Prevent mongos from starting new transactions at shutdown Created: 31/May/23 Updated: 16/Jan/24 Resolved: 10/Jan/24 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 7.2.1, 7.3.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Max Hirschhorn | Assignee: | Wenqin Ye |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sharding-nyc-subteam3 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Assigned Teams: |
Cluster Scalability
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Backport Requested: |
v7.2
|
||||||||
| Sprint: | Cluster Scalability 2024-1-8, Cluster Scalability 2024-1-22 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 135 | ||||||||
| Story Points: | 3 | ||||||||
| Description |
|
At shutdown, the mongos process makes a best-effort attempt to abort any transactions it may have started. This frees up transaction resources more quickly, because the client/driver must retry any in-progress transactions which haven't had their commit coordination handed off already. The retry is required because the transaction protocol does not support committing a multi-statement transaction through a different mongos from the mongos which originally ran the read/write operations. The different mongos can only be used to recover the original commit xor abort decision for the transaction.
The implicitlyAbortAllTransactions() function performs this best-effort attempt. However, it doesn't prevent TransactionRouter from being used by a not-yet-interrupted OperationContext and starting a new transaction on a shard. Ordinarily this wouldn't be an issue, because mongos shutting down is rare and the transaction would eventually be aborted on the shard after transactionLifetimeLimitSeconds (= 60 seconds by default). In testing, the transactionLifetimeLimitSeconds server parameter is set to 24 hours to catch cases where a transaction is unintentionally "leaked" by the system. While the system has liveness through the PeriodicThreadToAbortExpiredTransactions job, a stall would still be undesirable in production.
One place in testing where we've seen new transactions being started while the mongos process is shutting down is the ClusterServerParameterRefresher thread reading from the config server primary in a multi-statement transaction. The MODE_IX lock held on the config server primary prevents the testing infrastructure from running its data consistency checks before shutting down the config server replica set.
One idea to improve implicitlyAbortAllTransactions() is to set a flag on the SessionCatalog indicating process shutdown has begun. TransactionRouter instances which are obtained from the SessionCatalog can check whether this flag has been set and throw an InterruptedAtShutdown or equivalent error to prevent the mongos process from starting any new transactions. |
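The flag idea from the description can be sketched as follows. This is a minimal, standalone illustration and not the actual server code: SessionCatalogSketch, TransactionRouterSketch, markShutdownStarted(), beginTransaction() and the InterruptedAtShutdown exception type defined here are all hypothetical stand-ins, chosen only to show the check-before-start pattern.

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the server's InterruptedAtShutdown error.
struct InterruptedAtShutdown : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for the SessionCatalog: it only carries the shutdown flag.
class SessionCatalogSketch {
public:
    // Called once by the shutdown path, before the abort sweep runs.
    void markShutdownStarted() { _shutdownStarted.store(true); }
    bool isShutdownStarted() const { return _shutdownStarted.load(); }

private:
    std::atomic<bool> _shutdownStarted{false};
};

// Hypothetical stand-in for a TransactionRouter obtained from the catalog.
class TransactionRouterSketch {
public:
    explicit TransactionRouterSketch(const SessionCatalogSketch& catalog)
        : _catalog(catalog) {}

    // Refuses to start a new transaction once shutdown has begun.
    void beginTransaction(int txnNumber) {
        if (_catalog.isShutdownStarted()) {
            throw InterruptedAtShutdown("shutdown in progress; not starting a new transaction");
        }
        _activeTxnNumber = txnNumber;
    }

    int activeTxnNumber() const { return _activeTxnNumber; }

private:
    const SessionCatalogSketch& _catalog;
    int _activeTxnNumber = -1;  // -1 means no transaction has been started
};

// Usage: returns true if a transaction attempted after shutdown began is rejected.
bool newTransactionRejectedAfterShutdown() {
    SessionCatalogSketch catalog;
    TransactionRouterSketch router(catalog);

    router.beginTransaction(1);  // allowed: shutdown has not begun
    assert(router.activeTxnNumber() == 1);

    catalog.markShutdownStarted();
    try {
        router.beginTransaction(2);  // expected to throw
    } catch (const InterruptedAtShutdown&) {
        return true;
    }
    return false;
}
```

In this sketch the flag is set before any abort sweep, so a router that checks it after the sweep cannot start a transaction the sweep would miss. A real implementation would also have to handle an operation that passes the check just before the flag is set, which this sketch does not attempt.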
| Comments |
| Comment by Githook User [ 12/Jan/24 ] |
|
Author: {'name': 'Wenqin Ye', 'email': 'wenqin908@gmail.com', 'username': 'wenqinYe'} Message: |
| Comment by Githook User [ 09/Jan/24 ] |
|
Author: {'name': 'Wenqin', 'email': 'wenqinYe@users.noreply.github.com', 'username': 'wenqinYe'} Message: GitOrigin-RevId: 9d3e5e69b0623a89e357e2ed0967b9011ca8f74c |
| Comment by Max Hirschhorn [ 31/May/23 ] |
|
Relatedly, we should consider whether mongod shutdown would benefit from logic similar to implicitlyAbortAllTransactions() from mongos_main.cpp. I can imagine we would move this logic from mongos_main.cpp into something common between mongos and a router-mongod process as part of PM-635. Yet it may also be that, with the introduction of internal transactions and mongod already being able to act as a router, there would be benefit in earlier binVersions as well. CC antonio.fuschetto@mongodb.com |