[SERVER-66340] Improve distributed transaction commit locking behavior
Created: 10/May/22  Updated: 14/Dec/23  Resolved: 14/Dec/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.3.0-rc0

Type: Improvement
Priority: Major - P3
Reporter: Gregory Noma
Assignee: Josef Ahmad
Resolution: Fixed
Votes: 0
Labels: car-qw
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-79922 Investigate solutions to the fsync (l... Backlog
is depended on by SERVER-66342 Remove resourceIdFeatureCompatibility... Closed
Duplicate
is duplicated by SERVER-66341 Improve journal flusher locking behavior Closed
Related
related to SERVER-65821 Deadlock during setFCV when there are... Closed
related to SERVER-66341 Improve journal flusher locking behavior Closed
Assigned Teams:
Catalog and Routing
Backwards Compatibility: Fully Compatible
Sprint: CAR Team 2023-12-25
Participants:
Story Points: 3

 Description   

In a distributed transaction, when the coordinator writes its commit decision, it takes global/db/collection locks to do so while the participant is still holding the global lock. This can lead to deadlocks when combined with operations that take strong global locks, as seen in SERVER-65821. We should investigate ways to improve this behavior so that we can remove the workaround added in that ticket. One potential idea is to consolidate the participant and coordinator on the coordinating shard, so that they can make progress with a single set of resources.



 Comments   
Comment by Githook User [ 13/Dec/23 ]

Author: Josef Ahmad <josef.ahmad@mongodb.com> (josefahmad)

Message: SERVER-66340 Generalise the FeatureCompatibilityVersion lock into a MultiDocumentTransactionsBarrier

This barrier resolves a class of deadlocks involving an uncommitted transaction in the
prepared state and a global lock acquisition request in non-intent mode. Without this
barrier, this would result in a three-way deadlock between (A) a strong global lock
request which depends on (B) a prepared transaction which holds the global lock in intent
mode which depends on (C) the transaction coordinator persisting the commit/abort
decision, which depends on (A), thus establishing a circular dependency.
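The circular dependency described above can be illustrated with a small wait-for graph. This is only a sketch for readers of the ticket, not server code: the labels A, B and C follow the commit message, while `WITHOUT_BARRIER`, `WITH_BARRIER` and `find_cycle` are hypothetical names. The second graph assumes that, with the barrier in place, the coordinator's write no longer queues behind the strong lock request, which is one reading of the message rather than something it states in these terms.

```python
from typing import Dict, List, Optional

# Each key waits for the value it maps to. Labels follow the commit message;
# the graph encoding itself is illustrative.
WITHOUT_BARRIER: Dict[str, str] = {
    "A": "B",  # strong global lock request waits for the prepared transaction
    "B": "C",  # prepared transaction waits for the coordinator's decision
    "C": "A",  # coordinator's write queues behind the strong lock request
}

# Assumption: with the barrier, (A) still waits for prepared transactions to
# drain, but (C) is no longer blocked by (A), so the C -> A edge disappears.
WITH_BARRIER: Dict[str, str] = {
    "A": "B",
    "B": "C",
}


def find_cycle(waits_for: Dict[str, str], start: str) -> Optional[List[str]]:
    """Follow wait-for edges from start; return the cycle reached, or None."""
    position: Dict[str, int] = {}
    path: List[str] = []
    node: Optional[str] = start
    while node is not None and node not in position:
        position[node] = len(path)
        path.append(node)
        node = waits_for.get(node)
    if node is None:
        return None  # the chain ended, so nothing is deadlocked
    return path[position[node]:]


print(find_cycle(WITHOUT_BARRIER, "A"))  # ['A', 'B', 'C']
print(find_cycle(WITH_BARRIER, "A"))     # None
```

Removing any single edge breaks the cycle; the sketch assumes the barrier removes the last one.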

The FeatureCompatibilityVersion lock was initially introduced to address this issue
during FCV upgrades/downgrades, which acquire the global lock in MODE_S. This
patch generalises this lock into a MultiDocumentTransactionsBarrier to address any
non-intent global lock acquisitions, including fsyncLock.

The MultiDocumentTransactionsBarrier transparently drains prepared transactions
before acquiring the global lock in non-intent mode. It is implicitly acquired as part of a
global lock acquisition by operations processing transaction statements, and by operations
acquiring the global lock in non-intent mode. All other global lock requests skip this
lock. It is acquired in the same mode as the requested global lock mode.
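The acquisition rule in the paragraph above can be restated as a small decision function. This is a minimal sketch of the rule as worded in the commit message, not the server's implementation: `LockMode`, `is_intent` and `barrier_mode` are hypothetical names, and the parameter `processing_transaction_statement` is an assumed stand-in for "operations processing transaction statements".

```python
from enum import Enum
from typing import Optional


class LockMode(Enum):
    IS = "MODE_IS"  # intent shared
    IX = "MODE_IX"  # intent exclusive
    S = "MODE_S"    # shared (non-intent)
    X = "MODE_X"    # exclusive (non-intent)


def is_intent(mode: LockMode) -> bool:
    return mode in (LockMode.IS, LockMode.IX)


def barrier_mode(global_mode: LockMode,
                 processing_transaction_statement: bool) -> Optional[LockMode]:
    """Return the mode in which the barrier is taken, or None if it is skipped.

    Hypothetical helper: the barrier is taken by operations processing
    transaction statements and by non-intent global lock requests, in the
    same mode as the requested global lock. Everything else skips it.
    """
    if processing_transaction_statement or not is_intent(global_mode):
        return global_mode
    return None


print(barrier_mode(LockMode.S, False))   # LockMode.S
print(barrier_mode(LockMode.IX, True))   # LockMode.IX
print(barrier_mode(LockMode.IX, False))  # None
```

Under this reading, a MODE_S request such as the one made during FCV changes takes the barrier in MODE_S and therefore waits for transactions holding it in an intent mode, while an ordinary intent-mode request outside a transaction never touches the barrier.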

One (benign) behavioural change is that prepared transactions that commit while a global
lock request is enqueued will not be able to acknowledge their write concern to the client
until the global lock is rescinded.

GitOrigin-RevId: 6517ac7482c0aa6da26af1b530bd5755737d6d1e
Branch: master
https://github.com/mongodb/mongo/commit/1888cae7b06f7944437661794559470c26672177

Comment by Geert Bosch [ 25/Aug/22 ]

I chatted with Max, and we agree that this ticket is not ready for work as written, but rather indicates that a larger rethink/remodel is necessary before the current workaround can be removed. I'll put this on the storage execution backlog for now to see if there is a project that we can make this part of.

Comment by Max Hirschhorn [ 25/Aug/22 ]

geert.bosch@mongodb.com, I'd like to propose we close this ticket without making any changes and leave the new resourceIdFeatureCompatibilityVersion resource as-is. I think attempting to have the TransactionCoordinator and TransactionParticipant share their TxnResources (read: LockManager locks) is a challenging undertaking and isn't fully sufficient as an alternative for SERVER-65821. We would also need the WaitForMajorityService to receive the global IS lock from those TxnResources, and probably to share with other Clients that the system happens to interact with as well.

I feel like approaching the TransactionCoordinator differently ties into approaching the threading and Client models differently for the whole server.

Are you comfortable with leaving things as-is?

Generated at Thu Feb 08 06:05:09 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.