[SERVER-75285] Deadlock between ShardsvrCheckMetadataConsistencyParticipantCommand, prepared transactions, and stepdown Created: 24/Mar/23 Updated: 27/Oct/23 Resolved: 27/Mar/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Samyukta Lanka | Assignee: | Tommaso Tocci |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Assigned Teams: | Sharding EMEA |
| Operating System: | ALL |
| Linked BF Score: | 135 |
| Description |
|
ShardsvrCheckMetadataConsistencyParticipantCommand currently takes a DB lock (Edit: at the time this deadlock was found, the command took the DB lock in S mode). This can cause a deadlock with prepared transactions if a transaction holds the DB lock that checkMetadataConsistency is trying to acquire while committing that transaction is blocked on a stepdown (that is, the node cannot replicate the commitTransaction command until it finishes stepping down).

The order of events is:

A targeted fix would be to manually ensure that checkMetadataConsistency is killed by the stepdown thread, or to make sure it does not hold the RSTL. |
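The dependency chain described above can be pictured as a wait-for graph. The following is only an illustrative sketch, not MongoDB code: the node names and the `find_cycle` helper are hypothetical, and the edge from the stepdown thread to the command is an assumption inferred from the suggested fix (the command holding the RSTL), not something the ticket spells out.

```python
def find_cycle(wait_for):
    """Return the list of waiters forming a cycle, or None if there is none.

    `wait_for` maps each waiter to the single party it is blocked on.
    """
    for start in wait_for:
        path = [start]
        seen = {start}
        node = wait_for.get(start)
        while node is not None:
            if node == start:
                return path  # walked back to the starting waiter: deadlock
            if node in seen:
                break  # reached a cycle that does not include `start`
            path.append(node)
            seen.add(node)
            node = wait_for.get(node)
    return None


# Hypothetical labels for the three parties named in the description.
WAIT_FOR = {
    # The command wants the DB lock held by the prepared transaction.
    "checkMetadataConsistency": "prepared_txn",
    # The commit cannot replicate until the stepdown completes.
    "prepared_txn": "stepdown",
    # Assumption: the stepdown needs the RSTL that the command still holds.
    "stepdown": "checkMetadataConsistency",
}

# Either proposed fix (kill the command on stepdown, or have it not hold
# the RSTL) removes the last edge, which breaks the cycle.
FIXED = {k: v for k, v in WAIT_FOR.items() if k != "stepdown"}

if __name__ == "__main__":
    print(find_cycle(WAIT_FOR))
    print(find_cycle(FIXED))
```

Under these assumptions, the first graph contains a cycle through all three parties, while the second has none.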
| Comments |
| Comment by Samyukta Lanka [ 27/Mar/23 ] |
Based on Jordi's point, I think the amendment is that reads taking DB locks in S mode should instead be lock-free, or we do SERVER-75288. |
| Comment by Samyukta Lanka [ 27/Mar/23 ] |
|
That's a great point. I think jordi.serra-torrens@mongodb.com is correct that this can't happen anymore, because the IS lock won't conflict with prepared transactions. |
| Comment by Jordi Serra Torrens [ 27/Mar/23 ] |
|
I'd like to point out that on BF-28038, ShardsvrCheckMetadataConsistencyParticipantCommand was trying to acquire the DB lock in MODE_S (rather than MODE_IS). The change from S to IS happened as part of a separate change. I think that's important, because I wouldn't expect ShardsvrCheckMetadataConsistencyParticipantCommand's MODE_IS acquisition to be blocked by the prepared txn (MODE_IX). MODE_S, however, would have been blocked. |
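The S-versus-IS distinction above follows from the standard multi-granularity lock compatibility rules. Here is a minimal sketch of those rules; the `Mode` enum and `conflicts` helper are hypothetical names for illustration and are not the server's lock manager API.

```python
from enum import Enum


class Mode(Enum):
    IS = "IS"  # intent shared
    IX = "IX"  # intent exclusive
    S = "S"    # shared
    X = "X"    # exclusive


# For each requested mode, the set of already-held modes that block it
# (standard multi-granularity locking conflict table).
CONFLICTS = {
    Mode.IS: {Mode.X},
    Mode.IX: {Mode.S, Mode.X},
    Mode.S: {Mode.IX, Mode.X},
    Mode.X: {Mode.IS, Mode.IX, Mode.S, Mode.X},
}


def conflicts(requested, held):
    """Return True if a request in `requested` mode must wait on `held`."""
    return held in CONFLICTS[requested]


if __name__ == "__main__":
    # The prepared transaction holds the DB lock in IX.
    print(conflicts(Mode.S, Mode.IX))   # the old MODE_S acquisition
    print(conflicts(Mode.IS, Mode.IX))  # the current MODE_IS acquisition
```

With the prepared transaction holding IX, a request in S has to wait, whereas a request in IS is granted immediately, which matches the reasoning in this comment.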
| Comment by Kaloian Manassiev [ 27/Mar/23 ] |
|
samy.lanka@mongodb.com, from your explanation it seems like read operations must always be lock-free - am I understanding it correctly? My reading of it is that read operations shouldn't be holding the RSTL lock while waiting for IS locks further down the hierarchy. But that would only be possible if we had some snapshotting mechanism to ensure the read will access a consistent state, i.e. what is present in lock-free reads. Do we have an example of a read operation which must run with locks held? |