[SERVER-49508] Step up deadlock between migration recovery and prepared transaction Created: 14/Jul/20  Updated: 29/Oct/23  Resolved: 24/Jul/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.7.0

Type: Bug
Priority: Major - P3
Reporter: Jack Mulrow
Assignee: Tommaso Tocci
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File stepup_deadlock_repro.js    
Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2020-07-27
Participants:
Linked BF Score: 35

 Description   

As part of stepping up to primary, a node recovers migrations that were active on the previous primary. This involves synchronously scanning every document in config.migrationCoordinators and clearing the metadata for each corresponding namespace, which requires taking a collection X lock. This happens after the primary restores the locks of active prepared transactions (through MongoDSessionCatalog::onStepUp()). If a prepared transaction holds a lock on a namespace that had an active migration, the collection X lock acquisition in migration recovery blocks, which prevents the step up from succeeding.
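
The ordering problem described above can be illustrated with a small toy model. This is only a sketch, not the server's code: `simulate_step_up` and its parameters are hypothetical names, the collection lock is modeled as a plain exclusive `threading.Lock` (no intent modes), and the prepared transaction is assumed to hold its lock for the whole duration of step up. A timeout stands in for what would otherwise be an indefinite hang.

```python
import threading


def simulate_step_up(restore_prepared_first=True,
                     recovery_takes_x_lock=True,
                     timeout=0.2):
    """Return True if the simulated step up completes, False if it would hang.

    Hypothetical toy model: one namespace, one exclusive lock.
    """
    collection_lock = threading.Lock()  # stand-in for the collection lock

    def restore_prepared_transaction_locks():
        # The prepared transaction reacquires its lock and keeps it;
        # in this model it is never released during step up.
        collection_lock.acquire()

    def recover_migrations():
        if not recovery_takes_x_lock:
            # Recovery that clears metadata without the X lock never waits.
            return True
        # Try to take the "X lock"; give up after the timeout instead of
        # blocking forever, so the example terminates.
        if not collection_lock.acquire(timeout=timeout):
            return False
        collection_lock.release()
        return True

    if restore_prepared_first:
        # Order described in this ticket: locks restored, then recovery.
        restore_prepared_transaction_locks()
        return recover_migrations()

    # Alternative order: recovery runs before the locks are restored.
    completed = recover_migrations()
    restore_prepared_transaction_locks()
    return completed
```

Under these assumptions, `simulate_step_up()` returns `False` (the hang), while `simulate_step_up(recovery_takes_x_lock=False)` and `simulate_step_up(restore_prepared_first=False)` both return `True`. Those two variants correspond to the two ideas raised in the comments: dropping the X lock acquisition, or moving migration recovery ahead of the lock restoration.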



 Comments   
Comment by Githook User [ 24/Jul/20 ]

Author:

{'name': 'Tommaso Tocci', 'email': 'tommaso.tocci@mongodb.com', 'username': 'toto-dev'}

Message: SERVER-49508 Step up deadlock between migration recovery and prepared transaction
Branch: master
https://github.com/mongodb/mongo/commit/c857e1dcb22bb56af4302a9f0997106996ad1f3f

Comment by Jack Mulrow [ 21/Jul/20 ]

kaloian.manassiev, yeah the problem is the step up thread taking that X lock. Getting rid of the acquisition should fix this, or we could possibly move migration recovery to before step up restores the locks for prepared transactions, but I'm not sure what other problems might fall out of that.

Comment by Kaloian Manassiev [ 21/Jul/20 ]

Just to be clear – the issue is the X-lock which is being taken to clear the filtering info, right? Because I don't think we need to take it for clearing, it must be a vestige.

Comment by Esha Maharishi (Inactive) [ 20/Jul/20 ]

Sounds good, reassigning to tommaso.tocci. Feel free to re-triage as you think makes sense.

Comment by Jack Mulrow [ 20/Jul/20 ]

This particular issue should only affect master. In 4.4, the step up thread triggers the entire migration recovery to run on the migration executor, whereas in master the step up thread takes a collection lock itself before running the rest of recovery on the migration executor. I also can't reproduce the hang on 4.4 with the test I attached.
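
A rough sketch of why handing recovery off to an executor keeps the step up thread from hanging. This is a toy model under stated assumptions, not the 4.4 code: `step_up_with_executor` is a hypothetical name, the collection lock is a plain exclusive `threading.Lock`, and recovery is made to need a conflicting lock purely to show that even then only the executor thread waits.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def step_up_with_executor():
    """Toy model: the step up thread schedules recovery and does not wait."""
    collection_lock = threading.Lock()
    # Prepared transaction lock restored during step up.
    collection_lock.acquire()

    def recover_migrations():
        # Runs on the executor thread; waits here while the
        # prepared transaction still holds the lock.
        with collection_lock:
            return "recovered"

    with ThreadPoolExecutor(max_workers=1) as migration_executor:
        future = migration_executor.submit(recover_migrations)

        # The step up thread carries on without waiting on the future.
        step_up_done = True
        recovery_pending = not future.done()

        # Later, once the node is primary, the transaction commits or
        # aborts and releases its lock, unblocking recovery.
        collection_lock.release()
        return step_up_done, recovery_pending, future.result(timeout=5)
```

In this model the call returns `(True, True, "recovered")`: step up finishes while recovery is still pending, and recovery completes once the lock is released.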

Comment by Esha Maharishi (Inactive) [ 20/Jul/20 ]

jack.mulrow, could you confirm this issue only affects master, not 4.4?

On 4.4, I think migration recovery only acquires a collection IS lock.

If so, this doesn't need to block 4.4 GA and I'll pass this over to tommaso.tocci.

 
