[SERVER-49508] Step up deadlock between migration recovery and prepared transaction Created: 14/Jul/20 Updated: 29/Oct/23 Resolved: 24/Jul/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 4.7.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jack Mulrow | Assignee: | Tommaso Tocci |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Backwards Compatibility: | Fully Compatible | ||||
| Operating System: | ALL | ||||
| Sprint: | Sharding 2020-07-27 | ||||
| Participants: | |||||
| Linked BF Score: | 35 | ||||
| Description |
|
As part of stepping up to primary, a node recovers migrations that were active on the previous primary. This involves synchronously scanning every document in config.migrationCoordinators and clearing the metadata for each corresponding namespace, which requires taking a collection X lock. This happens after the primary restores the locks of active prepared transactions (through MongoDSessionCatalog::onStepUp()), so if a prepared transaction holds a lock on a namespace that had an active migration, the collection X lock acquisition in migration recovery blocks and the step up cannot complete. |
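
The ordering problem described above can be illustrated with a small toy model. This is only a sketch, not MongoDB code: `ToyLockManager`, `step_up`, and the owner names are hypothetical, and the model assumes that a restored prepared transaction holds a collection IX lock (the ticket only says it holds "a lock"). Instead of blocking forever, the toy lock manager reports which namespaces the step-up thread would block on.

```python
from collections import defaultdict

# Standard multi-granularity lock compatibility: for each requested mode,
# the set of already-held modes it can coexist with.
COMPATIBLE = {
    "IS": {"IS", "IX", "S"},
    "IX": {"IS", "IX"},
    "S": {"IS", "S"},
    "X": set(),  # X conflicts with every other holder
}


class ToyLockManager:
    """Hypothetical, non-blocking stand-in for a collection lock table."""

    def __init__(self):
        self._held = defaultdict(list)  # namespace -> [(owner, mode)]

    def try_lock(self, owner, namespace, mode):
        """Grant the lock and return True, or return False on a conflict."""
        for other, held_mode in self._held[namespace]:
            if other != owner and held_mode not in COMPATIBLE[mode]:
                return False
        self._held[namespace].append((owner, mode))
        return True


def step_up(locks, prepared_txn_namespaces, migrating_namespaces):
    """Model the step-up order from the description.

    Returns the namespaces on which migration recovery would block.
    """
    # Step 1: restore the locks of active prepared transactions
    # (assumed here to be collection IX locks).
    for i, ns in enumerate(prepared_txn_namespaces):
        locks.try_lock(f"prepared-txn-{i}", ns, "IX")

    # Step 2: migration recovery takes a collection X lock per namespace,
    # on the step-up thread itself.
    blocked = []
    for ns in migrating_namespaces:
        if not locks.try_lock("step-up", ns, "X"):
            blocked.append(ns)
    return blocked


blocked = step_up(
    ToyLockManager(),
    prepared_txn_namespaces=["test.orders"],
    migrating_namespaces=["test.orders", "test.users"],
)
print(blocked)  # ['test.orders']
```

In the real server the X lock request would wait rather than fail, and since the prepared transaction cannot commit or abort until step up finishes, that wait never ends. Swapping steps 1 and 2, or skipping the X lock in step 2, leaves `blocked` empty in this model; these correspond to the two options discussed in the comments.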
| Comments |
| Comment by Githook User [ 24/Jul/20 ] |
|
Author: Tommaso Tocci (username: toto-dev, email: tommaso.tocci@mongodb.com) Message: |
| Comment by Jack Mulrow [ 21/Jul/20 ] |
|
kaloian.manassiev, yeah, the problem is the step up thread taking that X lock. Getting rid of the acquisition should fix this, or we could possibly move migration recovery before step up restores the locks for prepared transactions, but I'm not sure what other problems might fall out of that. |
| Comment by Kaloian Manassiev [ 21/Jul/20 ] |
|
Just to be clear – the issue is the X lock which is being taken to clear the filtering info, right? I don't think we need to take it for clearing; it must be a vestige. |
| Comment by Esha Maharishi (Inactive) [ 20/Jul/20 ] |
|
Sounds good, reassigning to tommaso.tocci. Feel free to re-triage as you think makes sense. |
| Comment by Jack Mulrow [ 20/Jul/20 ] |
|
This particular issue should only affect master, since the step up thread in 4.4 triggers the entire migration recovery to run on the migration executor, whereas in master the step up thread takes a collection lock itself before running the rest of recovery on the migration executor. I also can't reproduce the hang on 4.4 with the test I attached. |
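
The difference between the two branches can be sketched with a toy model. This is an illustration under stated assumptions, not the server's code: `recover_migration`, `step_up_inline`, and `step_up_async` are hypothetical names, a plain `threading.Lock` stands in for the collection lock, and a short timeout stands in for what would be an indefinite wait.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def recover_migration(collection_lock, timeout):
    """Stand-in for recovery work that needs the collection lock.

    Returns True if the lock was obtained within the timeout.
    """
    acquired = collection_lock.acquire(timeout=timeout)
    if acquired:
        collection_lock.release()
    return acquired


def step_up_inline(collection_lock, timeout=0.05):
    """Master-style: the step-up thread takes the lock itself.

    Step up only completes (returns True) if the lock is available.
    """
    return recover_migration(collection_lock, timeout)


def step_up_async(collection_lock, executor, timeout=0.05):
    """4.4-style: the whole recovery is handed to an executor.

    Step up completes immediately; recovery finishes (or waits) later.
    """
    future = executor.submit(recover_migration, collection_lock, timeout)
    return True, future


# A prepared transaction holds the collection lock across the step up.
lock = threading.Lock()
lock.acquire()

print(step_up_inline(lock))  # False: the step-up thread itself is stuck

with ThreadPoolExecutor(max_workers=1) as executor:
    completed, recovery = step_up_async(lock, executor)
    print(completed)  # True: step up is not tied to the lock

lock.release()
```

In this model only the inline variant ties completion of step up to the availability of the lock, which matches the observation that the hang does not reproduce on 4.4.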
| Comment by Esha Maharishi (Inactive) [ 20/Jul/20 ] |
|
jack.mulrow, could you confirm this issue only affects master, not 4.4? On 4.4, I think migration recovery only acquires a collection IS lock. If so, this doesn't need to block 4.4 GA and I'll pass this over to tommaso.tocci.
|