[SERVER-66477] Deadlock during stepup when there is a prepared transaction and migration recipient recovery needs to be run Created: 16/May/22  Updated: 15/Sep/23  Resolved: 17/May/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 5.3.0, 6.0.0-rc0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jordi Serra Torrens Assignee: Jordi Serra Torrens
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File 0001-BF-25230-fix.patch     Text File 0001-Repro-BF-25230.patch    
Issue Links:
Depends
Related
related to SERVER-65947 MigrationDestinationManager must reco... Closed
is related to SERVER-81116 Prevent deadlock during stepup Open
Operating System: ALL
Steps To Reproduce:

./buildscripts/resmoke.py run --storageEngine=wiredTiger --storageEngineCacheSizeGB=.50 --suite=sharding jstests/sharding/repro-bf-25230.js --log=file

Sprint: Sharding EMEA 2022-05-30
Participants:
Linked BF Score: 153

 Description   

A deadlock is possible on stepup, caused by a particular interleaving of (i) a transaction starting and becoming prepared and (ii) a stepdown during a chunk migration.

Consider the following interleaving:
1. A chunk migration recipient has exited its critical section but has not yet removed its recovery document.
2. As soon as the critical section is released, a new transaction can start and reach the prepared state.
3. The recipient primary steps down. Because the migration recipient recovery document still exists, the new primary needs to recover the migration. While still in drain mode, this recovery involves reacquiring the critical section, which requires taking the collection lock in MODE_S.
However, because there is a prepared transaction (whose locks are reacquired earlier in the stepup sequence), the migration recovery cannot acquire the lock, and the stepup deadlocks.
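
The lock conflict in step 3 can be illustrated with a minimal sketch of a standard multi-granularity lock compatibility matrix. This is a toy model, not the server's lock manager code. The names `COMPATIBLE` and `can_grant` are hypothetical, and it assumes that the prepared transaction wrote to the collection and therefore holds an intent-exclusive (IX) collection lock, which the ticket does not state explicitly.

```python
# Standard multi-granularity lock compatibility matrix (illustrative model,
# not MongoDB's actual lock manager implementation).
# COMPATIBLE[held] is the set of modes that can be granted alongside `held`.
COMPATIBLE = {
    "IS": {"IS", "IX", "S"},
    "IX": {"IS", "IX"},
    "S": {"IS", "S"},
    "X": set(),
}


def can_grant(requested, held_modes):
    """Return True if `requested` is compatible with every mode already held."""
    return all(requested in COMPATIBLE[held] for held in held_modes)


# Assumption: the locks reacquired for the prepared transaction during stepup
# include an intent-exclusive (IX) lock on the collection.
held_by_prepared_txn = ["IX"]

# Migration recipient recovery asks for the collection lock in MODE_S.
print(can_grant("S", held_by_prepared_txn))   # False: S conflicts with IX
print(can_grant("IX", held_by_prepared_txn))  # True: intent locks coexist
```

Under that assumption, the MODE_S request has to wait until the prepared transaction releases its lock, which would explain why the recovery cannot make progress while the stepup is still draining.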



 Comments   
Comment by Jordi Serra Torrens [ 17/May/22 ]

This has been addressed by SERVER-65947.

Comment by Jordi Serra Torrens [ 16/May/22 ]

This could be addressed by moving the reacquisition of the migration recipient critical section so that it happens outside of drain mode, on the migrateThread instead. Additionally, we should then ensure that _recvChunkCommit waits for the critical section to be majority committed (currently we only wait for local write concern).
Justification: On the happy path, when _recvChunkCommit returns, the recipient is guaranteed to be holding the critical section. Also, since we use the RecoverableCriticalSection, if the recipient steps down after this point, the critical section will already be persisted. It is not necessary to retake it under drain mode.
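
The effect of the proposed reordering can be sketched with a toy model of the stepup sequence. This is only an illustration under stated assumptions, not the actual server logic: the function `step_up` and its event strings are hypothetical, the prepared transaction is assumed to hold a collection lock that conflicts with MODE_S, and it is assumed that the prepared transaction can only be committed or aborted after the stepup has finished.

```python
def step_up(recover_in_drain_mode, has_prepared_txn=True):
    """Toy model of the stepup ordering; returns the list of events that occur."""
    events = []
    collection_locks = []

    # Locks of prepared transactions are reacquired early in the stepup sequence.
    if has_prepared_txn:
        collection_locks.append("IX")  # assumption: the transaction holds IX
        events.append("prepared txn locks reacquired")

    if recover_in_drain_mode:
        # Original behaviour: recovery takes the collection lock in MODE_S
        # while the node is still draining.
        if "IX" in collection_locks:
            events.append("DEADLOCK: MODE_S request blocked behind prepared txn")
            return events
        events.append("critical section reacquired in drain mode")

    events.append("drain mode complete")

    # Assumption: the prepared transaction can only be resolved after stepup.
    if has_prepared_txn:
        collection_locks.remove("IX")
        events.append("prepared txn resolved, locks released")

    if not recover_in_drain_mode:
        # Proposed behaviour: migrateThread continues the recovery afterwards.
        events.append("migrateThread resumes recovery outside drain mode")
    return events


print(step_up(recover_in_drain_mode=True))
print(step_up(recover_in_drain_mode=False))
```

In this model, the first call stops at the blocked MODE_S request, while the second call completes drain mode because nothing in the drain path waits on the collection lock.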

Possible fix patch: 0001-BF-25230-fix.patch

The current PR for SERVER-65947 already implements this solution. Leaving this ticket open until a decision is made about possibly backporting SERVER-65947 to 6.0.

Generated at Thu Feb 08 06:05:33 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.