[SERVER-76884] [v4.4] Chunk migration recovery can deadlock on stepup taking MigrationBlockingGuard (v4.4 only) Created: 05/May/23  Updated: 29/Oct/23  Resolved: 11/May/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.0
Fix Version/s: 4.4.23

Type: Bug Priority: Major - P3
Reporter: Jordi Serra Torrens Assignee: Jordi Serra Torrens
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File repro-bf-28646.patch    
Issue Links:
Depends
Assigned Teams:
Sharding EMEA
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding EMEA 2023-05-15
Participants:
Linked BF Score: 31

 Description   

[Note: this can only happen on v4.4. v5.0 and later no longer take the MigrationBlockingGuard on stepup.]

The deadlock is the following:
(1) The OplogApplier thread holds the RSTL in MODE_X for stepup and is blocked trying to acquire the MigrationBlockingGuard.
(2) A chunk migration donor is still registered in the active migration registry, which is why (1) cannot acquire the MigrationBlockingGuard. The migration is trying to perform a write on the local replica set, but that write is blocked waiting to acquire the RSTL held by (1).
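
The circular wait in (1) and (2) can be illustrated with a small wait-for graph. This is only a sketch in Python, not server code: the actor labels and the find_cycle helper are hypothetical, and the "after interrupt" case simply assumes that interrupting the donor's blocked write (as the fix's commit title suggests) removes the donor's wait on the RSTL.

```python
from typing import Dict, List, Optional

# Hypothetical labels for illustration; these are not identifiers from the server.
OPLOG_APPLIER = "OplogApplier (stepup, holds RSTL in MODE_X)"
DONOR = "chunk migration donor (registered in active migration registry)"


def find_cycle(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return the actors forming a cycle in a wait-for graph, or None.

    Each actor waits on at most one other actor, so the graph is a simple
    mapping from waiter to the actor it is blocked behind.
    """
    for start in waits_for:
        path = [start]
        current = start
        while current in waits_for:
            current = waits_for[current]
            if current == start:
                return path
            if current in path:
                break  # a cycle that excludes `start`; another start will find it
            path.append(current)
    return None


# (1) stepup waits for the MigrationBlockingGuard, which the registered donor blocks.
# (2) the donor's local write waits for the RSTL, which stepup holds.
deadlocked = {OPLOG_APPLIER: DONOR, DONOR: OPLOG_APPLIER}
cycle = find_cycle(deadlocked)

# Assumption: once the donor's wait is interrupted, its edge disappears,
# so stepup is only waiting for the donor to unregister.
after_interrupt = {OPLOG_APPLIER: DONOR}
no_cycle = find_cycle(after_interrupt)

print(cycle)
print(no_cycle)
```

With both edges present the helper reports a two-actor cycle; with the donor's edge removed there is no cycle, which is the property any fix for this deadlock has to restore.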

The (rare) interleaving that triggers this deadlock can occur when a request that the donor sent to the recipient takes a long time to be processed: long enough for the donor node to step down and then step back up (the same node).



 Comments   
Comment by Githook User [ 11/May/23 ]

Author: {'name': 'Jordi Serra Torrens', 'email': 'jordi.serra-torrens@mongodb.com', 'username': 'jordist'}

Message: SERVER-76884 Interrupt completeMigration on stepdown
Branch: v4.4
https://github.com/mongodb/mongo/commit/75e3560c5307de7c80880eea83aa259312c51bb4

Generated at Thu Feb 08 06:33:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.