[SERVER-60521] Deadlock on stepup due to moveChunk command running uninterrupted on secondary Created: 07/Oct/21 Updated: 27/Oct/23 Resolved: 16/Feb/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 4.4.0, 5.0.0, 5.1.0-rc0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jordi Serra Torrens | Assignee: | Sergi Mateo Bellido |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | sharding-wfbf-sprint, shardingemea-qw |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Steps To Reproduce: | |
| Sprint: | Sharding EMEA 2021-10-18, Sharding EMEA 2022-02-21 |
| Participants: | |
| Linked BF Score: | 0 |
| Description |
|
Consider a shard that was running a moveChunk and had already persisted the migration recovery document. It then steps down, so the new primary will need to recover the migration. Consider the following interleaving: To fix this, we should make sure that moveChunk cannot run uninterrupted on a secondary. |
| Comments |
| Comment by Sergi Mateo Bellido [ 16/Feb/22 ] |
|
We recently made several fixes to moveChunk that removed this deadlock. Together with jordi.serra-torrens, we analyzed what would happen in that scenario, and everything seemed OK. |
| Comment by Sergi Mateo Bellido [ 15/Feb/22 ] |
|
The deadlock described in this ticket cannot happen anymore, since we no longer acquire the MigrationBlockingGuard as part of resumeMigrationCoordinationsOnStepUp ( Before analyzing what would happen on master, I would like to mention two relevant tasks that we implemented recently:
About the original problem (no changes from 1 to 5): |
| Comment by Kaloian Manassiev [ 18/Nov/21 ] |
|
This is still a problem, but it is very unlikely and is not causing noise in our testing. The fix requires an iteration, so I am putting it under the WFBF sprint category. |