[SERVER-60872] Deadlock between stepDown and TenantOplogApplier startup Created: 20/Oct/21 Updated: 06/Oct/23 Resolved: 06/Oct/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Lingzhi Deng | Assignee: | [DO NOT USE] Backlog - Server Serverless (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | serverless-shortlist | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: | |
| Assigned Teams: | Serverless |
| Operating System: | ALL |
| Sprint: | Server Serverless 2021-11-01, Server Serverless 2021-11-15, Server Serverless 2021-11-29, Server Serverless 2021-12-13, Server Serverless 2021-12-27, Server Serverless 2022-01-10, Server Serverless 2022-11-28 |
| Participants: | |
| Linked BF Score: | 0 |
| Description |
|
stepDown thread: Holding RSTL, blocked on TenantMigrationRecipientService::Instance mutex via interrupt.
We should also confirm that we don’t have a similar deadlock pattern elsewhere in the tenant migration donor or recipient code when making an opCtx under a mutex. |
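The pattern to audit for can be illustrated with a minimal sketch in plain C++. Everything here is a hypothetical stand-in, not server code: `OperationContext`, `Instance`, `makeOperationContext`, `startApplier`, and `interrupt` are invented names, and the sketch assumes only that creating the context may block while `interrupt` needs the same instance mutex. The idea shown is to create the context before taking the mutex and to hold the mutex only long enough to publish it.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for an operation context; not the server's type.
struct OperationContext {
    bool interrupted = false;
};

// Hypothetical stand-in for a migration instance guarded by one mutex.
class Instance {
public:
    // Assumption: in a real system this call may block. Here it only allocates.
    static std::unique_ptr<OperationContext> makeOperationContext() {
        return std::unique_ptr<OperationContext>(new OperationContext());
    }

    void startApplier() {
        // Step 1: do the potentially blocking work WITHOUT holding _mutex,
        // so a concurrent interrupt() is never stuck behind it.
        std::unique_ptr<OperationContext> opCtx = makeOperationContext();

        // Step 2: short critical section that only publishes the result.
        std::lock_guard<std::mutex> lk(_mutex);
        if (_interrupted) {
            return;  // interrupted in the meantime; the context is discarded
        }
        _applierOpCtx = std::move(opCtx);
    }

    void interrupt() {
        std::lock_guard<std::mutex> lk(_mutex);
        _interrupted = true;
        if (_applierOpCtx) {
            _applierOpCtx->interrupted = true;
        }
    }

    bool applierStarted() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _applierOpCtx != nullptr;
    }

    bool applierInterrupted() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _applierOpCtx != nullptr && _applierOpCtx->interrupted;
    }

private:
    std::mutex _mutex;
    bool _interrupted = false;
    std::unique_ptr<OperationContext> _applierOpCtx;
};
```

Because the instance may be interrupted between step 1 and step 2, the critical section re-checks the interrupted flag before publishing. Whether this shape fits the real recipient code is not something the ticket states.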
| Comments |
| Comment by Didier Nadeau [ 08/Mar/23 ] |
|
Moving this back to open as work was deprioritized and we didn't merge the PR. |
| Comment by Steven Vannelli [ 01/Dec/22 ] |
|
Removing from the sprint while Chris finishes up his other tickets. |
| Comment by Suganthi Mani [ 09/Nov/22 ] |
|
I spoke with jason.chan@mongodb.com and george.wangensteen@mongodb.com, and we decided to do a short-term quick fix as part of this ticket. This should address both of the deadlock bugs mentioned in this ticket. I also created SERVER-71207 to work out a long-term solution and assigned it to the Service Arch team. |
| Comment by Steven Vannelli [ 07/Nov/22 ] |
|
suganthi.mani@mongodb.com - to sync up with jason.chan@mongodb.com and george.wangensteen@mongodb.com about this ticket and what the next steps should be. |
| Comment by Suganthi Mani [ 16/Nov/21 ] |
|
There is a possibility of another deadlock as well. A lock order violation between primaryOnlyService(Mutex) and TenantMigrationRecipientService(Mutex) can cause a deadlock. So, just creating a new opCtx under an instance mutex lock can deadlock with stepdown. I believe the same problem exists in other POS services as well. CC matthew.saltz george.wangensteen |
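To make the lock order violation concrete, here is a minimal standard C++ sketch of two code paths that need the same two mutexes in opposite orders. `ServiceState`, `InstanceState`, and the three functions are hypothetical names chosen for illustration; they are not the server's types, and this is not the fix chosen for this ticket. The sketch uses `std::lock`, which acquires several mutexes with a deadlock-avoidance algorithm, so neither path can end up holding one mutex while waiting forever for the other.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for the two mutex owners; not server code.
struct ServiceState {
    std::mutex mutex;
    int instancesInterrupted = 0;
};

struct InstanceState {
    std::mutex mutex;
    int opsStarted = 0;
};

// Path A (think: stepdown interrupting an instance).
// Written naively, it would lock service.mutex, then instance.mutex.
void interruptInstance(ServiceState& service, InstanceState& instance) {
    // std::lock takes both mutexes without risking a lock-order deadlock.
    std::lock(service.mutex, instance.mutex);
    std::lock_guard<std::mutex> serviceLk(service.mutex, std::adopt_lock);
    std::lock_guard<std::mutex> instanceLk(instance.mutex, std::adopt_lock);
    ++service.instancesInterrupted;
}

// Path B (think: an instance starting an operation).
// Written naively, it would lock instance.mutex, then service.mutex.
void startOperation(ServiceState& service, InstanceState& instance) {
    std::lock(instance.mutex, service.mutex);
    std::lock_guard<std::mutex> instanceLk(instance.mutex, std::adopt_lock);
    std::lock_guard<std::mutex> serviceLk(service.mutex, std::adopt_lock);
    ++instance.opsStarted;
}

// Runs both paths concurrently and reports whether every call completed.
bool runBothPaths(int iterations) {
    ServiceState service;
    InstanceState instance;
    std::thread a([&] {
        for (int i = 0; i < iterations; ++i) interruptInstance(service, instance);
    });
    std::thread b([&] {
        for (int i = 0; i < iterations; ++i) startOperation(service, instance);
    });
    a.join();
    b.join();
    return service.instancesInterrupted == iterations &&
           instance.opsStarted == iterations;
}
```

With plain nested `std::lock_guard`s in the naive orders noted in the comments, the two threads could each take their first mutex and then block on the other. The sketch only covers ordinary mutexes; a lock such as the RSTL is acquired through a different mechanism, so this technique would not carry over to it directly.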