[SERVER-60872] Deadlock between stepDown and TenantOplogApplier startup Created: 20/Oct/21  Updated: 06/Oct/23  Resolved: 06/Oct/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Lingzhi Deng Assignee: [DO NOT USE] Backlog - Server Serverless (Inactive)
Resolution: Won't Fix Votes: 0
Labels: serverless-shortlist
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-71207 Lock ordering violation between POS m... Open
related to SERVER-53996 Recipient should not do reads/writes ... Closed
related to SERVER-52723 Handle oplog application restart in T... Closed
Assigned Teams:
Serverless
Operating System: ALL
Sprint: Server Serverless 2021-11-01, Server Serverless 2021-11-15, Server Serverless 2021-11-29, Server Serverless 2021-12-13, Server Serverless 2021-12-27, Server Serverless 2022-01-10, Server Serverless 2022-11-28
Participants:
Linked BF Score: 0

 Description   

stepDown thread: Holds the RSTL and is blocked on the TenantMigrationRecipientService::Instance mutex via interrupt().
recipientService thread: Holds the TenantMigrationRecipientService::Instance mutex and is blocked on the RSTL via TenantOplogBatcher->startup(). Its opCtx was not interrupted because the PrimaryOnlyService interrupts instances before it kills opCtxs.

We should also confirm that we don’t have a similar deadlock pattern elsewhere in the tenant migration donor or recipient code when creating an opCtx while holding a mutex.
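
As an illustration only, the interleaving above can be modeled with plain standard-library mutexes. Everything in this sketch is a hypothetical stand-in and not MongoDB code: `globalLock` plays the role of the RSTL, `ToyInstance` the role of a service instance, and `startupOutsideMutex` shows one possible way to avoid the pattern (never block on the global lock while holding the instance mutex). It is not necessarily the fix that was considered for this ticket.

```cpp
#include <mutex>
#include <thread>

// Hypothetical stand-in for the RSTL. Not MongoDB code.
std::mutex globalLock;

// Hypothetical stand-in for a service instance guarded by its own mutex.
class ToyInstance {
public:
    // Called by the "stepDown" side while it already holds globalLock.
    void interrupt() {
        std::lock_guard<std::mutex> lk(_mutex);
        _interrupted = true;
    }

    // Takes the instance mutex only to check state, releases it, and only
    // then blocks on globalLock. Returns true if startup went ahead.
    bool startupOutsideMutex() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            if (_interrupted) return false;
        }  // instance mutex released before any blocking acquisition

        std::lock_guard<std::mutex> global(globalLock);
        std::lock_guard<std::mutex> lk(_mutex);  // re-check after blocking
        if (_interrupted) return false;
        _started = true;
        return true;
    }

private:
    std::mutex _mutex;
    bool _interrupted = false;
    bool _started = false;
};

// Replays the scenario: "stepDown" holds the global lock, a worker tries to
// start up, and stepDown interrupts the instance before releasing the lock.
bool runStepDownScenario() {
    ToyInstance instance;
    bool startupResult = true;

    std::unique_lock<std::mutex> stepDown(globalLock);  // stepDown holds "RSTL"
    std::thread worker([&] { startupResult = instance.startupOutsideMutex(); });

    // Cannot deadlock: the worker never waits on globalLock while holding
    // the instance mutex, so this acquisition always completes.
    instance.interrupt();

    stepDown.unlock();
    worker.join();
    return startupResult;
}
```

In this model `runStepDownScenario()` returns `false`: the worker either sees the interrupt on its first check or blocks on `globalLock` until stepDown is done and then sees it on the re-check. If the worker instead held the instance mutex across the `globalLock` acquisition, `interrupt()` would wait on the worker while the worker waits on stepDown, which is the cycle described above.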



 Comments   
Comment by Didier Nadeau [ 08/Mar/23 ]

Moving this back to open as work was deprioritized and we didn't merge the PR.

Comment by Steven Vannelli [ 01/Dec/22 ]

Removing from the sprint while Chris finishes up his other tickets.

Comment by Suganthi Mani [ 09/Nov/22 ]

Spoke with jason.chan@mongodb.com and george.wangensteen@mongodb.com; we decided to do a short-term quick fix as part of this ticket, which should address both deadlock bugs mentioned here. I also created SERVER-71207 to work out a long-term solution and assigned it to the Service Arch team.

Comment by Steven Vannelli [ 07/Nov/22 ]

suganthi.mani@mongodb.com - to sync up with jason.chan@mongodb.com and george.wangensteen@mongodb.com about this ticket and what the next steps should be.

Comment by Suganthi Mani [ 16/Nov/21 ]

Another deadlock is also possible.
1) Tenant migration oplog applier startup thread:
TenantMigrationRecipientService(Mutex) -> OplogApplier(Mutex) -> OplogBatcher(Mutex) -> primaryOnlyService(Mutex) (PrimaryOnlyService::registerOpCtx()).
2) StepDown thread:
RSTL (in X mode) -> primaryOnlyService(Mutex) -> TenantMigrationRecipientService(Mutex) (TenantMigrationRecipientService::Instance::interrupt())

A lock order violation between primaryOnlyService(Mutex) and TenantMigrationRecipientService(Mutex) can cause a deadlock. So just creating a new opCtx while holding an instance mutex can deadlock with stepdown. I believe the same problem exists in other POS services as well. CC matthew.saltz george.wangensteen
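
The two acquisition chains in this comment can be checked mechanically by treating each "held, then acquired" pair as a directed edge and looking for a cycle. The sketch below is a generic lock-order graph check, not an existing MongoDB facility; the helper names (`addChain`, `hasLockOrderCycle`, `ticketGraph`) are made up for illustration, and the node names are simply the labels used in the chains above.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Directed graph: an edge A -> B means "B was acquired while A was held".
using LockGraph = std::map<std::string, std::set<std::string>>;

// Record one edge per adjacent pair in an acquisition chain.
void addChain(LockGraph& g, const std::vector<std::string>& chain) {
    for (std::size_t i = 0; i + 1 < chain.size(); ++i) {
        g[chain[i]].insert(chain[i + 1]);
        g[chain[i + 1]];  // make sure the target node exists in the map
    }
}

namespace detail {
enum class Mark { Unvisited, InProgress, Done };

// Depth-first search; reaching an InProgress node means a back edge (cycle).
bool dfs(const LockGraph& g, const std::string& node,
         std::map<std::string, Mark>& marks) {
    marks[node] = Mark::InProgress;
    for (const auto& next : g.at(node)) {
        Mark m = marks[next];
        if (m == Mark::InProgress) return true;
        if (m == Mark::Unvisited && dfs(g, next, marks)) return true;
    }
    marks[node] = Mark::Done;
    return false;
}
}  // namespace detail

// True if some set of threads following the recorded orders could deadlock.
bool hasLockOrderCycle(const LockGraph& g) {
    std::map<std::string, detail::Mark> marks;
    for (const auto& entry : g) {
        if (marks[entry.first] == detail::Mark::Unvisited &&
            detail::dfs(g, entry.first, marks)) {
            return true;
        }
    }
    return false;
}

// The two chains from this comment, using the comment's own labels.
LockGraph ticketGraph() {
    LockGraph g;
    addChain(g, {"TenantMigrationRecipientService", "OplogApplier",
                 "OplogBatcher", "primaryOnlyService"});
    addChain(g, {"RSTL", "primaryOnlyService",
                 "TenantMigrationRecipientService"});
    return g;
}
```

With both chains recorded, `hasLockOrderCycle(ticketGraph())` reports a cycle (TenantMigrationRecipientService -> ... -> primaryOnlyService -> TenantMigrationRecipientService). Either chain on its own is acyclic, which matches the observation that the problem only appears when startup and stepdown run concurrently.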

Generated at Thu Feb 08 05:50:58 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.