[SERVER-61785] API for detecting conflicts among TenantMigrationDonor POS instances is racy. Created: 29/Nov/21  Updated: 27/Oct/23  Resolved: 30/Nov/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Suganthi Mani Assignee: [DO NOT USE] Backlog - Server Serverless (Inactive)
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-60752 API for detecting conflicts among Pri... Closed
is related to SERVER-60953 TenantMigrationDonorService::getDurab... Closed
Assigned Teams:
Serverless
Operating System: ALL
Participants:

 Description   

SERVER-60752 introduced a new API for detecting conflicts among POS instances. As part of that ticket, to detect conflicts where a different migration ID uses the same tenant ID, the tenant migration donor (TMD) API was made to wait for the existing TMD instance's initial state doc to be majority committed. This can lead to 3-way deadlocks. I think SERVER-60953 accidentally fixed that 3-way deadlock issue but made the API (PrimaryOnlyService::checkIfConflictsWithOtherInstances) racy.

Racy scenario:

TMD Instance 1: Migration ID 1 + Tenant ID 1
TMD Instance 2: Migration ID 2 + Tenant ID 1

1) Instance 1: Calls getOrCreateInstance().
2) Instance 2: Calls getOrCreateInstance().
  • Acquires POS mutex
    • Calls checkIfConflictsWithOtherInstances()
      • No conflicts detected, as Instance 1's durable state is empty
    • Creates Instance 2
    • Schedules Instance2::run() to run asynchronously
    • Inserts Instance 2 into the POS in-memory map
  • Releases POS mutex
3) Instance 1: Instance1::run() starts.
  • Persists state doc with Migration ID 1 + Tenant ID 1
  • Durable state gets updated to non-empty
4) Instance 2: Instance2::run() starts.
  • Persists state doc with Migration ID 2 + Tenant ID 1
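
The suspected interleaving can be replayed deterministically with a small toy model. This is only a sketch of the scenario as described in this ticket, not MongoDB code: `ToyService`, `Instance`, `conflicts`, and `replay_timeline` are hypothetical names, and the model assumes the conflict check skips any existing instance whose durable state is still empty.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Instance:
    migration_id: int
    tenant_id: int
    # Empty until run() persists the state doc.
    durable_state: Optional[dict] = None

    def run(self, storage: List[dict]) -> None:
        """Persist the state doc and mark the durable state as non-empty."""
        doc = {"migration_id": self.migration_id, "tenant_id": self.tenant_id}
        storage.append(doc)
        self.durable_state = doc


class ToyService:
    """Hypothetical stand-in for a primary-only service (not MongoDB code)."""

    def __init__(self) -> None:
        self.instances: Dict[int, Instance] = {}

    def conflicts(self, tenant_id: int) -> bool:
        # ASSUMPTION under test: instances with an empty durable state
        # are skipped by the tenant ID conflict check.
        return any(
            inst.durable_state is not None
            and inst.durable_state["tenant_id"] == tenant_id
            for inst in self.instances.values()
        )

    def get_or_create(self, migration_id: int, tenant_id: int) -> Instance:
        # The whole body stands for the section guarded by the POS mutex.
        if migration_id in self.instances:
            return self.instances[migration_id]
        if self.conflicts(tenant_id):
            raise RuntimeError("conflicting migration for this tenant")
        inst = Instance(migration_id, tenant_id)
        self.instances[migration_id] = inst
        return inst


def replay_timeline() -> List[dict]:
    """Replay the four steps of the scenario in order."""
    storage: List[dict] = []
    service = ToyService()
    inst1 = service.get_or_create(migration_id=1, tenant_id=1)
    inst2 = service.get_or_create(migration_id=2, tenant_id=1)  # not rejected
    inst1.run(storage)
    inst2.run(storage)
    return storage
```

Under that assumption, `replay_timeline()` ends with two persisted state docs for the same tenant, which is the outcome the ticket is worried about.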

Additional notes on 3-way deadlock scenario:
1) Instance 2 holds the POS mutex (as part of getOrCreateInstance()) and waits for Instance 1's initial state doc to be majority committed.
2) Stepdown thread holds RSTL in mode X and tries to acquire POS mutex to execute POS onStepDown() (to interrupt active instances) and blocks behind Instance 2.
3) Instance 1 tries to acquire RSTL in IX mode to write the initial state doc but blocks behind the stepdown thread.
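
The three steps above form a cycle in the wait-for graph, which is what makes this a deadlock. A minimal sketch that encodes each "blocks behind" relation as an edge and checks for a cycle; the `find_cycle` helper and the `WAITS_FOR` table are illustrative names, not part of the server code.

```python
from typing import Dict, List, Optional


def find_cycle(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return the participants of a wait-for cycle, or None if there is none.

    Each participant waits for at most one other participant.
    """
    for start in waits_for:
        path = [start]
        seen = {start}
        node = start
        while node in waits_for:
            node = waits_for[node]
            if node == start:
                return path  # walked back to the start: cycle found
            if node in seen:
                break  # ran into a cycle that does not include start
            seen.add(node)
            path.append(node)
    return None


# Edges taken from the three steps in the ticket.
WAITS_FOR = {
    # Holds the POS mutex, waits for Instance 1's state doc majority commit.
    "Instance 2": "Instance 1",
    # Holds RSTL in mode X, waits for the POS mutex held by Instance 2.
    "Stepdown thread": "Instance 2",
    # Needs RSTL in mode IX, blocked behind the stepdown thread.
    "Instance 1": "Stepdown thread",
}

cycle = find_cycle(WAITS_FOR)
```

Removing any one edge, for example Instance 2 no longer waiting on Instance 1's majority commit while holding the POS mutex, breaks the cycle and `find_cycle` returns `None`.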



 Comments   
Comment by Suganthi Mani [ 30/Nov/21 ]

I misread the code. I somehow thought we skip the tenant ID conflict check when the durable state of the existing instance is empty. The code has no race, so I am closing this ticket.
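
For illustration, a toy model of the corrected reading: the tenant ID check is not skipped when the existing instance's durable state is empty. This is a sketch, not the actual implementation. It assumes the existing instance's tenant ID is available in memory from creation time, and `ToyInstance`, `conflicts`, `get_or_create`, and `second_migration_rejected` are hypothetical names.

```python
from typing import Dict, Optional


class ToyInstance:
    """Hypothetical in-memory instance (not MongoDB code)."""

    def __init__(self, migration_id: int, tenant_id: int) -> None:
        self.migration_id = migration_id
        self.tenant_id = tenant_id  # assumed known in memory from creation
        self.durable_state: Optional[dict] = None  # still empty before run()


def conflicts(existing: Dict[int, ToyInstance], tenant_id: int) -> bool:
    # ASSUMPTION: the tenant ID is compared whether or not the existing
    # instance has persisted its state doc yet.
    return any(inst.tenant_id == tenant_id for inst in existing.values())


def get_or_create(
    existing: Dict[int, ToyInstance], migration_id: int, tenant_id: int
) -> ToyInstance:
    # The whole body stands for the section guarded by the POS mutex.
    if migration_id in existing:
        return existing[migration_id]
    if conflicts(existing, tenant_id):
        raise RuntimeError("conflicting migration for this tenant")
    inst = ToyInstance(migration_id, tenant_id)
    existing[migration_id] = inst
    return inst


def second_migration_rejected() -> bool:
    """Create migration 1, then try migration 2 for the same tenant."""
    existing: Dict[int, ToyInstance] = {}
    get_or_create(existing, migration_id=1, tenant_id=1)
    try:
        get_or_create(existing, migration_id=2, tenant_id=1)
    except RuntimeError:
        return True
    return False
```

In this model the second `get_or_create` call is rejected even though Instance 1 has not yet persisted anything, so the two-state-doc outcome cannot occur.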

Generated at Thu Feb 08 05:53:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.