  1. Core Server
  2. SERVER-61785

API for detecting conflicts among TenantMigrationDonor POS instances is racy.

    • Type: Bug
    • Resolution: Works as Designed
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Labels:
      None
    • Serverless
    • ALL

      SERVER-60752 introduced a new API for detecting conflicts among POS instances. As part of that ticket, to detect conflicts between different migration IDs that share the same tenant ID, the tenant migration donor (TMD) API was made to wait for the existing TMD instance's initial state doc to be majority committed. This can lead to a 3-way deadlock. I think SERVER-60953 accidentally fixed that 3-way deadlock but made the API (PrimaryOnlyService::checkIfConflictsWithOtherInstances) racy.

      Racy scenario:

      TMD Instance 1: Migration ID 1 + Tenant ID 1
      TMD Instance 2: Migration ID 2 + Tenant ID 1

      1) Instance 1 calls getOrCreateInstance().
      2) Instance 2 calls getOrCreateInstance().
         • Acquires POS mutex
           • Calls checkIfConflictsWithOtherInstances()
             • No conflicts detected, as Instance 1's durable state is empty
           • Creates Instance 2
           • Schedules Instance::run() to run asynchronously
           • Inserts Instance 2 into the POS in-memory map
         • Releases POS mutex
      3) Instance1::run() starts.
         • Persists state doc with Migration ID 1 + Tenant ID 1
         • Durable state gets updated to non-empty
      4) Instance2::run() starts.
         • Persists state doc with Migration ID 2 + Tenant ID 1
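
The interleaving above is a check-then-act race: the conflict check reads durable state that the earlier instance has not written yet. A minimal single-threaded sketch of that ordering follows, assuming C++17. The Service, Instance, and StateDoc types and both replay functions are hypothetical stand-ins written for illustration; they are not the real PrimaryOnlyService classes or signatures.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; not MongoDB code.
struct StateDoc {
    std::string migrationId;
    std::string tenantId;
};

struct Instance {
    std::string migrationId;
    std::string tenantId;
    // Stays empty until run() persists the initial state doc.
    std::optional<StateDoc> durableState;

    // Models Instance::run(): persists the initial state doc.
    void run(std::vector<StateDoc>& collection) {
        durableState = StateDoc{migrationId, tenantId};
        collection.push_back(*durableState);
    }
};

struct Service {
    std::map<std::string, Instance> instances;  // keyed by migration ID

    // Models the racy check: an existing instance is only visible as a
    // conflict once its durable state is non-empty.
    bool conflictsWithOtherInstances(const std::string& migrationId,
                                     const std::string& tenantId) const {
        for (const auto& [id, inst] : instances) {
            if (id != migrationId && inst.durableState &&
                inst.durableState->tenantId == tenantId) {
                return true;
            }
        }
        return false;
    }

    // Models getOrCreateInstance(): check, then insert. The caller invokes
    // run() later, which stands in for "scheduled asynchronously".
    Instance* getOrCreateInstance(const std::string& migrationId,
                                  const std::string& tenantId) {
        if (conflictsWithOtherInstances(migrationId, tenantId)) {
            return nullptr;  // conflict detected, no instance created
        }
        auto it = instances.try_emplace(
            migrationId, Instance{migrationId, tenantId, std::nullopt}).first;
        return &it->second;
    }
};

// Replays the racy order: both instances are created before either runs.
// Returns the number of state docs persisted for the shared tenant.
int replayRacyInterleaving() {
    Service pos;
    std::vector<StateDoc> collection;
    Instance* i1 = pos.getOrCreateInstance("migration1", "tenant1");
    Instance* i2 = pos.getOrCreateInstance("migration2", "tenant1");
    if (i1) i1->run(collection);
    if (i2) i2->run(collection);
    return static_cast<int>(collection.size());
}

// Contrast case: Instance 1 persists its state doc before Instance 2 is
// created, so the check sees the conflict.
int replaySerializedOrder() {
    Service pos;
    std::vector<StateDoc> collection;
    Instance* i1 = pos.getOrCreateInstance("migration1", "tenant1");
    if (i1) i1->run(collection);
    Instance* i2 = pos.getOrCreateInstance("migration2", "tenant1");
    if (i2) i2->run(collection);
    return static_cast<int>(collection.size());
}
```

In this model, replayRacyInterleaving() ends with two state docs for the same tenant, while replaySerializedOrder() ends with one because the second getOrCreateInstance() call is rejected. The sketch only shows the ordering problem and says nothing about how the real service should resolve it.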

      Additional notes on the 3-way deadlock scenario:
      1) Instance 2 holds the POS mutex (as part of getOrCreateInstance()) and waits for Instance 1's initial state doc to be majority committed.
      2) The stepdown thread holds the RSTL in mode X and tries to acquire the POS mutex to execute POS onStepDown() (to interrupt active instances), but blocks behind Instance 2.
      3) Instance 1 tries to acquire the RSTL in IX mode to write the initial state doc, but blocks behind the stepdown thread.
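
The three steps above form a cycle in a wait-for graph, which is what makes this a deadlock rather than ordinary contention. A small sketch, assuming C++17, that encodes those three edges and walks them; WaitForGraph, isDeadlocked, and threeWayDeadlockGraph are hypothetical names introduced here for illustration and do not correspond to anything in the server code.

```cpp
#include <map>
#include <set>
#include <string>

// Maps each blocked participant to the participant it is waiting on.
// In this scenario every participant waits on at most one other.
using WaitForGraph = std::map<std::string, std::string>;

// Follows wait-for edges from `start`. Returns true if the walk comes back
// to `start`, meaning `start` is part of a cycle.
bool isDeadlocked(const WaitForGraph& graph, const std::string& start) {
    std::set<std::string> seen;
    std::string current = start;
    while (true) {
        auto it = graph.find(current);
        if (it == graph.end()) {
            return false;  // `current` is not blocked, so the chain ends
        }
        current = it->second;
        if (current == start) {
            return true;  // walked back to where we began
        }
        if (!seen.insert(current).second) {
            return false;  // a cycle exists, but `start` is not on it
        }
    }
}

// The three edges described in the notes above.
WaitForGraph threeWayDeadlockGraph() {
    return {
        // Holds the POS mutex; waits for Instance 1's state doc to be
        // majority committed.
        {"Instance 2", "Instance 1"},
        // Holds the RSTL in mode X; waits for the POS mutex.
        {"Stepdown thread", "Instance 2"},
        // Waits for the RSTL in IX mode to write the state doc.
        {"Instance 1", "Stepdown thread"},
    };
}
```

Removing the "Instance 2" edge from the graph (that is, no longer waiting for majority commit while holding the POS mutex) breaks the cycle in this model, which matches the ticket's observation that the deadlock went away once that wait was dropped.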

            Assignee:
            backlog-server-serverless [DO NOT USE] Backlog - Server Serverless (Inactive)
            Reporter:
            suganthi.mani@mongodb.com Suganthi Mani