  1. Core Server
  2. SERVER-103744

Deadlock between renameCollection, dbHash, and prepared transaction

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Catalog and Routing
    • ALL
    • CAR Team 2025-04-28
    • 200
    • 2

      When renameCollection runs on the primary, it calls validateAndRunRenameCollection, which in this case calls renameCollectionWithinDb. That path takes the database lock in IX mode and the collection locks in X mode (see shardRole).

      On the secondary, however, applying a renameCollection oplog entry calls renameCollectionForApplyOps, which always takes the database lock in X mode.
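As a rough illustration of why the two database lock modes behave differently, the sketch below encodes the standard multi-granularity lock compatibility rules for IS, IX, S, and X. The table layout and the helper name `compatible` are illustrative assumptions, not server code.

```python
# Standard multi-granularity lock compatibility: for each held mode,
# the set of modes another operation may acquire concurrently.
COMPATIBLE = {
    "IS": {"IS", "IX", "S"},
    "IX": {"IS", "IX"},
    "S": {"IS", "S"},
    "X": set(),
}


def compatible(held: str, requested: str) -> bool:
    """Return True if `requested` can be granted while `held` is held."""
    return requested in COMPATIBLE[held]


# Database lock modes named in this ticket.
primary_rename_mode = "IX"    # renameCollectionWithinDb
secondary_rename_mode = "X"   # renameCollectionForApplyOps

# Against a concurrent holder of the database lock in IS mode:
primary_blocked = not compatible("IS", primary_rename_mode)
secondary_blocked = not compatible("IS", secondary_rename_mode)
```

Under these rules an IX request coexists with an IS holder, while an X request has to wait for every other holder to release the lock.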

      In BF-37281 we see the following scenario.
      On the primary:
      1. A transaction on "db.coll1" is prepared. It takes the database lock in IX mode and the collection lock on coll1 in X mode. This is the oplog entry at timestamp 1.
      2. Renaming "db.coll2" to "db.coll3" succeeds on the primary: it takes the database lock in IX mode and the source and target collection locks in X mode, and neither collection is touched by the prepared transaction, so there is no conflict. This is the oplog entry at timestamp 2.
      3. The transaction commits. This is the oplog entry at timestamp 3.

      On the secondary:
      1. The prepare for the transaction on "db.coll1" is applied. Secondaries yield the locks of a prepared transaction once it is prepared.
      2. dbHash runs on "db", outside of oplog application. It takes the database lock in IS mode and holds it while it walks the collections one by one, without releasing it between collections. For each collection, it takes the collection lock in IS mode and must also wait out any prepare conflict on that collection. Here, dbHash is waiting on the prepare conflict for "db.coll1".
      3. Applying the renameCollection oplog entry runs renameCollectionForApplyOps, which requests the database lock in X mode and is therefore blocked behind dbHash, which holds it in IS mode.
      4. The commitTransaction oplog entry comes after the renameCollection oplog entry, so the secondary cannot resolve the prepare conflict until the rename is applied. The result is a three-way deadlock.

      It seems that renameCollection should take the same locks on the primary and on the secondary, which would avoid this issue.
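The secondary-side scenario above can be sketched as a wait-for graph, where an edge A -> B means "A cannot proceed until B does". The node names and the `find_cycle` helper are hypothetical and only model the dependencies described in this ticket; the second graph assumes, as proposed above, that the secondary's rename would no longer wait on dbHash if it took the database lock in IX mode.

```python
from typing import Dict, List, Optional


def find_cycle(waits_for: Dict[str, List[str]]) -> Optional[List[str]]:
    """Depth-first search for a cycle; returns the cycle path or None."""
    visiting: List[str] = []
    done = set()

    def visit(node: str) -> Optional[List[str]]:
        if node in done:
            return None
        if node in visiting:
            # Back edge: the cycle is the path from the first visit to here.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for nxt in waits_for.get(node, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in waits_for:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# Dependencies on the secondary as described in this ticket.
observed = {
    # dbHash waits on the prepare conflict, resolved only by the commit.
    "dbHash": ["commitTransaction"],
    # The commit entry is applied after the earlier rename entry.
    "commitTransaction": ["renameCollection"],
    # The rename wants the db lock in X mode; dbHash holds it in IS mode.
    "renameCollection": ["dbHash"],
}

# Assumed outcome of the proposal: rename takes IX, so no wait on dbHash.
proposed = {
    "dbHash": ["commitTransaction"],
    "commitTransaction": ["renameCollection"],
    "renameCollection": [],
}

deadlock = find_cycle(observed)
no_deadlock = find_cycle(proposed)
```

In this model the only edge that differs between the two graphs is the one created by the X-mode database lock, which is why aligning the lock modes breaks the cycle.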

            Assignee:
            joan.bruguera-mico@mongodb.com Joan Bruguera Micó
            Reporter:
            huayu.ouyang@mongodb.com Huayu Ouyang
            Votes:
            0
            Watchers:
            12
