Collection Cache Recoverer can return wrong data in the presence of concurrent DDLs


    • Catalog and Routing
    • Fully Compatible
    • ALL
    • CAR Team 2026-06-22

      Filtering metadata recovery relies on the following DDL pseudo-process:

      • The critical section is acquired
      • Durable changes are made
      • An oplog 'c' entry is produced and notified to the recoverer
      • The critical section is released
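The following is a minimal, single-threaded sketch of the intended pseudo-process. All names in it are hypothetical and are not the server's actual classes: `Recoverer`, `notify`, `drain_and_apply`, `run_ddl` and `CRITICAL_SECTION`. It shows why step 3 matters. A recovery that read metadata from disk before the DDL still installs the DDL's result, because the notified oplog entry is applied during the drain.

```python
import threading
from dataclasses import dataclass, field

# Hypothetical stand-in for the collection's critical section.
CRITICAL_SECTION = threading.Lock()


@dataclass
class Recoverer:
    """Hypothetical model of the filtering metadata recoverer."""
    pending: list = field(default_factory=list)  # notified oplog 'c' entries

    def notify(self, entry):
        self.pending.append(entry)

    def drain_and_apply(self, metadata_from_disk):
        # Apply every entry notified since the disk read; the last one wins.
        metadata = metadata_from_disk
        while self.pending:
            metadata = self.pending.pop(0)["new_metadata"]
        return metadata


def run_ddl(state, recoverer, new_metadata, log):
    """The four-step DDL pseudo-process, with each step recorded in `log`."""
    with CRITICAL_SECTION:                                         # 1. acquired
        log.append("acquire critical section")
        state["disk"] = new_metadata                               # 2. durable change
        log.append("durable change")
        recoverer.notify({"op": "c", "new_metadata": new_metadata})  # 3. notify
        log.append("notify recoverer")
    log.append("release critical section")                         # 4. released


state = {"disk": "v1"}
recoverer = Recoverer()
log = []

read = state["disk"]                          # recoverer reads "v1" from disk
run_ddl(state, recoverer, "v2", log)          # a whole DDL commits in between
installed = recoverer.drain_and_apply(read)   # the notified entry brings it to "v2"
print(installed)  # v2
```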

      However, the "notified to the recoverer" step is missing on primaries: they only record the oplog entry and broadcast it to secondaries, without notifying the recoverer. Instead, the CSR is almost always left holding the final metadata values when the critical section is released. Where that isn't the case (such as rename, or some resharding paths), the recoverer could return stale data on primaries if the following were to happen:

      • The recoverer starts, reads from disk, and pauses just before calling drainAndApply
      • An entire DDL commits on the primary
      • Because the recoverer was never notified, drainAndApply returns immediately
      • Recovery proceeds to install the stale metadata it read from disk, since nothing communicated the DDL's changes to the recoverer
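The interleaving above can be reproduced deterministically in a small Python model. It is a sketch under stated assumptions: `disk`, `notifications`, `oplog`, `drain_and_apply`, `recoverer` and `ddl_on_primary` are all hypothetical. `threading.Event` is used only to force the recoverer to pause between its disk read and the drain while the DDL commits. Because this model's primary path records the oplog entry without notifying the recoverer, the drain finds nothing to apply.

```python
import threading

disk = {"metadata": "v1"}    # hypothetical durable metadata
notifications = []           # entries notified to the recoverer
oplog = []                   # entries recorded and broadcast to secondaries
notifications_lock = threading.Lock()

paused_before_drain = threading.Event()
ddl_committed = threading.Event()
installed = {}


def drain_and_apply(metadata):
    # Hypothetical drain: apply whatever was notified since the disk read.
    with notifications_lock:
        entries = notifications[:]
        notifications.clear()
    for entry in entries:
        metadata = entry
    return metadata


def recoverer():
    snapshot = disk["metadata"]      # 1. reads "v1" from disk
    paused_before_drain.set()        #    and pauses before drainAndApply
    ddl_committed.wait()
    installed["metadata"] = drain_and_apply(snapshot)  # 3. nothing to drain


def ddl_on_primary():
    paused_before_drain.wait()
    disk["metadata"] = "v2"          # 2. durable change (critical section elided)
    oplog.append("v2")               #    recorded and broadcast, but no notify
    ddl_committed.set()


threads = [threading.Thread(target=recoverer), threading.Thread(target=ddl_on_primary)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(installed["metadata"], disk["metadata"])  # v1 v2: 4. stale metadata installed
```

If `ddl_on_primary` also appended `"v2"` to `notifications`, as the intended pseudo-process requires, `drain_and_apply` would return `"v2"` instead.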

            Assignee:
            Joan Bruguera Micó
            Reporter:
            Jordi Olivares Provencio
            Votes:
            0
            Watchers:
            5
