Operations that check feature flags with enable_on_transitional_fcv: true and write to disk can race with setFCV downgrade


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Catalog and Routing
    • ALL
    • CAR Team 2025-06-23, CAR Team 2025-07-07
    • 3
    • TBD
    • 🟩 Routing and Topology

      The following sequence of events can occur:
      1. setFCV starts the downgrade and writes the FCV downgrading oplog entry (the FCV is now downgrading_to_XY).
      2. setFCV takes and releases the global lock.
      3. A separate operation starts, takes the global lock, and checks a feature flag. The FCV is downgrading_to_XY, but because enable_on_transitional_fcv is set to true, the feature flag is still enabled.
      4. setFCV writes the FCV downgraded oplog entry.
      5. The operation finishes and, based on its earlier feature flag check, writes in the upgraded format (such as writing an oplog entry) or communicates with other nodes using an upgraded protocol. In the sequence of oplog entries, however, this write comes after the FCV downgraded oplog entry, at which point the feature flag should be disabled, so the result is inconsistent.
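
      The interleaving above can be replayed deterministically in a small model. This is only an illustrative sketch, not server code: the `FCV` enum, the `Node` class, the `flag_enabled` helper, and the oplog entry strings are all hypothetical names invented for this example, and the flag semantics are simplified to what this ticket describes.

```python
from enum import Enum


class FCV(Enum):
    UPGRADED = "upgraded"
    DOWNGRADING = "downgrading_to_XY"
    DOWNGRADED = "downgraded"


class Node:
    """Hypothetical, heavily simplified model of a node's FCV and oplog."""

    def __init__(self):
        self.fcv = FCV.UPGRADED
        self.oplog = []

    def flag_enabled(self, enable_on_transitional_fcv):
        # Simplified assumption: enabled on the upgraded FCV, enabled on the
        # transitional FCV only if enable_on_transitional_fcv is true,
        # disabled once the downgrade has completed.
        if self.fcv is FCV.UPGRADED:
            return True
        if self.fcv is FCV.DOWNGRADING:
            return enable_on_transitional_fcv
        return False


def simulate_race():
    node = Node()

    # Step 1: setFCV starts the downgrade and logs the transitional state.
    node.fcv = FCV.DOWNGRADING
    node.oplog.append("fcv:downgrading_to_XY")

    # Step 2: setFCV takes and releases the global lock (no-op in this model).

    # Step 3: a separate operation checks the flag while FCV is transitional.
    use_upgraded_format = node.flag_enabled(enable_on_transitional_fcv=True)

    # Step 4: setFCV completes and logs the downgraded state.
    node.fcv = FCV.DOWNGRADED
    node.oplog.append("fcv:downgraded")

    # Step 5: the operation writes using the decision it made in step 3.
    if use_upgraded_format:
        node.oplog.append("write:upgraded_format")
    else:
        node.oplog.append("write:downgraded_format")
    return node.oplog


oplog = simulate_race()
print(oplog)
```

      In this model the upgraded-format write ends up as the last oplog entry, after the downgraded marker, because the flag decision from step 3 is never re-checked before the write in step 5.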

      Importantly, there is no guarantee that all writes after the FCV downgrade completes will be in the downgraded format. If binaries are swapped to downgraded binaries immediately afterwards, this could be an issue, since nodes on the downgraded binary would not be able to read an oplog entry in the upgraded format.
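
      One way to state the violated expectation is as a check over the oplog: no upgraded-format entry should appear after the FCV downgraded entry. The sketch below is an assumption-laden illustration only; the function name and the entry strings (`"fcv:downgraded"`, the `":upgraded_format"` suffix) are hypothetical and do not correspond to real oplog fields.

```python
def upgraded_entries_after_downgrade(oplog):
    """Return indexes of upgraded-format entries that follow the downgrade marker.

    `oplog` is a list of hypothetical string labels, not real oplog documents.
    """
    try:
        marker = oplog.index("fcv:downgraded")
    except ValueError:
        # The downgrade never completed, so there is nothing to violate.
        return []
    return [
        i
        for i in range(marker + 1, len(oplog))
        if oplog[i].endswith(":upgraded_format")
    ]


racy = ["fcv:downgrading_to_XY", "fcv:downgraded", "write:upgraded_format"]
safe = ["fcv:downgrading_to_XY", "write:upgraded_format", "fcv:downgraded"]
print(upgraded_entries_after_downgrade(racy))
print(upgraded_entries_after_downgrade(safe))
```

      An entry flagged by such a check is one that a node on the downgraded binary might be unable to read.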

      I believe there are only three use cases of enable_on_transitional_fcv: true right now. We should check whether they write to disk or communicate between nodes, and whether this race could be a concern for them.

            Assignee:
            Joan Bruguera Micó
            Reporter:
            Huayu Ouyang