Idea: feature flag improvement for MQL and wire protocol changes

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Catalog and Routing
    • 3
    • TBD

      Background

      Consider a server change like SERVER-91281, which wants the $sort stage to accept a new option, 'outputSortKeyMetadata: true/false'. This is challenging to roll out safely during a rolling upgrade or downgrade.

      If you want to use this new option for optimizations (for example, $setWindowFields would like to take advantage of this), you need to be careful to only do so when you know that any node (mongod) participating in the query will understand your request. A pipeline may need to be routed across the network in the face of sharded collections, and unknown options are typically rejected, which would fail the query.
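
      The failure mode described above can be illustrated with a toy model. This is only a sketch, not the server's actual parser: `parse_stage_options`, `UnknownOptionError`, the option sets, and the `sortPattern` key are hypothetical names invented for illustration. Only `outputSortKeyMetadata` comes from the ticket.

```python
class UnknownOptionError(Exception):
    """Raised when a stage spec carries an option this binary does not know."""


def parse_stage_options(options, known_options):
    # Strict parsing: any option outside the known set fails the whole request.
    unknown = set(options) - set(known_options)
    if unknown:
        raise UnknownOptionError(f"unknown option(s): {sorted(unknown)}")
    return dict(options)


# Hypothetical option sets for an old and a new binary.
OLD_BINARY_OPTIONS = {"sortPattern"}
NEW_BINARY_OPTIONS = {"sortPattern", "outputSortKeyMetadata"}

# A request produced by an upgraded node that chose to use the new option.
request = {"sortPattern": {"a": 1}, "outputSortKeyMetadata": True}


def accepts(known_options):
    # True if a node with these known options would accept the request.
    try:
        parse_stage_options(request, known_options)
        return True
    except UnknownOptionError:
        return False
```

      In this model an upgraded node accepts the request, while a node still on the old binary rejects it, which is why the sender must know what every participant understands before using the option.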

      Our previous solutions to this problem were to either
      (a) Mistakenly conclude that, because the router is the last node upgraded in the upgrade cycle, the router can send new options with confidence that all other nodes will understand them. This is not correct, but it is hard to catch in tests. A query or sub-pipeline may be routed from one mongod to another, for example when a $lookup operation acts as a router on a shard to go find the base collection data. In this scenario, there is no guarantee about which mongod version is sending the request or which mongod version is receiving it.
      (b) Use an FCV-gated feature flag to check. This mostly works, or at least has been our answer historically. But it still leaves room for edge cases where one node checks the FCV and gets a different answer than another node. For this example, I don't think this poses a real problem, since a mixed-response FCV does imply that the binary version is upgraded. But it is challenging to reason about.

      Proposal

      I was speaking with joan.bruguera-mico@mongodb.com about this, and we think we can and should copy the approach taken in https://jira.mongodb.org/browse/SPM-4042 for a related problem. Namely, have the first router role participating in an operation resolve all feature flags, and pass the resolved values across the network to every participating node.

      In this approach, there is no room for races where nodes get different answers about a flag. It also solves the problems highlighted above.
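
      A minimal sketch of that idea, assuming a simplified request envelope: the `Request` and `Node` classes, their methods, and the flag name are all hypothetical and do not reflect the actual wire protocol or the SPM-4042 design. The point is only that participants read the values attached to the request instead of consulting their own local state.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Request:
    command: dict
    # Flag values resolved by the first router role; None if nobody resolved them yet.
    resolved_flags: Optional[Dict[str, bool]] = None


class Node:
    def __init__(self, name: str, local_flags: Dict[str, bool]):
        self.name = name
        self.local_flags = local_flags  # this node's own view of each flag

    def originate(self, command: dict) -> Request:
        # First router role in the operation: resolve every flag exactly once.
        return Request(command, dict(self.local_flags))

    def forward(self, request: Request, command: dict) -> Request:
        # A participant acting as a router passes the resolved values along unchanged.
        return Request(command, request.resolved_flags)

    def flag_enabled(self, request: Request, flag: str) -> bool:
        # Prefer the values carried on the request over the local view.
        if request.resolved_flags is not None:
            return request.resolved_flags.get(flag, False)
        return self.local_flags.get(flag, False)


router = Node("router", {"featureFlagSortKeyMetadata": False})
shard_a = Node("shardA", {"featureFlagSortKeyMetadata": True})
shard_b = Node("shardB", {"featureFlagSortKeyMetadata": False})

req = router.originate({"aggregate": "coll"})
sub_req = shard_a.forward(req, {"aggregate": "other"})
```

      Here `shard_a` would locally consider the flag enabled, but both shards answer according to the router's resolved value, so no two participants in the same operation can disagree.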

      Sadly, there is one last edge case: an internal operation can originate from a shard rather than from a router. When the operation originates from a shard, we cannot apply this logic:
      > the router is the last node upgraded in the upgrade cycle, the router can send new options with the confidence that all other nodes will understand it.

      To solve this, our best answer is: use the FCV to resolve feature flags when it is available (you are a data-bearing node), and default to latest if it is not available (you are a mongos router, which does not track FCV).
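
      The rule above can be sketched as a single function. This is an illustration under stated assumptions: `resolve_flag`, the tuple encoding of versions, and the idea that each flag carries a minimum FCV at which it becomes enabled are simplifications chosen for this example, not the server's implementation.

```python
from typing import Optional, Tuple

Version = Tuple[int, int]  # hypothetical encoding, e.g. (8, 1) for "8.1"


def resolve_flag(required_fcv: Version, local_fcv: Optional[Version]) -> bool:
    """Resolve one FCV-gated flag on the node that originates an operation.

    local_fcv is None on a node that does not track FCV (a mongos router).
    """
    if local_fcv is None:
        # Router: no FCV available, so default to latest (flag enabled).
        return True
    # Data-bearing node: enable the flag only if the FCV is high enough.
    return local_fcv >= required_fcv
```

      For example, `resolve_flag((8, 1), None)` models a mongos router and resolves to enabled, while `resolve_flag((8, 1), (8, 0))` models a shard whose FCV has not yet been raised and resolves to disabled.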

      (Final note: I have not discussed the other motivation for FCV gating of such changes: catalog persistence of language features via views or collection validators. I won't discuss it here, but there are other ideas to improve, or at least better document, that edge case.)

            Assignee:
            Unassigned
            Reporter:
            Charlie Swanson