Pave the path for accelerating configTime wait on readable standby clusters

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Catalog and Routing

      The current design for handling uncomparable versions has the node issue a noop write (following the same pattern as afterClusterTime) to accelerate the configTime wait.

      This noop write does not work on readable standby clusters, where there is no primary in the replica set. In that scenario, the secondary node's configTime wait will still eventually be satisfied, because the NoopWriter and heartbeats on the active cluster gossip the vector clock, but we currently rely on the active cluster to make progress.

      This ticket is not about fully solving the unavailability on readable standby clusters. The goal is to evolve the protocol so that we have a clean signal that we are in this scenario. That signal would let us implement the real fix entirely on the router side in the future: only a MongoS change, with no MongoD upgrade or backport.

       

      Proposed protocol change

      1. The router role attaches a retry counter to the request (number of times the router loop has retried).
      2. On the shard side, when two uncomparable versions are detected, we attempt the noop write as today.
      3. If the noop write fails (i.e. we are likely on a readable standby cluster), behavior depends on the retry counter:
        • numRetries == 0: fail the request with a new StaleConfig-style exception. The router catches it, increments the counter, and for now simply retries the operation.
        • numRetries > 0: unconditionally wait for configTime (current fallback behavior).
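
The proposed flow can be sketched roughly as follows. This is a toy Python model, not the server implementation: the class and function names (Shard, NoopWriteFailed, UncomparableVersionsRetry, route) are hypothetical, and the returned strings only label which path was taken. It assumes the noop write fails exactly when the replica set has no primary.

```python
from dataclasses import dataclass
from typing import Tuple


class NoopWriteFailed(Exception):
    """Hypothetical: the noop write could not be performed (e.g. no primary)."""


class UncomparableVersionsRetry(Exception):
    """Hypothetical stand-in for the new StaleConfig-style exception."""


@dataclass
class Shard:
    has_primary: bool                      # False models a readable standby cluster
    waited_for_config_time: bool = False   # records that the fallback wait ran

    def noop_write(self) -> None:
        # Assumption: the noop write fails when there is no primary.
        if not self.has_primary:
            raise NoopWriteFailed("no primary in the replica set")

    def wait_for_config_time(self) -> None:
        # Placeholder for the unconditional configTime wait.
        self.waited_for_config_time = True

    def on_uncomparable_versions(self, num_retries: int) -> str:
        try:
            self.noop_write()              # step 2: attempt the noop write as today
            return "noop-write"
        except NoopWriteFailed:
            if num_retries == 0:           # step 3, first attempt: signal the router
                raise UncomparableVersionsRetry()
            self.wait_for_config_time()    # step 3, retry: current fallback behavior
            return "config-time-wait"


def route(shard: Shard) -> Tuple[str, int]:
    """Router loop: attach the retry counter and retry on the new exception."""
    num_retries = 0                        # step 1: counter sent with each request
    while True:
        try:
            return shard.on_uncomparable_versions(num_retries), num_retries
        except UncomparableVersionsRetry:
            num_retries += 1               # for now, simply retry the operation
```

In this model, a shard with a primary takes the noop-write path on the first attempt, while a standby shard raises once and then falls back to the configTime wait on the retry. The except branch in route is where a future router-only fix could replace the plain retry.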

            Assignee:
            Unassigned
            Reporter:
            Pol Pinol