Investigate how to best approach timeout mechanisms that protect against split-brain scenarios

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Catalog and Routing
    • 🟩 Routing and Topology

      While developing authoritative shards, we discovered a potential problem in the protocol that requires us to resolve a split-brain scenario.

      In the past we've avoided this by waiting until a known majority-available timestamp is available on the node. However, this approach has its own issues: the wait only completes once the split-brain scenario is resolved, so the node is unavailable until then.

      Another way we've handled this in the past is by adding a timeout mechanism that lets the caller decide whether to keep waiting or to abort and retry its operation on a different node. However, this has never been standardized, so each location that needs to address this has its own timeout setting, leading to a proliferation of server parameters.
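
      As a rough illustration of the timeout approach described above, the following minimal sketch waits for a node to reach a target timestamp and, on timeout, retries on the next node. All names here (`Node`, `wait_for_majority_timestamp`, `run_with_retry`, `MajorityWaitTimeout`) are hypothetical and do not correspond to any existing server code; polling is used only to keep the sketch short.

```python
import time


class MajorityWaitTimeout(Exception):
    """Raised when a node does not reach the target timestamp in time."""


class Node:
    """Hypothetical stand-in for a node; not a real server API."""

    def __init__(self, name, majority_timestamp):
        self.name = name
        self.majority_timestamp = majority_timestamp


def wait_for_majority_timestamp(node, target, timeout_s, poll_s=0.01):
    """Block until node.majority_timestamp >= target, or raise on timeout."""
    deadline = time.monotonic() + timeout_s
    while node.majority_timestamp < target:
        if time.monotonic() >= deadline:
            raise MajorityWaitTimeout(node.name)
        time.sleep(poll_s)


def run_with_retry(nodes, target, timeout_s, operation):
    """Try each node in turn; on timeout, abort and retry on the next node."""
    for node in nodes:
        try:
            wait_for_majority_timestamp(node, target, timeout_s)
        except MajorityWaitTimeout:
            continue  # the caller chose to give up on this node
        return operation(node)
    raise MajorityWaitTimeout("no node reached the target timestamp")
```

      In this sketch a node that is cut off from the majority never advances its timestamp, so the caller times out on it and moves on instead of blocking until the split-brain scenario is resolved.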

      Ideally, a single timeout would be used across the codebase to resolve the split-brain scenario.
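
      One possible shape for such a unified timeout is sketched below: a single shared setting that every call site consults instead of defining its own parameter. The names (`SplitBrainTimeouts`, `majority_wait_timeout_ms`, and the two call-site functions) and the default value are assumptions made for illustration, not existing server parameters or a recommended setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitBrainTimeouts:
    """Hypothetical single source of truth for the majority-wait timeout."""

    majority_wait_timeout_ms: int = 30_000  # placeholder, not a recommendation


# One shared instance instead of a separate server parameter per call site.
SPLIT_BRAIN_TIMEOUTS = SplitBrainTimeouts()


def deadline_after(now_ms, timeouts=SPLIT_BRAIN_TIMEOUTS):
    """Compute the deadline every call site would use for its majority wait."""
    return now_ms + timeouts.majority_wait_timeout_ms


# Two hypothetical call sites that might previously have had their own settings.
def routing_refresh_deadline(now_ms):
    return deadline_after(now_ms)


def catalog_read_deadline(now_ms):
    return deadline_after(now_ms)
```

      Because both call sites go through `deadline_after`, changing the one setting changes the behavior everywhere at once.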

      This ticket is about finding the locations that use such timeouts to resolve split-brain scenarios, or any equivalent situation that waits for a node to reach a majority timestamp, and deciding how to unify them into a single place.

            Assignee:
            Unassigned
            Reporter:
            Jordi Olivares Provencio