- Type: Task
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Catalog and Routing
- 🟩 Routing and Topology
While developing authoritative shards, we discovered a potential problem with the protocol that requires us to resolve a split-brain scenario.
In the past we have avoided this by waiting until a known majority-available timestamp becomes available on the node. However, this approach has its own issues: the wait only completes once the split-brain scenario is resolved, so the node is unavailable until then.
Another approach we have used is a timeout mechanism that lets the caller decide whether to keep waiting or to abort and retry its operation on a different node. However, this has never been standardized, so each location that needs to handle this has its own timeout setting, leading to a proliferation of server parameters.
Ideally we would want to have a single timeout that is used across the codebase in order to resolve the split-brain scenario.
This ticket is about finding the locations that use such timeouts, whether to resolve split-brain scenarios or to handle any equivalent situation that waits for a given node to reach a majority timestamp, and deciding how to unify them into a single place.
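To make the target shape concrete, here is a minimal Python sketch of what a unified wait could look like. It is an illustration only, not the server's actual code: `MAJORITY_WAIT_TIMEOUT_SECS`, `MajorityWaitTimeout`, `wait_for_majority_timestamp`, and `pick_node` are hypothetical names, timestamps are modeled as plain integers, and polling stands in for whatever notification mechanism the real implementation would use.

```python
import time
from typing import Callable, Dict

# Hypothetical single, shared timeout (seconds). It stands in for the one
# setting that would replace the per-location server parameters.
MAJORITY_WAIT_TIMEOUT_SECS = 5.0


class MajorityWaitTimeout(Exception):
    """The node did not reach the target majority timestamp in time."""


def wait_for_majority_timestamp(
    get_majority_timestamp: Callable[[], int],
    target: int,
    timeout_secs: float = MAJORITY_WAIT_TIMEOUT_SECS,
    poll_interval_secs: float = 0.01,
) -> int:
    """Block until the node's majority timestamp is >= target, or time out.

    `get_majority_timestamp` is an assumed accessor for the node's current
    majority-available timestamp.
    """
    deadline = time.monotonic() + timeout_secs
    while True:
        current = get_majority_timestamp()
        if current >= target:
            return current
        if time.monotonic() >= deadline:
            raise MajorityWaitTimeout(
                f"majority timestamp {current} is behind target {target}"
            )
        time.sleep(poll_interval_secs)


def pick_node(
    nodes: Dict[str, Callable[[], int]],
    target: int,
    timeout_secs: float = MAJORITY_WAIT_TIMEOUT_SECS,
) -> str:
    """Try each node in order; on timeout, abort and retry on the next one."""
    for name, get_ts in nodes.items():
        try:
            wait_for_majority_timestamp(get_ts, target, timeout_secs)
            return name
        except MajorityWaitTimeout:
            continue  # caller's decision here: give up on this node
    raise MajorityWaitTimeout("no node reached the target majority timestamp")
```

Every call site would go through the same helper and the same default, so the decision to wait longer or retry elsewhere stays with the caller while the timeout itself lives in one place.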