[SERVER-8148] Implement Phi Accrual Failure Detection for detecting Node Failure
Created: 11/Jan/13  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: Caleb Jones
Assignee: Backlog - Replication Team
Resolution: Unresolved
Votes: 0
Labels: elections
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Replication
Participants:

Description

I did some reading on how Cassandra does its internal failure checking. It implements a phi accrual failure detection algorithm, which adapts to dynamic network conditions better than a simple heartbeat check. It also produces a scalar suspicion value instead of a binary yes/no verdict, which allows tolerance levels to be configured.

See:
http://ddg.jaist.ac.jp/pub/HDY+04.pdf
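
To make the idea concrete, here is a minimal sketch of a phi accrual detector in Python. It assumes heartbeat inter-arrival times are roughly normally distributed, as in the paper linked above, and computes phi = -log10(P_later), where P_later is the estimated probability that the next heartbeat arrives later than the time already elapsed. The class name, the window size, and the min_std floor are hypothetical choices for illustration; this is not Cassandra's or MongoDB's implementation.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Illustrative phi accrual failure detector (normal-distribution variant)."""

    def __init__(self, window_size=100, min_std=0.1):
        # Sliding window of recent heartbeat inter-arrival times, in seconds.
        self.intervals = deque(maxlen=window_size)
        self.last_arrival = None
        # Assumed floor on the standard deviation, so that perfectly regular
        # heartbeats do not cause a division by zero.
        self.min_std = min_std

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Return the suspicion level at time `now`; higher means more suspect."""
        if not self.intervals:
            return 0.0  # no history yet, so no basis for suspicion
        n = len(self.intervals)
        mean = sum(self.intervals) / n
        variance = sum((x - mean) ** 2 for x in self.intervals) / n
        std = max(math.sqrt(variance), self.min_std)
        elapsed = now - self.last_arrival
        # Probability that a heartbeat arrives later than `elapsed`
        # under a normal distribution N(mean, std^2).
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        if p_later <= 0.0:
            return math.inf  # probability underflowed to zero
        return -math.log10(p_later)


if __name__ == "__main__":
    detector = PhiAccrualDetector()
    for t in range(11):  # heartbeats at t = 0, 1, ..., 10 seconds
        detector.heartbeat(float(t))
    threshold = 8.0  # illustrative tolerance level, chosen by the operator
    for now in (11.0, 11.5, 13.0):
        value = detector.phi(now)
        print(now, round(value, 2), "suspect" if value > threshold else "ok")
```

A caller compares the returned phi against a configurable threshold: a higher threshold tolerates longer delays before declaring a node failed, and a lower one reacts faster at the cost of more false positives. Because the mean and variance come from a sliding window of recent heartbeats, the same threshold adjusts itself as network conditions change.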

There are pros and cons (particularly regarding simplicity), but I'd be curious what you at 10gen think about whether it would be appropriate or useful to base your failure detection on this kind of protocol.

