|
Discussions of the Liveness Monitoring project raised some concerns and ideas about future direction. schwerin has long believed that MongoDB heartbeats cannot be necessary for liveness, particularly because they are not part of the Raft spec. (Raft does have a concept of heartbeats, but it more closely resembles the way getMores return periodically even in the absence of new data than it resembles our heartbeats.) In fact, heartbeats contribute to the liveness monitoring problem because they cause nodes to stay up regardless of their ability to accept and propagate writes.
The current liveness timeout code is incredibly complicated, and removing heartbeats from liveness monitoring would be a complexity reduction.
The one concern with removing heartbeats from liveness monitoring is that we may start seeing more frequent elections due to the lower frequency of communication between nodes. The way we handle liveness in chained topologies in the absence of heartbeats may also need work.
We will add server status metrics that count how often we would have run for election had we removed heartbeats from liveness monitoring and relied exclusively on getMores and updatePosition commands. We can then compare this count to the count of actual elections to see whether we would have more elections than we do today.
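As a rough illustration of the counting idea, the sketch below keeps two election deadlines side by side: one reset by every message (today's behavior, assumed here for simplicity) and a "shadow" one reset only by getMores and updatePosition commands. It is not server code. The class, method, and field names are hypothetical, and the single fixed timeout is an assumption made to keep the example small.

```python
# Hypothetical sketch: names and timeout handling are illustrative assumptions,
# not the actual server implementation.
REPLICATION_MESSAGES = {"getMore", "updatePosition"}


class ShadowLivenessCounter:
    """Counts actual vs. hypothetical (no-heartbeat) election timeouts."""

    def __init__(self, timeout_secs, start=0.0):
        self.timeout_secs = timeout_secs
        # Deadline as it works today: any message, heartbeats included, resets it.
        self.actual_deadline = start + timeout_secs
        # Shadow deadline: only getMore / updatePosition reset it.
        self.shadow_deadline = start + timeout_secs
        self.actual_elections = 0
        self.shadow_elections = 0

    def check(self, now):
        """Record an election for each deadline that has passed, then re-arm it."""
        if now >= self.actual_deadline:
            self.actual_elections += 1
            self.actual_deadline = now + self.timeout_secs
        if now >= self.shadow_deadline:
            self.shadow_elections += 1
            self.shadow_deadline = now + self.timeout_secs

    def on_message(self, kind, now):
        """Process a message of the given kind received at time `now`."""
        self.check(now)
        if kind == "heartbeat" or kind in REPLICATION_MESSAGES:
            self.actual_deadline = now + self.timeout_secs
        if kind in REPLICATION_MESSAGES:
            self.shadow_deadline = now + self.timeout_secs


# Usage: heartbeats every 2 seconds, but only one getMore at t=5.
counter = ShadowLivenessCounter(timeout_secs=10)
events = [(2, "heartbeat"), (4, "heartbeat"), (5, "getMore")]
events += [(t, "heartbeat") for t in range(6, 18, 2)]
for now, kind in events:
    counter.on_message(kind, now)
print(counter.actual_elections, counter.shadow_elections)
```

Comparing `shadow_elections` against `actual_elections` over time gives the kind of signal described above: a shadow count well above the actual count would suggest that dropping heartbeats from liveness monitoring leads to more elections.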
Removing heartbeats from liveness monitoring would not, by itself, handle the case of "primary cannot accept writes but still returns empty getMores reliably". Making secondaries step up in this case would require further design, and this investigation would not help illuminate that.
|