[SERVER-5917] More advanced availability checks required for Arbiter
Created: 24/May/12
Updated: 06/Dec/22

Status: Open
Project: Core Server
Component/s: Replication
Affects Version/s: 2.0.5
Fix Version/s: Needs Further Definition

Type: Improvement
Priority: Major - P3
Reporter: Leon Mergen
Assignee: Backlog - Replication Team
Resolution: Unresolved
Votes: 1
Labels: majority
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Replication
Participants:

 Description   

We have a replica set with 2 nodes + 1 arbiter. The SAN storage in our primary's DC partially went down, resulting in timeouts on I/O operations. The secondary node detected that the primary was unreachable, but the arbiter still marked the primary as available because its TCP connection to the primary was still active. Failover to the secondary node did not happen, causing downtime.

The arbiter should periodically query the different nodes with an I/O operation to detect whether the underlying I/O subsystem is still working.
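To make the request concrete, here is a minimal sketch of what such an I/O probe could look like on the node being checked. It is not MongoDB code and does not reflect how the server's heartbeat works; the function name `probe_io`, the `timeout_s` parameter, and the idea of writing a small temporary file are all assumptions made for illustration. The probe runs the write and fsync in a daemon thread so that a hung storage layer shows up as a timeout rather than blocking the caller.

```python
import os
import tempfile
import threading


def probe_io(directory, timeout_s=5.0):
    """Hypothetical probe: True if a small write + fsync in `directory`
    completes within `timeout_s` seconds, False on error or timeout."""
    result = {"ok": False}

    def _write_and_sync():
        try:
            fd, path = tempfile.mkstemp(prefix=".io_probe_", dir=directory)
            try:
                os.write(fd, b"probe")
                os.fsync(fd)  # force the write down to the storage device
                result["ok"] = True
            finally:
                os.close(fd)
                os.unlink(path)
        except OSError:
            result["ok"] = False

    # Daemon thread: if the I/O call hangs, the caller is not blocked
    # and the process can still exit.
    worker = threading.Thread(target=_write_and_sync, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return (not worker.is_alive()) and result["ok"]
```

A node could report the result of a check like this alongside its normal heartbeat reply, so that a member with a live TCP connection but stalled storage is not treated as healthy.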



 Comments   
Comment by Eric Milkie [ 23/Feb/15 ]

Since databases and journals can reside on different devices on one node, it's not clear how to figure out which devices are relevant.
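One possible starting point, offered only as a sketch: resolve each configured path (for example the data directory and the journal directory) to the device that backs it, and treat each distinct device as something to check. The function name `devices_for_paths` is hypothetical, and the example assumes the relevant paths are already known from configuration, which is the open question raised above.

```python
import os


def devices_for_paths(paths):
    """Hypothetical helper: group paths by the device ID that backs them,
    using the st_dev field reported by os.stat()."""
    by_device = {}
    for path in paths:
        device_id = os.stat(path).st_dev
        by_device.setdefault(device_id, []).append(path)
    return by_device
```

Paths that share a device ID would only need one check between them; paths on different devices would each need their own.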

Generated at Thu Feb 08 03:10:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.