[SERVER-10982] Replica set may not fail over when primary is not responsive
Created: 01/Oct/13 | Updated: 06/Dec/22 | Resolved: 03/Jan/20
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.4.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | David Gubler | Assignee: | Backlog - Replication Team |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | elections |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Linux mongo0 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1+deb7u1 x86_64 GNU/Linux |
| Assigned Teams: | Replication |
| Operating System: | Linux |
| Steps To Reproduce: | Not tested under controlled circumstances: set up a replica set, store the MongoDB data on a separate device on the primary, then make that device unresponsive (but keep it mounted). |
| Description |
|
We had a hardware issue with our MongoDB replica set primary. The exact reason is still unknown, but it appears that I/O commands to its SSD (which holds all MongoDB data but not the operating system or the MongoDB installation itself) did not return.

dmesg output (full output is attached):

MongoDB's log file does not show anything out of the ordinary.

Result:

Now, I'm not even sure if this is a valid bug report, but I think there is some room for improvement in the replica set's heartbeat code. I can imagine various situations in which a machine is responding to heartbeats but not actually working, e.g. "swap to death" situations, all sorts of I/O issues (e.g. an NFS/iSCSI/whatever mounted file system with network problems), or hardware issues similar to the ones we had. |
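To illustrate the kind of improvement the report hints at, here is a minimal sketch of a heartbeat that also reflects storage health. It is not MongoDB's heartbeat code. The class `StorageWatchdog`, the function `heartbeat_reply`, the probe file name, and the reply shape are all hypothetical names chosen for this example. The idea: a background thread repeatedly writes and fsyncs a small file on the data device, and the heartbeat reports healthy only if such a probe completed recently. If I/O to the device never returns, the probe thread blocks, the timestamp goes stale, and the heartbeat starts reporting a problem.

```python
import os
import threading
import time


class StorageWatchdog:
    """Tracks when a small write + fsync to the data directory last completed.

    Hypothetical helper for illustration only; not part of MongoDB.
    """

    def __init__(self, data_dir, interval=1.0, max_stale=5.0, clock=time.monotonic):
        self.probe_path = os.path.join(data_dir, ".liveness_probe")
        self.interval = interval      # seconds between probes
        self.max_stale = max_stale    # seconds before storage counts as stuck
        self.clock = clock
        self._last_ok = clock()
        self._lock = threading.Lock()
        self._stop = threading.Event()

    def probe_once(self):
        # This blocks if the device stops answering. That is intended:
        # a blocked probe leaves _last_ok stale.
        fd = os.open(self.probe_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            os.write(fd, b"ping")
            os.fsync(fd)
        finally:
            os.close(fd)
        with self._lock:
            self._last_ok = self.clock()

    def run(self):
        while not self._stop.is_set():
            try:
                self.probe_once()
            except OSError:
                pass  # an I/O error also leaves _last_ok stale
            self._stop.wait(self.interval)

    def start(self):
        thread = threading.Thread(target=self.run, daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop.set()

    def healthy(self):
        with self._lock:
            return self.clock() - self._last_ok <= self.max_stale


def heartbeat_reply(watchdog):
    # Hypothetical reply shape, not MongoDB's heartbeat format.
    return {"ok": 1 if watchdog.healthy() else 0}
```

The probe runs in its own thread so that a hung fsync cannot block the code that answers heartbeats. The thresholds are arbitrary assumptions, and this sketch does not cover the other cases mentioned above, such as heavy swapping.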
| Comments |
| Comment by Judah Schvimer [ 03/Jan/20 ] |
|
This will be addressed in the "Liveness Monitoring in Replica Sets" project (PM-1039). I'm leaving it in that project, but closing as "Incomplete" so we see this and consider this case when designing it. |
| Comment by Daniel Pasette (Inactive) [ 07/Oct/13 ] |
|
Thanks for the report. We'll have to consider how to detect this scenario in the general case. |