[SERVER-10982] Replica set may not fail over when primary is not responsive Created: 01/Oct/13  Updated: 06/Dec/22  Resolved: 03/Jan/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: David Gubler Assignee: Backlog - Replication Team
Resolution: Incomplete Votes: 0
Labels: elections
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux mongo0 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1+deb7u1 x86_64 GNU/Linux


Attachments: Text File dmesg.txt    
Issue Links:
Related
is related to SERVER-6028 Too many open connections kills prima... Closed
Assigned Teams:
Replication
Operating System: Linux
Steps To Reproduce:

Not tested under controlled circumstances: Set up a replica set and store the MongoDB data on a separate device on the primary; then make that device unresponsive (but keep it mounted).

Participants:

 Description   

We had a hardware issue with our Mongo replica set primary. The exact reason is still unknown, but it appears that I/O commands to its SSD (which holds all MongoDB data but not the operating system or the MongoDB installation itself) did not return.

dmesg output (full output is attached):
[2195482.937229] INFO: task mongod:2731 blocked for more than 120 seconds.
[2195482.937416] mongod D ffff88063fc13780 0 2731 1 0x00000000
[2195482.937421] ffff88033147d1e0 0000000000000086 ffff880600000000 ffff880333239590
[2195482.937426] 0000000000013780 ffff8803324adfd8 ffff8803324adfd8 ffff88033147d1e0
[2195482.937432] ffffffff8101360a 00000001810660a1 ffff8803316822f0 ffff88063fc13fd0
[.....]

MongoDB's log file does not show anything out of the ordinary.

Result:
The replica set's heartbeat thought that our primary was fine, but it was not actually doing any work (all it did was wait for a broken disk). Thus connections piled up and our entire application stalled. As soon as I manually shut down MongoDB on that machine, the failover happened as it should (although the Java driver didn't recover properly after that, but that's a separate issue).

Now, I'm not even sure if this is a valid bug report, but I think there is some room for improvement in the replica set's heartbeat code. I can imagine various situations in which a machine is responding to heartbeats but not actually working, e.g. "swap to death" situations, all sorts of I/O issues (e.g. an NFS/iSCSI/whatever mounted file system with network problems), or hardware issues similar to the one we had.
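
To illustrate the kind of check meant here, the following is a minimal sketch (not MongoDB's heartbeat code, and not a proposal for how the server should implement it) of a liveness probe that only reports "healthy" if a small write to the data directory reaches the device within a deadline. The names probe_write and storage_is_responsive, the 5-second default timeout, and the idea of injecting the probe as a parameter are all assumptions made for illustration.

```python
import os
import tempfile
import threading


def probe_write(directory: str) -> None:
    """Write and fsync a small temporary file in `directory`, then remove it."""
    fd, path = tempfile.mkstemp(prefix="liveness-", dir=directory)
    try:
        os.write(fd, b"ping")
        os.fsync(fd)  # ask the OS to push the data to the device
    finally:
        os.close(fd)
        os.unlink(path)


def storage_is_responsive(directory: str, timeout_s: float = 5.0,
                          probe=probe_write) -> bool:
    """Return True only if probe(directory) finishes without an OSError
    within timeout_s seconds (hypothetical helper, for illustration)."""
    result = {"ok": False}

    def run() -> None:
        try:
            probe(directory)
            result["ok"] = True
        except OSError:
            result["ok"] = False

    # The probe runs in a daemon thread so that a call stuck on a hung
    # device does not block the caller: we simply stop waiting for it.
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return result["ok"] and not worker.is_alive()
```

A node could run such a probe periodically and stop reporting itself as healthy (or step down) when it returns False. Note that a thread blocked on a hung device, like the mongod task in the dmesg output above, cannot be cancelled; the sketch only stops waiting for it.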



 Comments   
Comment by Judah Schvimer [ 03/Jan/20 ]

This will be addressed in the "Liveness Monitoring in Replica Sets" project (PM-1039). I'm leaving it in that project, but closing as "Incomplete" so we see this and consider this case when designing it.

Comment by Daniel Pasette (Inactive) [ 07/Oct/13 ]

Thanks for the report. We'll have to consider how to detect this scenario in the general case.
