[SERVER-17019] HA setup doesn't work if member totally and quickly disappears Created: 23/Jan/15  Updated: 25/Mar/15  Resolved: 25/Mar/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.6.6
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Kalle Varisvirta Assignee: Andy Schwerin
Resolution: Incomplete Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

We have a problem with our replica set. It runs on three virtual servers, and if any one of the mongod processes goes down, the set normally keeps working with the remaining members. However, if any of the servers disappears completely, i.e. stops responding to network traffic at all (its network interface is taken down, a firewall blocks all of its outgoing traffic, or the server is powered off suddenly), every query to the replica set takes an extra 15 seconds. Judging from the network traffic, this is due to TCP retransmits.

These extra 15 seconds on every query make our load balancer think all nodes are down, so it shuts off traffic to the whole setup.
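
The difference between a stopped mongod and a host that silently drops packets can be observed outside any driver. The following is a minimal sketch, not part of the original report: `probe_member` is a hypothetical helper name and the one-second timeout is an arbitrary assumption. It opens a plain TCP connection with a bounded timeout, so a refused connection (host up, mongod stopped) fails almost immediately, while a silent host fails only once the timeout expires instead of waiting for the operating system's retransmit schedule.

```python
import socket
import time


def probe_member(host, port, timeout_s=1.0):
    """Try one TCP connect; return (reachable, elapsed_seconds).

    Hypothetical helper for illustration only. It checks TCP
    reachability, not whether mongod is healthy.
    """
    start = time.monotonic()
    try:
        # create_connection gives up after timeout_s rather than waiting
        # for the kernel to exhaust its SYN retransmits.
        with socket.create_connection((host, port), timeout=timeout_s):
            reachable = True
    except OSError:
        # Covers both a timeout (silent host) and a refusal
        # (host up, nothing listening on the port).
        reachable = False
    return reachable, time.monotonic() - start
```

Calling `probe_member("db1.example.com", 27017)` for each member (the host name is a placeholder) and comparing the elapsed times should show whether the delay comes from the network layer or from the driver.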

Since connecting to the other replica set members with the mongo console works fine, we originally posted this as a bug in the node.js driver (https://jira.mongodb.org/browse/NODE-350), but we later tried the PHP driver and were able to reproduce similar (although not identical) behaviour.

We also reproduced this problem in our secondary setup in another data center, so it shouldn't be specific to one data center. Both might be running the same virtualization platform, though; we haven't looked into that yet.

Any ideas how to go forward with this?



 Comments   
Comment by Andy Schwerin [ 23/Jan/15 ]

Have you tried the experiment ck suggested on NODE-350?
