[SERVER-2975] Replica set master failure detection Created: 21/Apr/11  Updated: 29/May/12  Resolved: 02/May/11

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 1.8.1
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Mathieu Poumeyrol Assignee: Kristina Chodorow (Inactive)
Resolution: Duplicate Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

linux, ec2, ebs


Issue Links:
Duplicate
duplicates SERVER-3014 DBClientConnection socket timeout doe... Closed
Participants:

 Description   

During the massive EC2 failure earlier this morning, the master of one of our replica sets was impacted: it stopped responding to the clients still connected, without closing their connections. The other members of the set did not pick up the failure, and it was not possible to send a "stepdown" command to it. (As the computer was not answering ssh, we did a remote reboot to force the replica set onto its two other feet.)

  • The replica set failure detection should be less optimistic
  • It should be possible to trigger an election from a secondary in such a situation
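
For illustration only, here is a minimal Python sketch of the kind of "less optimistic" check the first bullet asks for: a peer that accepts a TCP connection but never answers within a deadline is treated as down. This is not MongoDB's heartbeat code. The `probe` and `make_hung_listener` names, the `ping` payload, and the timeout values are all hypothetical, chosen only to reproduce the symptom described above (connections stay open, nothing comes back).

```python
import socket
import threading


def probe(host, port, timeout_s=2.0, payload=b"ping\n"):
    """Return True only if the peer sends at least one byte back in time.

    Hypothetical health check: a connect that succeeds is NOT treated as
    proof of life; the peer must actually answer within timeout_s.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.settimeout(timeout_s)      # bound the wait on send/recv too
            sock.sendall(payload)
            return len(sock.recv(1)) > 0    # b"" means the peer closed
    except OSError:                         # covers timeouts and refusals
        return False


def make_hung_listener():
    """Simulate a hung master: the kernel completes TCP handshakes via the
    listen backlog, but nothing ever accepts the connection or replies."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))              # port 0: let the OS pick one
    srv.listen(1)
    return srv


if __name__ == "__main__":
    hung = make_hung_listener()
    host, port = hung.getsockname()
    # The connect succeeds, yet the probe reports the peer as unhealthy.
    print("hung peer healthy?", probe(host, port, timeout_s=0.5))
    hung.close()
```

Without the read timeout, the `recv` call would block indefinitely on such a peer, which matches the behaviour reported here.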


 Comments   
Comment by Kristina Chodorow (Inactive) [ 02/May/11 ]

Both mongos and mongod use the code that was fixed.

Comment by Jonathan Wollman [ 02/May/11 ]

Was this fixed in MongoS or in the server?

Comment by Kristina Chodorow (Inactive) [ 02/May/11 ]

Fixed and backported to 1.8.2.

Comment by Mathieu Poumeyrol [ 28/Apr/11 ]

+1 for a backport to 1.8, if that's possible. 1.8 is just a few weeks old, and 2.0 seems awfully far away for something with such a concrete availability impact.

Comment by Jonathan Wollman [ 28/Apr/11 ]

Are you guys considering this as a patch to the 1.8 release?

Thx

Comment by Kristina Chodorow (Inactive) [ 27/Apr/11 ]

That's fine, I've figured out what the bug was and I'll be committing the fix once 1.9.0 is out.

Comment by Mathieu Poumeyrol [ 26/Apr/11 ]

No luck in rescuing logs from the failing master. They were rotated away before AWS could restore our access to the computer.

Comment by Mathieu Poumeyrol [ 21/Apr/11 ]

Sure, but I'd prefer not to share my IP addresses and such with everyone. I'm "kali" on freenode; I already /msg'ed a link to kchodorow_.

Comment by Eliot Horowitz (Inactive) [ 21/Apr/11 ]

Can you send the logs from one of the secondaries?

Generated at Thu Feb 08 03:01:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.