[SERVER-6037] RS members should report on heartbeat if they cannot reach the node hb-ing them Created: 07/Jun/12 Updated: 11/Jul/16 Resolved: 17/Dec/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.0.4 |
| Fix Version/s: | 2.3.2 |
| Type: | Bug | Priority: | Trivial - P5 |
| Reporter: | Mike Hobbs | Assignee: | Eric Milkie |
| Resolution: | Done | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Ubuntu 10.04 in Amazon EC2 |
||
| Description |
|
We have been testing replica set reliability under a few different failure scenarios. One scenario that failed was a misconfiguration of network routing to a mongod primary: we blocked all inbound traffic to port 27017 on that node, but still allowed it to make outbound connections. The replica set was a 3-node set in which the primary (node A) had a higher priority than the other two (node B and node C).

When we blocked inbound port 27017 on node A, node B assumed the primary role, as expected. However, node A then made an outbound connection to node B and, since A had the higher priority, told B to step down as primary, which B did. Since neither B nor C could make a connection to node A, they both eventually voted that node B should become primary again. A then connected to B again, and the whole process repeated indefinitely.

This is not at all a typical failure scenario, but I think node A should not have been able to tell B to step down as primary in this situation.

Here are the relevant log entries from node A:

And here are the corresponding log entries from node B: |
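The flapping loop described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not MongoDB code: the names `PRIORITY`, `BLOCKED_INBOUND`, `can_reach`, and `simulate` are hypothetical, the priority values are made up, and elections are reduced to "B and C both fail to reach A, so they elect B".

```python
# Toy model of the reported scenario; all names and values are illustrative.
PRIORITY = {"A": 2, "B": 1, "C": 1}   # A has the higher priority (values assumed)
BLOCKED_INBOUND = {"A"}               # inbound traffic to A is blocked


def can_reach(src, dst):
    """True when src can open a connection to dst."""
    return src != dst and dst not in BLOCKED_INBOUND


def simulate(steps):
    """Return the sequence of role changes over a number of steps."""
    primary = "A"
    events = []
    for _ in range(steps):
        if primary != "B":
            # B and C cannot reach A, so the two of them (a majority of 3) elect B.
            if not can_reach("B", "A") and not can_reach("C", "A"):
                primary = "B"
                events.append("B elected primary")
        elif can_reach("A", "B") and PRIORITY["A"] > PRIORITY["B"]:
            # A can still connect out to B and, having higher priority,
            # tells B to step down, leaving the set without a primary.
            primary = None
            events.append("A tells B to step down")
    return events


for event in simulate(4):
    print(event)
```

In this model the two events alternate for as long as the simulation runs, which mirrors the indefinite repetition reported in the description.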
| Comments |
| Comment by auto [ 08/Dec/12 ] |
|
Author: Eric Milkie <milkie@10gen.com> (2012-12-07T20:06:10Z) Message: |
| Comment by Daniel Pasette (Inactive) [ 21/Nov/12 ] |
|
This was actually fixed as part of Commit: https://github.com/mongodb/mongo/commit/b35e7705df9c090fa86db8a2c1ca52437b9aeaf1 |
| Comment by Kristina Chodorow (Inactive) [ 05/Sep/12 ] |
|
This can be fixed as part of the flapping fix. |
| Comment by Kristina Chodorow (Inactive) [ 08/Jun/12 ] |
|
I think the problem is with how B & C are handling this (there is no mechanism for them to tell A that A is unreachable). When A connects to B (say) and asks it for status, B should report that it thinks A is down. It shouldn't be too hard to fix, but it's too late in the 2.2 dev cycle to make it in. Congratulations on finding a new edge case! |
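The idea in this comment, that B should tell A it considers A down, might look roughly like the following sketch. It is an assumption-laden illustration, not the change made in the linked commit: `HeartbeatReply`, its `sees_requester_up` field, `heartbeat`, and `should_request_stepdown` are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatReply:
    responder: str
    is_primary: bool
    # Hypothetical field: whether the responder can reach the node
    # that is heartbeating it.
    sees_requester_up: bool


def heartbeat(requester, responder, primary, blocked_inbound):
    """Build the reply the responder would send back to the requester."""
    return HeartbeatReply(
        responder=responder,
        is_primary=(responder == primary),
        sees_requester_up=(requester not in blocked_inbound),
    )


def should_request_stepdown(my_priority, their_priority, reply):
    """Only ask a lower-priority primary to step down if it can reach us."""
    return (
        reply.is_primary
        and my_priority > their_priority
        and reply.sees_requester_up
    )


# A heartbeats B while inbound traffic to A is blocked: no stepdown request.
reply = heartbeat("A", "B", primary="B", blocked_inbound={"A"})
print(should_request_stepdown(2, 1, reply))
```

With the extra field, A would leave B as primary while A is unreachable, and would request a stepdown only once B reports that it can see A again.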