[SERVER-6037] RS members should report on heartbeat if they cannot reach the node hb-ing them Created: 07/Jun/12  Updated: 11/Jul/16  Resolved: 17/Dec/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.0.4
Fix Version/s: 2.3.2

Type: Bug Priority: Trivial - P5
Reporter: Mike Hobbs Assignee: Eric Milkie
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 10.04 in Amazon EC2


Issue Links:
Related
related to SERVER-1929 handle replica set flapping Closed
Participants:

 Description   

We have been testing replica set reliability under a few different failure scenarios. One scenario that failed is when we misconfigured network routing to a mongod primary. We blocked all inbound traffic to port 27017, but allowed it to continue making outbound connections. The replica set was a 3-node set where the primary (node A) had a higher priority than the other two (node B and node C).

What happened when we blocked port 27017 to node A is that node B assumed the primary role, as expected. However, node A then made an outbound connection to node B, and since it had a higher priority A told B to step down as primary, which it did. However, since neither B nor C could make a connection to node A, they both eventually voted that node B should become master again. A again connects to B and the whole process repeats indefinitely.

Not that this is at all a typical failure scenario, but I'm thinking that node A should not have been able to tell B to step down as primary in this situation.

Here are the relevant log entries from node A:
Mon Jun 4 15:23:29 [ReplicaSetMonitorWatcher] trying reconnect to graphdb-4-2.strcst.net:27017
Mon Jun 4 15:23:29 [ReplicaSetMonitorWatcher] reconnect graphdb-4-2.strcst.net:27017 ok
Mon Jun 4 15:23:31 [rsHealthPoll] replSet member graphdb-4-2.strcst.net:27017 is now in state PRIMARY
Mon Jun 4 15:23:32 [rsSync] replSet syncing to: graphdb-4-2.strcst.net:27017
Mon Jun 4 15:23:37 [rsMgr] stepping down graphdb-4-2.strcst.net:27017
Mon Jun 4 15:23:37 [rsSync] replSet syncThread: 10278 dbclient error communicating with server: graphdb-4-2.strcst.net:27017
Mon Jun 4 15:23:37 [rsHealthPoll] replSet member graphdb-4-2.strcst.net:27017 is now in state SECONDARY

And here are the corresponding log entries from node B:
Mon Jun 4 15:23:25 [initandlisten] connection accepted from 10.209.29.204:56081 #419426
Mon Jun 4 15:23:30 [rsMgr] replSet info electSelf 1
Mon Jun 4 15:23:30 [rsMgr] replSet PRIMARY
Mon Jun 4 15:23:37 [conn419426] replSet info stepping down as primary secs=1
Mon Jun 4 15:23:37 [conn419426] replSet relinquishing primary state
Mon Jun 4 15:23:37 [conn419426] replSet SECONDARY
Mon Jun 4 15:23:37 [conn419426] replSet closing client sockets after reqlinquishing primary



 Comments   
Comment by auto [ 08/Dec/12 ]

Author:

{u'date': u'2012-12-07T20:06:10Z', u'email': u'milkie@10gen.com', u'name': u'Eric Milkie'}

Message: SERVER-6037 always log missed heartbeats
Branch: master
https://github.com/mongodb/mongo/commit/16bec3dd4cffa6d9c30af70472bd058e3c9e2916

Comment by Daniel Pasette (Inactive) [ 21/Nov/12 ]

This was actually fixed as part of SERVER-1929, but the message is logged at level 1. We should fix this to log this condition always.

Commit: https://github.com/mongodb/mongo/commit/b35e7705df9c090fa86db8a2c1ca52437b9aeaf1

Comment by Kristina Chodorow (Inactive) [ 05/Sep/12 ]

This can be fixed as part of the flapping fix.

Comment by Kristina Chodorow (Inactive) [ 08/Jun/12 ]

I think the problem is with how B & C are handling this (there is no mechanism for them to tell A that A is unreachable). When A connects to B (say) and asks it for status, B should report that it thinks A is down. Shouldn't be too hard to fix, but it's too late in 2.2 dev cycle to make it in. Congratulations on finding a new edge case!

Generated at Thu Feb 08 03:10:35 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.