[SERVER-26216] ReplicaSet servers able to make a putsch due to bad network timeout parameters Created: 21/Sep/16  Updated: 07/Apr/17  Resolved: 23/Sep/16

Status: Closed
Project: Core Server
Component/s: Networking, Replication
Affects Version/s: 3.2.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matthieu Rigal Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

Take a running replica set with 5 members split across two networks: 3 nodes (including the arbiter) on one side and 2 nodes on the other. Cut the network between them for a while. Then re-establish it while delaying network traffic on the non-arbiter machines (especially the primary); one of the other nodes will become PRIMARY even though the arbiter can still communicate with the PRIMARY.

Participants:

 Description   

Hi guys,

We recently had some serious trouble after a network failure, caused by the way the mongo replication process handles network timeouts.

The config is the following: 3 replica sets of 5 members each, with two bare-metal servers plus 1 arbiter in one datacenter and two other bare-metal servers in another datacenter.

At some point, we had a connectivity problem between the datacenters, for a couple of hours. We expected everything to get back to normal later on, with a sync of the oplog.

Unfortunately, in one of the three replica sets the two disconnected servers managed to win a majority, and a big rollback started.

In datacenter A, all three nodes were running smoothly. In datacenter B, the two nodes were able to see each other but lost the connection to the primary. One of the nodes then tried to call an election, which it couldn't win because it had no majority.

Unfortunately, when the connection was established again, the node first saw the arbiter and only a few milliseconds later the other members. But where it should have reset the 10s election timeout at the moment a majority became visible again, it instead started an election within those few milliseconds, without waiting to connect to the other members. This node got 2 votes and was then able to become primary, even though it was several hours of data behind... -> Big Rollback

I see two possible fixes (maybe both good):

  • Make sure to reset the heartbeat/election timeout to 10s when an election becomes possible again (majority of nodes visible); see the config sketch after this list for the settings involved
  • Prevent the arbiter from voting for another node while it can still see a PRIMARY
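
For reference, here is a minimal mongo shell sketch of the timeout settings involved, assuming protocolVersion 1 (the 60000 value is only an illustration, not a recommendation from this ticket):

// Inspect the current election/heartbeat timeout settings
cfg = rs.conf()
printjson(cfg.settings)                       // electionTimeoutMillis (pv1), heartbeatTimeoutSecs (pv0)
// Example only: raise the election timeout so a freshly reconnected node
// waits longer before calling an election
cfg.settings = cfg.settings || {}
cfg.settings.electionTimeoutMillis = 60000    // default is 10000, i.e. the 10s mentioned above
rs.reconfig(cfg)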

Here are the logs from this node:

 
2016-09-10T14:17:08.511+0200 I NETWORK  [conn297067] end connection 10.0.0.32:57132 (49 connections now open)
2016-09-10T14:17:22.788+0200 I REPL     [ReplicationExecutor] could not find member to sync from
2016-09-10T14:17:23.508+0200 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2016-09-10T14:17:23.508+0200 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2016-09-10T14:17:24.267+0200 I REPL     [ReplicationExecutor] Error in heartbeat request to mongo-rs0-2:27017; ExceededTimeLimit: Operation timed out
2016-09-10T14:17:24.550+0200 I REPL     [ReplicationExecutor] Error in heartbeat request to mongo-rs0-arb:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
2016-09-10T14:17:24.802+0200 I REPL     [ReplicationExecutor] Error in heartbeat request to mongo-rs0-1:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
2016-09-10T14:17:33.508+0200 I REPL     [ReplicationExecutor] VoteRequester: Got failed response from mongo-rs0-arb:27017: ExceededTimeLimit: Couldn't get a connection within the time limit
2016-09-10T14:17:33.508+0200 I REPL     [ReplicationExecutor] VoteRequester: Got failed response from mongo-rs0-1:27017: ExceededTimeLimit: Couldn't get a connection within the time limit
2016-09-10T14:17:33.508+0200 I REPL     [ReplicationExecutor] VoteRequester: Got failed response from mongo-rs0-2:27017: ExceededTimeLimit: Couldn't get a connection within the time limit
2016-09-10T14:17:33.508+0200 I REPL     [ReplicationExecutor] not running for primary, we received insufficient votes
2016-09-10T14:17:34.223+0200 I REPL     [ReplicationExecutor] Not starting an election, since we are not electable
 
### connection broken between the datacenters in the meantime
 
2016-09-10T20:52:52.925+0200 I REPL     [ReplicationExecutor] Not starting an election, since we are not electable
2016-09-10T20:52:59.574+0200 I REPL     [ReplicationExecutor] Error in heartbeat request to mongo-rs0-2:27017; ExceededTimeLimit: Operation timed out
2016-09-10T20:53:00.072+0200 I REPL     [ReplicationExecutor] Error in heartbeat request to mongo-rs0-1:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
2016-09-10T20:53:03.259+0200 I REPL     [ReplicationExecutor] Not starting an election, since we are not electable
2016-09-10T20:53:04.576+0200 I REPL     [ReplicationExecutor] Error in heartbeat request to mongo-rs0-2:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
2016-09-10T20:53:04.987+0200 I REPL     [ReplicationExecutor] Error in heartbeat request to mongo-rs0-arb:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
2016-09-10T20:53:10.072+0200 I REPL     [ReplicationExecutor] Error in heartbeat request to mongo-rs0-1:27017; ExceededTimeLimit: Operation timed out
2016-09-10T20:53:12.752+0200 I NETWORK  [initandlisten] connection accepted from 10.0.0.190:60202 #302216 (3 connections now open)
2016-09-10T20:53:12.770+0200 I NETWORK  [conn302216] end connection 10.0.0.190:60202 (2 connections now open)
2016-09-10T20:53:13.027+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mongo-rs0-arb:27017
2016-09-10T20:53:13.044+0200 I REPL     [ReplicationExecutor] Member mongo-rs0-arb:27017 is now in state ARBITER
2016-09-10T20:53:13.111+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mongo-rs0-1:27017
2016-09-10T20:53:13.128+0200 I REPL     [ReplicationExecutor] Member mongo-rs0-1:27017 is now in state SECONDARY
2016-09-10T20:53:13.236+0200 I NETWORK  [initandlisten] connection accepted from 10.0.0.32:49440 #302217 (3 connections now open)
2016-09-10T20:53:13.324+0200 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2016-09-10T20:53:13.324+0200 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2016-09-10T20:53:13.341+0200 I REPL     [ReplicationExecutor] VoteRequester: Got no vote from mongo-rs0-1:27017 because: candidate's data is staler than mine, resp:{ term: 39, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
2016-09-10T20:53:13.358+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mongo-rs0-2:27017
2016-09-10T20:53:13.364+0200 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
2016-09-10T20:53:13.375+0200 I REPL     [ReplicationExecutor] Member mongo-rs0-2:27017 is now in state PRIMARY
2016-09-10T20:53:13.381+0200 I REPL     [ReplicationExecutor] VoteRequester: Got no vote from mongo-rs0-1:27017 because: candidate's data is staler than mine, resp:{ term: 40, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
2016-09-10T20:53:13.381+0200 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 40
2016-09-10T20:53:13.381+0200 I REPL     [ReplicationExecutor] transition to PRIMARY
2016-09-10T20:53:13.411+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mongo-rs0-2:27017
2016-09-10T20:53:13.428+0200 I REPL     [ReplicationExecutor] Member mongo-rs0-2:27017 is now in state SECONDARY



 Comments   
Comment by Spencer Brody (Inactive) [ 07/Apr/17 ]

MRigal, FYI we wound up implementing the change to arbiter behavior you describe in SERVER-27125, which is available in 3.2.12 and 3.4.2.

Cheers,
-Spencer

Comment by Kelsey Schubert [ 23/Sep/16 ]

Hi MRigal,

Thank you for reporting this behavior. Unfortunately, we cannot implement the fixes you suggest, because we need to ensure that elections occur readily in the case of a failover.

When writes are not being replicated to the majority of data bearing nodes, there is a risk that rollbacks may occur. To avoid rollbacks in this specific situation, there are a number of configuration changes that you may want to consider, including changing your write concern or protocol version.
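
As an illustration of these knobs, a minimal mongo shell sketch (the collection name and values are hypothetical examples, not recommendations):

// Require acknowledgement from a majority of data-bearing nodes, so writes
// that could later be rolled back are never acknowledged to the application
db.orders.insertOne(
    { status: "pending" },
    { writeConcern: { w: "majority", wtimeout: 5000 } }
)
// Check, and if desired change, the replication protocol version
cfg = rs.conf()
print(cfg.protocolVersion)                    // may be undefined on a pv0 config
cfg.protocolVersion = 1
rs.reconfig(cfg)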

Looking forward, my recommendation to resolve this issue would be to replace your arbiter with a mongod. This would ensure that, in the case of a network partition between your data centers, writes would continue to be replicated to a majority of data bearing nodes.
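
A minimal sketch of that replacement in the mongo shell, assuming a new data-bearing mongod is already running on a hypothetical host mongo-rs0-3:

// Swap the arbiter for a data-bearing member (hostnames are illustrative)
rs.remove("mongo-rs0-arb:27017")
rs.add("mongo-rs0-3:27017")
rs.status()                                   // confirm the new member completes initial sync and reaches SECONDARY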

There is an open ticket, SERVER-14539, which would add an oplog to the arbiter. This new feature would provide an additional solution to this issue. I recommend reading through the description on this ticket for a more complete explanation of this feature and its impact.

For MongoDB-related support discussion please post on the mongodb-users group or Stack Overflow with the mongodb tag. A question like this involving more discussion would be best posted on the mongodb-users group.

Kind regards,
Thomas

Comment by Matthieu Rigal [ 21/Sep/16 ]

Please note that the network timeout reset problem might be better fixed at a more global level, as the same mechanism also led to the following problem: https://jira.mongodb.org/browse/SERVER-26215
