[SERVER-12163] Replica Set failover time is more than 10 sec Created: 19/Dec/13  Updated: 06/Apr/23  Resolved: 19/Dec/13

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.6
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Amit Wankhede Assignee: Matt Dannenberg
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS


Issue Links:
Duplicate
duplicates SERVER-10225 Replica set failover speed improvement Closed
Related
Operating System: Linux
Steps To Reproduce:

Power off the primary DB VM and check the secondary DB logs.

Participants:

 Description   

We have a three-member replica set (primary, secondary, and arbiter) running on separate virtual machines.

To reproduce the issue, we power off the primary DB VM. We have observed that the failover time is more than 10 seconds.
What is the optimal failover time? Is there any tuning parameter to reduce it?

Thu Dec 19 02:15:55.994 [rsSyncNotifier] replset setting oplog notifier to sessionmgr01:27717
Thu Dec 19 02:19:09.121 [rsHealthPoll] DBClientCursor::init call() failed
Thu Dec 19 02:19:09.121 [rsHealthPoll] replSet info sessionmgr01:27717 is down (or slow to respond):
Thu Dec 19 02:19:09.121 [rsHealthPoll] replSet member sessionmgr01:27717 is now in state DOWN
Thu Dec 19 02:19:09.122 [rsMgr] replSet info electSelf 2
Thu Dec 19 02:19:17.124 [rsHealthPoll] replset info sessionmgr01:27717 heartbeat failed, retrying
Thu Dec 19 02:19:28.071 [rsBackgroundSync] Socket recv() timeout 192.168.92.59:27717
Thu Dec 19 02:19:28.071 [rsBackgroundSync] SocketException: remote: 192.168.92.59:27717 error: 9001 socket exception [RECV_TIMEOUT] server [192.168.92.59:27717]
Thu Dec 19 02:19:28.072 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: sessionmgr01:27717
Thu Dec 19 02:19:28.072 [rsSyncNotifier] Socket recv() timeout 192.168.92.59:27717
Thu Dec 19 02:19:28.072 [rsSyncNotifier] SocketException: remote: 192.168.92.59:27717 error: 9001 socket exception [RECV_TIMEOUT] server [192.168.92.59:27717]
Thu Dec 19 02:19:28.072 [rsSyncNotifier] replset tracking exception: exception: 10278 dbclient error communicating with server: sessionmgr01:27717
Thu Dec 19 02:19:28.072 [rsMgr] replSet PRIMARY
Thu Dec 19 02:19:29.127 [rsHealthPoll] replset info sessionmgr01:27717 heartbeat failed, retrying
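
The interval between the DOWN notice and the PRIMARY transition in the excerpt above can be measured with a short script. This is only a sketch: the helper names `timestamp` and `first_match` are made up for illustration, it assumes both events fall on the same day, and it measures from the moment the secondary marked the primary as down, not from the moment the VM was actually powered off.

```python
from datetime import datetime

# Lines copied verbatim from the secondary's log in this ticket.
LOG = """\
Thu Dec 19 02:19:09.121 [rsHealthPoll] replSet member sessionmgr01:27717 is now in state DOWN
Thu Dec 19 02:19:09.122 [rsMgr] replSet info electSelf 2
Thu Dec 19 02:19:17.124 [rsHealthPoll] replset info sessionmgr01:27717 heartbeat failed, retrying
Thu Dec 19 02:19:28.072 [rsMgr] replSet PRIMARY
"""


def timestamp(line):
    """Parse the time-of-day field (4th whitespace-separated token)."""
    return datetime.strptime(line.split()[3], "%H:%M:%S.%f")


def first_match(lines, needle):
    """Return the timestamp of the first line containing `needle`."""
    return next(timestamp(line) for line in lines if needle in line)


lines = LOG.strip().splitlines()
marked_down = first_match(lines, "is now in state DOWN")
became_primary = first_match(lines, "replSet PRIMARY")

# Assumes both events are on the same calendar day.
elapsed = (became_primary - marked_down).total_seconds()
print(f"{elapsed:.3f} s from DOWN to PRIMARY")  # 18.951 s from DOWN to PRIMARY
```

For this log, roughly 19 seconds pass between the DOWN mark and the secondary becoming PRIMARY.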



 Comments   
Comment by Gianfranco Palumbo [ 23/Dec/13 ]

The ticket Matt is referring to is SERVER-10225.

This is currently scheduled for 2.7.x (the development version of 2.8)

Please click on "Start watching this issue" to receive email updates on the status of the feature request.

Comment by Amit Wankhede [ 20/Dec/13 ]

Hi Matt,

Can you please provide more insight on this?

Comment by Matt Dannenberg [ 20/Dec/13 ]

Turns out there is a "speed up failovers" ticket. It is now linked.

Comment by Matt Dannenberg [ 20/Dec/13 ]

At this time, there is no concrete plan to improve replica set failover time specifically. We are planning to rework replica set internals in the near future. We anticipate that this will have a positive effect on failover time.

Comment by Amit Wankhede [ 20/Dec/13 ]

Hi Matt,

Thanks for your comments.

In which future release will we get the fix for failover time?

regards,
Amit

Comment by Matt Dannenberg [ 19/Dec/13 ]

Failover time is not tunable.

We detect a downed node by the loss of heartbeats and heartbeat responses. A heartbeat response times out after 10 seconds; if, at that point, we have also not received a heartbeat from the node in the past two seconds (heartbeats are sent every two seconds), we mark it as down. So it is common for 10 seconds to pass before the election process even starts.
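
The detection rule described above can be sketched as a small model. This is a simplified illustration, not the actual mongod implementation: the function names and the idea of a single outstanding heartbeat request are assumptions, and only the 10-second timeout and 2-second interval come from the explanation above.

```python
HEARTBEAT_INTERVAL_S = 2.0   # heartbeats are sent every two seconds
HEARTBEAT_TIMEOUT_S = 10.0   # a heartbeat response times out after 10 seconds


def should_mark_down(now, request_sent_at, last_heartbeat_from_peer_at):
    """Hypothetical check: is the peer considered down at time `now`?

    All arguments are timestamps in seconds on the same clock.
    """
    # Our outstanding heartbeat request has gone unanswered for the full timeout.
    response_timed_out = (now - request_sent_at) >= HEARTBEAT_TIMEOUT_S
    # The peer has not sent us a heartbeat of its own within the last interval.
    no_recent_heartbeat = (now - last_heartbeat_from_peer_at) > HEARTBEAT_INTERVAL_S
    return response_timed_out and no_recent_heartbeat


def detection_delay(failed_at, next_request_at):
    """Seconds from the peer's failure until the first request sent
    after the failure times out (under this simplified model)."""
    return (next_request_at + HEARTBEAT_TIMEOUT_S) - failed_at


# Peer fails at t=0.5; our next heartbeat request goes out at t=2.0.
print(detection_delay(0.5, 2.0))           # 11.5
print(should_mark_down(12.0, 2.0, 0.0))    # True: timed out, no recent heartbeat
print(should_mark_down(11.0, 2.0, 0.0))    # False: only 9 seconds have elapsed
```

Under this model, detection alone takes between 10 and 12 seconds depending on where in the heartbeat interval the node fails, before any election time is added.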

There are many other variables, such as the latency between nodes and whether or not the first election is successful. As a result, we make no guarantees with regard to how long a failover will take.

In a future release, we will be heavily reworking the internals of replication, and a side effect of that should be reduced failover time.
