[SERVER-23471] Replication heartbeat command is blocked on I/O operations Created: 01/Apr/16  Updated: 06/Dec/22  Resolved: 19/Apr/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.3.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jonathan Abrahams Assignee: Backlog - Replication Team
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
Assigned Teams:
Replication
Operating System: ALL
Steps To Reproduce:

Attempts to to simulate slow I/O using tickle and nbd are in progress.

Participants:
Linked BF Score: 0

 Description   

If there is a very slow I/O operation on the primary, then the primary could lose communication to the secondaries, forcing an election when the I/O operation completes.

This was observed with an apparently slow hard drive during the creation of the namespace file, which took over 10 seconds.

[js_test:localhostAuthBypass] 2016-03-16T17:38:10.617+0000 d20010| 2016-03-16T17:38:10.617+0000 I INDEX    [conn8] †\data\db\job0\mongorunner\replsets_server-6591-0\admin.ns, filling with zeroes...
[js_test:localhostAuthBypass] 2016-03-16T17:38:19.625+0000 d20010| 2016-03-16T17:38:19.625+0000 I REPL     [ReplicationExecutor] can't see a majority of the set, relinquishing primary
[js_test:localhostAuthBypass] 2016-03-16T17:38:19.625+0000 d20010| 2016-03-16T17:38:19.625+0000 I REPL     [ReplicationExecutor] Stepping down from primary in response to heartbeat
[js_test:localhostAuthBypass] 2016-03-16T17:38:19.625+0000 d20010| 2016-03-16T17:38:19.625+0000 I REPL     [ReplicationExecutor] Member localhost:20011 is now in state SECONDARY
[js_test:localhostAuthBypass] 2016-03-16T17:38:19.651+0000 d20010| 2016-03-16T17:38:19.651+0000 I REPL     [ReplicationExecutor] Member localhost:20012 is now in state SECONDARY
[js_test:localhostAuthBypass] 2016-03-16T17:38:20.381+0000 d20011| 2016-03-16T17:38:20.381+0000 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
[js_test:localhostAuthBypass] 2016-03-16T17:38:20.381+0000 d20011| 2016-03-16T17:38:20.381+0000 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
[js_test:localhostAuthBypass] 2016-03-16T17:38:20.381+0000 d20010| 2016-03-16T17:38:20.382+0000 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:56040 #9 (5 connections now open)
[js_test:localhostAuthBypass] 2016-03-16T17:38:20.381+0000 d20011| 2016-03-16T17:38:20.382+0000 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
[js_test:localhostAuthBypass] 2016-03-16T17:38:20.382+0000 d20010| 2016-03-16T17:38:20.383+0000 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:56041 #10 (6 connections now open)
[js_test:localhostAuthBypass] 2016-03-16T17:38:20.384+0000 d20011| 2016-03-16T17:38:20.383+0000 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 2
[js_test:localhostAuthBypass] 2016-03-16T17:38:20.384+0000 d20011| 2016-03-16T17:38:20.384+0000 I REPL     [ReplicationExecutor] transition to PRIMARY



 Comments   
Comment by Gregory McKeon (Inactive) [ 19/Apr/18 ]

Heartbeats are no longer the primary channel for conveying liveness in a RS.

Generated at Thu Feb 08 04:03:28 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.