Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-23471

Replication heartbeat command is blocked on I/O operations

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 3.3.3
    • Component/s: Replication
    • None
    • Replication
    • ALL
    • Hide

      Attempts to to simulate slow I/O using tickle and nbd are in progress.

      Show
      Attempts to to simulate slow I/O using tickle and nbd are in progress.
    • 0
    • None
    • 0
    • None
    • None
    • None
    • None
    • None
    • None

      If there is a very slow I/O operation on the primary, then the primary could lose communication to the secondaries, forcing an election when the I/O operation completes.

      This was observed with an apparently slow hard drive during the creation of the namespace file, which took over 10 seconds.

      [js_test:localhostAuthBypass] 2016-03-16T17:38:10.617+0000 d20010| 2016-03-16T17:38:10.617+0000 I INDEX    [conn8] †\data\db\job0\mongorunner\replsets_server-6591-0\admin.ns, filling with zeroes...
      [js_test:localhostAuthBypass] 2016-03-16T17:38:19.625+0000 d20010| 2016-03-16T17:38:19.625+0000 I REPL     [ReplicationExecutor] can't see a majority of the set, relinquishing primary
      [js_test:localhostAuthBypass] 2016-03-16T17:38:19.625+0000 d20010| 2016-03-16T17:38:19.625+0000 I REPL     [ReplicationExecutor] Stepping down from primary in response to heartbeat
      [js_test:localhostAuthBypass] 2016-03-16T17:38:19.625+0000 d20010| 2016-03-16T17:38:19.625+0000 I REPL     [ReplicationExecutor] Member localhost:20011 is now in state SECONDARY
      [js_test:localhostAuthBypass] 2016-03-16T17:38:19.651+0000 d20010| 2016-03-16T17:38:19.651+0000 I REPL     [ReplicationExecutor] Member localhost:20012 is now in state SECONDARY
      [js_test:localhostAuthBypass] 2016-03-16T17:38:20.381+0000 d20011| 2016-03-16T17:38:20.381+0000 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
      [js_test:localhostAuthBypass] 2016-03-16T17:38:20.381+0000 d20011| 2016-03-16T17:38:20.381+0000 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
      [js_test:localhostAuthBypass] 2016-03-16T17:38:20.381+0000 d20010| 2016-03-16T17:38:20.382+0000 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:56040 #9 (5 connections now open)
      [js_test:localhostAuthBypass] 2016-03-16T17:38:20.381+0000 d20011| 2016-03-16T17:38:20.382+0000 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
      [js_test:localhostAuthBypass] 2016-03-16T17:38:20.382+0000 d20010| 2016-03-16T17:38:20.383+0000 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:56041 #10 (6 connections now open)
      [js_test:localhostAuthBypass] 2016-03-16T17:38:20.384+0000 d20011| 2016-03-16T17:38:20.383+0000 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 2
      [js_test:localhostAuthBypass] 2016-03-16T17:38:20.384+0000 d20011| 2016-03-16T17:38:20.384+0000 I REPL     [ReplicationExecutor] transition to PRIMARY
      

            Assignee:
            backlog-server-repl [DO NOT USE] Backlog - Replication Team
            Reporter:
            jonathan.abrahams Jonathan Abrahams (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

              Created:
              Updated:
              Resolved: