Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-48962

Member in replica set switches between primary and secondary every few minutes

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Incomplete
    • Affects Version/s: 4.0.18, 4.0.19
    • Fix Version/s: None
    • Component/s: Replication
    • Labels:
      None
    • Operating System:
      ALL

      Description

      We have a Primary-Secondary-Arbiter setup with version 4.0.18 (upgraded to 4.0.19 but still the problem persists) which started experiencing some very strange behavior. The three nodes are called Osiril, Horus and Isis (Arbiter). Every few minutes (from 2 to 10 min looking at the log) we get this in the log:

       

      2020-06-15T18:56:29.734+0300 I REPL [replexec-6] Member Isis.local:27017 is now in state RS_DOWN
      2020-06-15T18:56:29.734+0300 I REPL [replexec-6] Member osiris.local:27017 is now in state RS_DOWN
      2020-06-15T18:56:29.735+0300 I REPL [replexec-6] can't see a majority of the set, relinquishing primary
      2020-06-15T18:56:29.735+0300 I REPL [replexec-6] Stepping down from primary in response to heartbeat
      2020-06-15T18:56:29.735+0300 I REPL [replexec-6] transition to SECONDARY from PRIMARY
      2020-06-15T18:56:29.735+0300 I NETWORK [replexec-6] Skip closing connection for connection # 40
      2020-06-15T18:56:29.735+0300 I NETWORK [replexec-6] Skip closing connection for connection # 35
      2020-06-15T18:56:29.735+0300 I REPL [replexec-0] Member osiris.local:27017 is now in state SECONDARY
      2020-06-15T18:56:29.735+0300 I REPL [replexec-1] Member Isis.local:27017 is now in state ARBITER
       
      ...
       
      2020-06-15T18:56:30.236+0300 I REPL [replexec-1] Member osiris.local:27017 is now in state PRIMARY
       
      ...
       
      2020-06-15T18:56:31.225+0300 I REPL [rsBackgroundSync] sync source candidate: osiris.local:27017
      2020-06-15T18:56:31.227+0300 I REPL [rsBackgroundSync] Changed sync source from empty to osiris.local:27017
      2020-06-15T18:56:31.227+0300 I ASIO [RS] Connecting to osiris.local:27017
       
      ...
       
      2020-06-15T18:57:29.753+0300 I REPL [replexec-0] Starting an election, since we've seen no PRIMARY in the past 10000ms
      2020-06-15T18:57:29.753+0300 I REPL [replexec-0] conducting a dry run election to see if we could be elected. current term: 1136
      2020-06-15T18:57:29.754+0300 I ASIO [Replication] Connecting to Isis.local:27017
      2020-06-15T18:57:29.754+0300 I ASIO [Replication] Connecting to osiris.local:27017
      2020-06-15T18:57:29.754+0300 I REPL [replexec-1] VoteRequester(term 1136 dry run) received a no vote from Isis.local:27017 with reason "can see a healthy primary (osiris.local:27017) of equal or greater priority"; response message: \{ term: 1136, voteGranted: false, reason: "can see a healthy primary (osiris.local:27017) of equal or greater priority", ok: 1.0 }
      2020-06-15T18:57:29.754+0300 I REPL [replexec-0] VoteRequester(term 1136 dry run) received a yes vote from osiris.local:27017; response message: \{ term: 1136, voteGranted: true, reason: "", ok: 1.0, operationTime: Timestamp(1592236637, 1), $clusterTime: { clusterTime: Timestamp(1592236637, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }
      2020-06-15T18:57:29.754+0300 I REPL [replexec-0] dry election run succeeded, running for election in term 1137
      2020-06-15T18:57:29.755+0300 I REPL [replexec-6] VoteRequester(term 1137) received a no vote from Isis.local:27017 with reason "can see a healthy primary (osiris.local:27017) of equal or greater priority"; response message: \{ term: 1137, voteGranted: false, reason: "can see a healthy primary (osiris.local:27017) of equal or greater priority", ok: 1.0 }
      2020-06-15T18:57:29.759+0300 I REPL [replexec-1] VoteRequester(term 1137) received a yes vote from osiris.local:27017; response message: \{ term: 1137, voteGranted: true, reason: "", ok: 1.0, operationTime: Timestamp(1592236637, 1), $clusterTime: { clusterTime: Timestamp(1592236637, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }
      2020-06-15T18:57:29.759+0300 I REPL [replexec-1] election succeeded, assuming primary role in term 1137
      2020-06-15T18:57:29.760+0300 I REPL [replexec-1] transition to PRIMARY from SECONDARY
      2020-06-15T18:57:29.760+0300 I REPL [replexec-1] Resetting sync source to empty, which was osiris.local:27017
      2020-06-15T18:57:29.760+0300 I REPL [replexec-1] Entering primary catch-up mode.
      2020-06-15T18:57:29.761+0300 I REPL [replexec-0] Member osiris.local:27017 is now in state SECONDARY
      2020-06-15T18:57:29.761+0300 I REPL [replexec-0] Caught up to the latest optime known via heartbeats after becoming primary. Target optime: \{ ts: Timestamp(1592236637, 1), t: 1136 }. My Last Applied: \{ ts: Timestamp(1592236637, 1), t: 1136 }
      2020-06-15T18:57:29.761+0300 I REPL [replexec-0] Exited primary catch-up mode.
      2020-06-15T18:57:29.761+0300 I REPL [replexec-0] Stopping replication producer
      2020-06-15T18:57:29.761+0300 I REPL [ReplBatcher] Oplog buffer has been drained in term 1137
      2020-06-15T18:57:29.761+0300 I REPL [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source. Abandoning this batch of oplog entries and re-evaluating our sync source.
      2020-06-15T18:57:29.761+0300 I CONNPOOL [RS] Ending connection to host osiris.local:27017 due to bad connection status; 1 connections to that host remain open
      2020-06-15T18:57:29.767+0300 I REPL [rsSync-0] transition to primary complete; database writes are now permitted
       
      ...
       
      2020-06-15T18:58:29.734+0300 I REPL [replexec-7] Member Isis.local:27017 is now in state RS_DOWN
      2020-06-15T18:58:29.735+0300 I REPL [replexec-7] Member osiris.local:27017 is now in state RS_DOWN
      2020-06-15T18:58:29.735+0300 I REPL [replexec-7] can't see a majority of the set, relinquishing primary
      2020-06-15T18:58:29.735+0300 I REPL [replexec-7] Stepping down from primary in response to heartbeat
      2020-06-15T18:58:29.735+0300 I REPL [replexec-7] transition to SECONDARY from PRIMARY
      
      

      And this repeats forever. It looks like some timer is killing the connection with the other replica and this causes the switch between primary and secondary but we have no way to diagnose this. This setup has been running for about 10 years now. The only thing we have changed is to move the virtual machine Horus to another server hardware which is faster. This problem causes all clients to break the connection every few minutes and in some cases crashes some services.

      Attached are the log files from all servers.

       

        Attachments

        1. diagnostic.data-horus-rs0.zip
          19.40 MB
        2. mongod-horus-rs0.zip
          2.11 MB
        3. mongod-isis-rs0.zip
          76 kB
        4. mongod-osiris-rs0.zip
          3.05 MB

          Activity

            People

            Assignee:
            dmitry.agranat Dmitry Agranat
            Reporter:
            vlad.dimitrov@gmail.com Vladimir Dimitrov
            Participants:
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved: