Core Server / SERVER-43118

Primary randomly steps down without being explicitly told

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Replication
    • Labels: None

      It begins:

      2019-08-31T00:24:47.927-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
      2019-08-31T00:24:47.929-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
      2019-08-31T00:24:47.984-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
      2019-08-31T00:24:47.984-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
      2019-08-31T00:24:47.985-0500 I REPL     [ReplicationExecutor] Member mongo-machine01b.internal-domain.xyz:27017 is now in state PRIMARY
      2019-08-31T00:24:48.600-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
      2019-08-31T00:24:48.601-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
      2019-08-31T00:24:48.601-0500 I REPL     [ReplicationExecutor] Member ev-mongo01a.internal-domain.xyz:27017 is now in state SECONDARY
      2019-08-31T00:24:48.930-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
      2019-08-31T00:24:48.930-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
      2019-08-31T00:24:48.930-0500 I REPL     [ReplicationExecutor] Member mongo-machineB01.internal-domain.xyz:27017 is now in state SECONDARY
      2019-08-31T00:24:48.985-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
      2019-08-31T00:24:48.985-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
      2019-08-31T00:24:48.986-0500 I REPL     [ReplicationExecutor] Member mongo-machine01b.internal-domain.xyz:27017 is now in state PRIMARY
      2019-08-31T00:24:49.603-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
      2019-08-31T00:24:49.603-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
      2019-08-31T00:24:49.931-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
      2019-08-31T00:24:49.931-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
      2019-08-31T00:24:49.986-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
      2019-08-31T00:24:49.986-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246

       

      .... a bunch of clients reconnecting and queries in the queue executing

       

      2019-08-31T00:24:50.636-0500 I REPL     [replExecDBWorker-2] transition to SECONDARY
      2019-08-31T00:24:50.636-0500 I NETWORK  [replExecDBWorker-2] legacy transport layer closing all connections
      2019-08-31T00:24:50.636-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576628
      2019-08-31T00:24:50.636-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576627
      2019-08-31T00:24:50.636-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576624
      2019-08-31T00:24:50.636-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576623
      2019-08-31T00:24:50.636-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576622
      2019-08-31T00:24:50.636-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576621
      2019-08-31T00:24:50.636-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576620
      2019-08-31T00:24:50.636-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576619
      2019-08-31T00:24:50.639-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576400
      2019-08-31T00:24:50.639-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576399
      2019-08-31T00:24:50.641-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576062
      2019-08-31T00:24:50.642-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 575947
      2019-08-31T00:24:50.643-0500 I -        [conn576612] AssertionException handling request, closing client connection: 172 Operation attempted on a closed transport Session.
      2019-08-31T00:24:50.681-0500 I -        [conn535386] AssertionException handling request, closing client connection: 172 Operation attempted on a closed transport Session.
      2019-08-31T00:24:50.681-0500 I -        [conn535392] AssertionException handling request, closing client connection: 172 Operation attempted on a closed transport Session.
      (dozens more lines like the above repeat)

       

      .... a bunch of clients reconnecting

       

      2019-08-31T00:24:51.680-0500 I REPL     [rsBackgroundSync] chaining not allowed, choosing primary as sync source candidate: mongo-machine01b.internal-domain.xyz:27017
      2019-08-31T00:24:51.683-0500 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to mongo-machine01b.internal-domain.xyz:27017
      2019-08-31T00:24:51.693-0500 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to mongo-machine01b.internal-domain.xyz:27017, took 11ms (1 connections now open to mongo-machine01b.internal-domain.xyz:27017)
      2019-08-31T00:24:51.771-0500 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to mongo-machine01b.internal-domain.xyz:27017
      2019-08-31T00:24:51.775-0500 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to mongo-machine01b.internal-domain.xyz:27017, took 5ms (2 connections now open to mongo-machine01b.internal-domain.xyz:27017)
      

      .... a bunch of clients reconnecting. And then it decides it wants to be PRIMARY again:

       

      2019-08-31T00:24:57.648-0500 I REPL     [ReplicationExecutor] Canceling priority takeover callback
      2019-08-31T00:24:57.648-0500 I REPL     [ReplicationExecutor] Starting an election for a priority takeover
      2019-08-31T00:24:57.648-0500 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected. current term: 246
      2019-08-31T00:24:57.649-0500 I REPL     [ReplicationExecutor] VoteRequester(term 246 dry run) received a yes vote from mongo-machineB01.internal-domain.xyz:27017; response message: { term: 246, voteGranted: true, reason: "", ok: 1.0 }
      2019-08-31T00:24:57.649-0500 I REPL     [ReplicationExecutor] dry election run succeeded, running for election in term 247
      2019-08-31T00:24:57.649-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Ending connection to host mongo-machine01b.internal-domain.xyz:27017 due to bad connection status; 0 connections to that host remain open
      2019-08-31T00:24:57.653-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongo-machine01b.internal-domain.xyz:27017
      2019-08-31T00:24:57.661-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mongo-machine01b.internal-domain.xyz:27017, took 8ms (1 connections now open to mongo-machine01b.internal-domain.xyz:27017)
      2019-08-31T00:24:57.666-0500 I REPL     [ReplicationExecutor] VoteRequester(term 247) received a yes vote from mongo-machineB01.internal-domain.xyz:27017; response message: { term: 247, voteGranted: true, reason: "", ok: 1.0 }
      2019-08-31T00:24:57.666-0500 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 247
      2019-08-31T00:24:57.666-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Ending connection to host mongo-machine01b.internal-domain.xyz:27017 due to bad connection status; 0 connections to that host remain open
      2019-08-31T00:24:57.666-0500 I REPL     [ReplicationExecutor] transition to PRIMARY
      2019-08-31T00:24:57.666-0500 I REPL     [ReplicationExecutor] Entering primary catch-up mode.
      2019-08-31T00:24:57.666-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongo-machine01b.internal-domain.xyz:27017
      2019-08-31T00:24:57.697-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mongo-machine01b.internal-domain.xyz:27017, took 31ms (1 connections now open to mongo-machine01b.internal-domain.xyz:27017)
      2019-08-31T00:24:57.697-0500 I REPL     [ReplicationExecutor] Member mongo-machine01b.internal-domain.xyz:27017 is now in state SECONDARY
      2019-08-31T00:24:57.697-0500 I REPL     [ReplicationExecutor] Caught up to the latest optime known via heartbeats after becoming primary.
      2019-08-31T00:24:57.698-0500 I REPL     [ReplicationExecutor] Exited primary catch-up mode.
      2019-08-31T00:24:57.698-0500 I ASIO     [NetworkInterfaceASIO-RS-0] Ending connection to host mongo-machine01b.internal-domain.xyz:27017 due to bad connection status; 1 connections to that host remain open
      2019-08-31T00:24:57.698-0500 I REPL     [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source.  Abandoning this batch of oplog entries and re-evaluating our sync source.
      2019-08-31T00:24:58.216-0500 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to mongo-machine01b.internal-domain.xyz:27017: InvalidSyncSource: Sync source was cleared. Was mongo-machine01b.internal-domain.xyz:27017
      2019-08-31T00:24:59.591-0500 I REPL     [rsSync] transition to primary complete; database writes are now permitted
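
      The step-down storm above is easy to quantify straight from the log. Below is a minimal sketch (a hypothetical helper, not part of any MongoDB tooling) that counts the "new term has begun" step-down messages per term in 3.x-style plain-text log lines like the ones in this report:

      ```python
      # Hypothetical helper: summarize "stepping down from primary" events
      # from mongod plain-text log lines (pre-4.4 log format, as shown above).
      import re
      from collections import Counter

      STEPDOWN_RE = re.compile(
          r"^(?P<ts>\S+) I REPL\s+\[[^\]]+\] stepping down from primary, "
          r"because a new term has begun: (?P<term>\d+)"
      )

      def summarize_stepdowns(lines):
          """Return {term: count} for every step-down-on-new-term message seen."""
          counts = Counter()
          for line in lines:
              m = STEPDOWN_RE.match(line.strip())
              if m:
                  counts[int(m.group("term"))] += 1
          return dict(counts)
      ```

      Running this over the excerpt above would show many step-down messages all attributed to the same term (246), which is consistent with one election, repeatedly reported, rather than many distinct elections.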
      

      One of our clusters (out of dozens) randomly steps down from primary, without any stepdown command ever being sent to it.

       

      We see it approximately once or twice per week, always at either the valleys or the peaks of our usage.
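
      For anyone hitting the same symptom: in protocol version 1, a primary stepping down "because a new term has begun" means another member called and won an election, typically after missing heartbeats from the primary (network or disk stalls at load peaks/valleys fit that pattern). The relevant timing knobs live in the replica set config; the field names below are real `replSetGetConfig` settings, but the commented values are the usual defaults and should be verified against the running version:

      ```javascript
      // In the mongo shell, on any member of the replica set:
      cfg = rs.conf()
      printjson(cfg.settings)
      // Typical defaults (verify for your version):
      //   heartbeatIntervalMillis: 2000    -- how often members heartbeat each other
      //   heartbeatTimeoutSecs:    10      -- heartbeat failure threshold (pv0)
      //   electionTimeoutMillis:   10000   -- pv1: missed-heartbeat window before
      //                                       a member calls an election
      //
      // Raising electionTimeoutMillis makes elections less trigger-happy,
      // at the cost of slower failover when the primary really is down:
      //   cfg.settings.electionTimeoutMillis = 20000
      //   rs.reconfig(cfg)
      ```

      This is a diagnostic sketch, not a recommended fix; it only helps if the step-downs are being triggered by transient heartbeat delays rather than an actual fault on the primary.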

            Assignee:
            daniel.hatcher@mongodb.com Danny Hatcher (Inactive)
            Reporter:
            mongo@phish.org Chad Kreimendahl
            Votes:
            0
            Watchers:
            6
