[SERVER-43118] Primary randomly steps down without being explicitly told Created: 01/Sep/19  Updated: 11/Oct/19  Resolved: 11/Oct/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Chad Kreimendahl Assignee: Danny Hatcher (Inactive)
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

It begins:

2019-08-31T00:24:47.927-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
2019-08-31T00:24:47.929-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
2019-08-31T00:24:47.984-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
2019-08-31T00:24:47.984-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
2019-08-31T00:24:47.985-0500 I REPL     [ReplicationExecutor] Member mongo-machine01b.internal-domain.xyz:27017 is now in state PRIMARY
2019-08-31T00:24:48.600-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
2019-08-31T00:24:48.601-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
2019-08-31T00:24:48.601-0500 I REPL     [ReplicationExecutor] Member ev-mongo01a.internal-domain.xyz:27017 is now in state SECONDARY
2019-08-31T00:24:48.930-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
2019-08-31T00:24:48.930-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
2019-08-31T00:24:48.930-0500 I REPL     [ReplicationExecutor] Member mongo-machineB01.internal-domain.xyz:27017 is now in state SECONDARY
2019-08-31T00:24:48.985-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
2019-08-31T00:24:48.985-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
2019-08-31T00:24:48.986-0500 I REPL     [ReplicationExecutor] Member mongo-machine01b.internal-domain.xyz:27017 is now in state PRIMARY
2019-08-31T00:24:49.603-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
2019-08-31T00:24:49.603-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
2019-08-31T00:24:49.931-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
2019-08-31T00:24:49.931-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
2019-08-31T00:24:49.986-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
2019-08-31T00:24:49.986-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 246
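The step-down reason above means another member started an election and bumped the term to 246. As a rough sketch, the current term and each member's state can be checked from the mongo shell on any node:

// Sketch: check the current replication term and each member's state
// (field names as reported by replSetGetStatus on 3.4; verify against your version).
var s = rs.status();
print("current term: " + s.term);
s.members.forEach(function (m) { print(m.name + "  " + m.stateStr); });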

 

.... a bunch of clients reconnecting and queries in the queue executing

 

2019-08-31T00:24:50.636-0500 I REPL     [replExecDBWorker-2] transition to SECONDARY
2019-08-31T00:24:50.636-0500 I NETWORK  [replExecDBWorker-2] legacy transport layer closing all connections
2019-08-31T00:24:50.636-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576628
2019-08-31T00:24:50.636-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576627
2019-08-31T00:24:50.636-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576624
2019-08-31T00:24:50.636-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576623
2019-08-31T00:24:50.636-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576622
2019-08-31T00:24:50.636-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576621
2019-08-31T00:24:50.636-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576620
2019-08-31T00:24:50.636-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576619
2019-08-31T00:24:50.639-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576400
2019-08-31T00:24:50.639-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576399
2019-08-31T00:24:50.641-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 576062
2019-08-31T00:24:50.642-0500 I NETWORK  [replExecDBWorker-2] Skip closing connection for connection # 575947
2019-08-31T00:24:50.643-0500 I -        [conn576612] AssertionException handling request, closing client connection: 172 Operation attempted on a closed transport Session.
2019-08-31T00:24:50.681-0500 I -        [conn535386] AssertionException handling request, closing client connection: 172 Operation attempted on a closed transport Session.
2019-08-31T00:24:50.681-0500 I -        [conn535392] AssertionException handling request, closing client connection: 172 Operation attempted on a closed transport Session.
(the above line repeats dozens of times)

 

.... a bunch of clients reconnecting

 

2019-08-31T00:24:51.680-0500 I REPL     [rsBackgroundSync] chaining not allowed, choosing primary as sync source candidate: mongo-machine01b.internal-domain.xyz:27017
2019-08-31T00:24:51.683-0500 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to mongo-machine01b.internal-domain.xyz:27017
2019-08-31T00:24:51.693-0500 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to mongo-machine01b.internal-domain.xyz:27017, took 11ms (1 connections now open to mongo-machine01b.internal-domain.xyz:27017)
2019-08-31T00:24:51.771-0500 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to mongo-machine01b.internal-domain.xyz:27017
2019-08-31T00:24:51.775-0500 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to mongo-machine01b.internal-domain.xyz:27017, took 5ms (2 connections now open to mongo-machine01b.internal-domain.xyz:27017)

.... a bunch of clients reconnecting. And then it decides it wants to be PRIMARY again:

 

2019-08-31T00:24:57.648-0500 I REPL     [ReplicationExecutor] Canceling priority takeover callback
2019-08-31T00:24:57.648-0500 I REPL     [ReplicationExecutor] Starting an election for a priority takeover
2019-08-31T00:24:57.648-0500 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected. current term: 246
2019-08-31T00:24:57.649-0500 I REPL     [ReplicationExecutor] VoteRequester(term 246 dry run) received a yes vote from mongo-machineB01.internal-domain.xyz:27017; response message: { term: 246, voteGranted: true, reason: "", ok: 1.0 }
2019-08-31T00:24:57.649-0500 I REPL     [ReplicationExecutor] dry election run succeeded, running for election in term 247
2019-08-31T00:24:57.649-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Ending connection to host mongo-machine01b.internal-domain.xyz:27017 due to bad connection status; 0 connections to that host remain open
2019-08-31T00:24:57.653-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongo-machine01b.internal-domain.xyz:27017
2019-08-31T00:24:57.661-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mongo-machine01b.internal-domain.xyz:27017, took 8ms (1 connections now open to mongo-machine01b.internal-domain.xyz:27017)
2019-08-31T00:24:57.666-0500 I REPL     [ReplicationExecutor] VoteRequester(term 247) received a yes vote from mongo-machineB01.internal-domain.xyz:27017; response message: { term: 247, voteGranted: true, reason: "", ok: 1.0 }
2019-08-31T00:24:57.666-0500 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 247
2019-08-31T00:24:57.666-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Ending connection to host mongo-machine01b.internal-domain.xyz:27017 due to bad connection status; 0 connections to that host remain open
2019-08-31T00:24:57.666-0500 I REPL     [ReplicationExecutor] transition to PRIMARY
2019-08-31T00:24:57.666-0500 I REPL     [ReplicationExecutor] Entering primary catch-up mode.
2019-08-31T00:24:57.666-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongo-machine01b.internal-domain.xyz:27017
2019-08-31T00:24:57.697-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mongo-machine01b.internal-domain.xyz:27017, took 31ms (1 connections now open to mongo-machine01b.internal-domain.xyz:27017)
2019-08-31T00:24:57.697-0500 I REPL     [ReplicationExecutor] Member mongo-machine01b.internal-domain.xyz:27017 is now in state SECONDARY
2019-08-31T00:24:57.697-0500 I REPL     [ReplicationExecutor] Caught up to the latest optime known via heartbeats after becoming primary.
2019-08-31T00:24:57.698-0500 I REPL     [ReplicationExecutor] Exited primary catch-up mode.
2019-08-31T00:24:57.698-0500 I ASIO     [NetworkInterfaceASIO-RS-0] Ending connection to host mongo-machine01b.internal-domain.xyz:27017 due to bad connection status; 1 connections to that host remain open
2019-08-31T00:24:57.698-0500 I REPL     [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source.  Abandoning this batch of oplog entries and re-evaluating our sync source.
2019-08-31T00:24:58.216-0500 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to mongo-machine01b.internal-domain.xyz:27017: InvalidSyncSource: Sync source was cleared. Was mongo-machine01b.internal-domain.xyz:27017
2019-08-31T00:24:59.591-0500 I REPL     [rsSync] transition to primary complete; database writes are now permitted
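The "priority takeover" lines indicate this member has a higher replica set priority than the node that won term 246, so once it has caught up it calls another election and takes the primary role back in term 247. As a sketch, the configured priorities (and votes) can be confirmed from the shell:

// Sketch: list each member's configured priority and votes; a priority takeover
// only happens when a higher-priority member sees a lower-priority primary.
rs.conf().members.forEach(function (m) {
    print(m.host + "  priority=" + m.priority + "  votes=" + m.votes);
});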

Participants:

 Description   

One of our clusters (out of dozens) randomly steps its primary down without any command being sent to the system to tell it to.

 

We see it approximately 1-2 times per week, always at either the valleys or the peaks of our usage.
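Step-downs clustered around usage peaks and valleys are often a sign that heartbeats, or the primary's replies to them, are being delayed past the election timeout, at which point another member starts a new term. As a sketch, the relevant settings can be read from the replica set config (protocolVersion 1 defaults: electionTimeoutMillis 10000, heartbeatIntervalMillis 2000):

// Sketch: show the election/heartbeat settings and the replication protocol version.
var cfg = rs.conf();
printjson(cfg.settings);                       // electionTimeoutMillis, heartbeatIntervalMillis, ...
print("protocolVersion: " + cfg.protocolVersion);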



 Comments   
Comment by Danny Hatcher (Inactive) [ 11/Oct/19 ]

I'm going to close this ticket for now then. If you do see it happen again, please leave a comment and we will re-open.

Comment by Chad Kreimendahl [ 08/Oct/19 ]

It has not, though I suspect that is in large part due to our moving about 99.5% of the databases (by table count and size) off of the system and onto alternate systems.

Comment by Danny Hatcher (Inactive) [ 01/Oct/19 ]

mongo@phish.org, did you ever see this happen again?

Comment by Chad Kreimendahl [ 05/Sep/19 ]

I'll need to wait for it to happen again. I'll check each morning and get you those files once we see it. It will likely be within the next week, if history holds.

Comment by Danny Hatcher (Inactive) [ 03/Sep/19 ]

Hello Chad,

In order for us to diagnose this, can you please upload the full mongod logs covering at least one of these problem time frames, as well as the "diagnostic.data" folder, for every node in the replica set? You can use our Secure Uploader so that only MongoDB engineers will be able to access them.
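For reference, the "diagnostic.data" folder sits under each node's dbPath. A sketch for confirming that path from the shell (adjust for how mongod was started):

// Sketch: report where this node keeps its data (and thus diagnostic.data).
// parsed.storage.dbPath is present when dbPath was set via config file or
// command line; the compiled-in default is /data/db.
var opts = db.adminCommand({ getCmdLineOpts: 1 });
print(opts.parsed.storage ? opts.parsed.storage.dbPath : "/data/db (default)");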

Comment by Chad Kreimendahl [ 01/Sep/19 ]

Forgot: version 3.4.22, 4-member replica set. Significant CPU, memory, and disk available.
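As a rough sketch, the memory and connection headroom on each member can also be spot-checked from the shell:

// Sketch: basic memory and connection headroom on a member.
var ss = db.serverStatus();
printjson(ss.mem);           // resident / virtual memory (MB)
printjson(ss.connections);   // current vs. available connections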
