[SERVER-48962] Member in replica set switches between primary and secondary every few minutes Created: 18/Jun/20  Updated: 20/Jul/20  Resolved: 20/Jul/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 4.0.18, 4.0.19
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Vladimir Dimitrov Assignee: Dmitry Agranat
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive diagnostic.data-horus-rs0.zip     Zip Archive mongod-horus-rs0.zip     Zip Archive mongod-isis-rs0.zip     Zip Archive mongod-osiris-rs0.zip    
Operating System: ALL
Participants:

 Description   

We have a Primary-Secondary-Arbiter setup running version 4.0.18 (upgraded to 4.0.19, but the problem persists) which started exhibiting some very strange behavior. The three nodes are called Osiris, Horus and Isis (the Arbiter). Every few minutes (anywhere from 2 to 10 minutes, judging by the log) we get this in the log:

 

2020-06-15T18:56:29.734+0300 I REPL [replexec-6] Member Isis.local:27017 is now in state RS_DOWN
2020-06-15T18:56:29.734+0300 I REPL [replexec-6] Member osiris.local:27017 is now in state RS_DOWN
2020-06-15T18:56:29.735+0300 I REPL [replexec-6] can't see a majority of the set, relinquishing primary
2020-06-15T18:56:29.735+0300 I REPL [replexec-6] Stepping down from primary in response to heartbeat
2020-06-15T18:56:29.735+0300 I REPL [replexec-6] transition to SECONDARY from PRIMARY
2020-06-15T18:56:29.735+0300 I NETWORK [replexec-6] Skip closing connection for connection # 40
2020-06-15T18:56:29.735+0300 I NETWORK [replexec-6] Skip closing connection for connection # 35
2020-06-15T18:56:29.735+0300 I REPL [replexec-0] Member osiris.local:27017 is now in state SECONDARY
2020-06-15T18:56:29.735+0300 I REPL [replexec-1] Member Isis.local:27017 is now in state ARBITER
 
...
 
2020-06-15T18:56:30.236+0300 I REPL [replexec-1] Member osiris.local:27017 is now in state PRIMARY
 
...
 
2020-06-15T18:56:31.225+0300 I REPL [rsBackgroundSync] sync source candidate: osiris.local:27017
2020-06-15T18:56:31.227+0300 I REPL [rsBackgroundSync] Changed sync source from empty to osiris.local:27017
2020-06-15T18:56:31.227+0300 I ASIO [RS] Connecting to osiris.local:27017
 
...
 
2020-06-15T18:57:29.753+0300 I REPL [replexec-0] Starting an election, since we've seen no PRIMARY in the past 10000ms
2020-06-15T18:57:29.753+0300 I REPL [replexec-0] conducting a dry run election to see if we could be elected. current term: 1136
2020-06-15T18:57:29.754+0300 I ASIO [Replication] Connecting to Isis.local:27017
2020-06-15T18:57:29.754+0300 I ASIO [Replication] Connecting to osiris.local:27017
2020-06-15T18:57:29.754+0300 I REPL [replexec-1] VoteRequester(term 1136 dry run) received a no vote from Isis.local:27017 with reason "can see a healthy primary (osiris.local:27017) of equal or greater priority"; response message: { term: 1136, voteGranted: false, reason: "can see a healthy primary (osiris.local:27017) of equal or greater priority", ok: 1.0 }
2020-06-15T18:57:29.754+0300 I REPL [replexec-0] VoteRequester(term 1136 dry run) received a yes vote from osiris.local:27017; response message: { term: 1136, voteGranted: true, reason: "", ok: 1.0, operationTime: Timestamp(1592236637, 1), $clusterTime: { clusterTime: Timestamp(1592236637, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }
2020-06-15T18:57:29.754+0300 I REPL [replexec-0] dry election run succeeded, running for election in term 1137
2020-06-15T18:57:29.755+0300 I REPL [replexec-6] VoteRequester(term 1137) received a no vote from Isis.local:27017 with reason "can see a healthy primary (osiris.local:27017) of equal or greater priority"; response message: { term: 1137, voteGranted: false, reason: "can see a healthy primary (osiris.local:27017) of equal or greater priority", ok: 1.0 }
2020-06-15T18:57:29.759+0300 I REPL [replexec-1] VoteRequester(term 1137) received a yes vote from osiris.local:27017; response message: { term: 1137, voteGranted: true, reason: "", ok: 1.0, operationTime: Timestamp(1592236637, 1), $clusterTime: { clusterTime: Timestamp(1592236637, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }
2020-06-15T18:57:29.759+0300 I REPL [replexec-1] election succeeded, assuming primary role in term 1137
2020-06-15T18:57:29.760+0300 I REPL [replexec-1] transition to PRIMARY from SECONDARY
2020-06-15T18:57:29.760+0300 I REPL [replexec-1] Resetting sync source to empty, which was osiris.local:27017
2020-06-15T18:57:29.760+0300 I REPL [replexec-1] Entering primary catch-up mode.
2020-06-15T18:57:29.761+0300 I REPL [replexec-0] Member osiris.local:27017 is now in state SECONDARY
2020-06-15T18:57:29.761+0300 I REPL [replexec-0] Caught up to the latest optime known via heartbeats after becoming primary. Target optime: { ts: Timestamp(1592236637, 1), t: 1136 }. My Last Applied: { ts: Timestamp(1592236637, 1), t: 1136 }
2020-06-15T18:57:29.761+0300 I REPL [replexec-0] Exited primary catch-up mode.
2020-06-15T18:57:29.761+0300 I REPL [replexec-0] Stopping replication producer
2020-06-15T18:57:29.761+0300 I REPL [ReplBatcher] Oplog buffer has been drained in term 1137
2020-06-15T18:57:29.761+0300 I REPL [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source. Abandoning this batch of oplog entries and re-evaluating our sync source.
2020-06-15T18:57:29.761+0300 I CONNPOOL [RS] Ending connection to host osiris.local:27017 due to bad connection status; 1 connections to that host remain open
2020-06-15T18:57:29.767+0300 I REPL [rsSync-0] transition to primary complete; database writes are now permitted
 
...
 
2020-06-15T18:58:29.734+0300 I REPL [replexec-7] Member Isis.local:27017 is now in state RS_DOWN
2020-06-15T18:58:29.735+0300 I REPL [replexec-7] Member osiris.local:27017 is now in state RS_DOWN
2020-06-15T18:58:29.735+0300 I REPL [replexec-7] can't see a majority of the set, relinquishing primary
2020-06-15T18:58:29.735+0300 I REPL [replexec-7] Stepping down from primary in response to heartbeat
2020-06-15T18:58:29.735+0300 I REPL [replexec-7] transition to SECONDARY from PRIMARY

And this repeats forever. It looks like some timer is killing the connection to the other replica set member, and this causes the switch between primary and secondary, but we have no way to diagnose it. This setup has been running for about 10 years now. The only thing we have changed is moving the Horus virtual machine to different, faster server hardware. The problem causes all clients to lose their connections every few minutes and in some cases crashes some services.
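For reference, a rough monitoring sketch (not part of the original report; hostnames are taken from the log excerpt above): polling replSetGetStatus from an external script records exactly when members are marked unreachable, independently of the mongod log files.

import time
from pymongo import MongoClient

# Connect directly to the member under suspicion (Horus in this report).
client = MongoClient("mongodb://horus.local:27017/?directConnection=true")

while True:
    status = client.admin.command("replSetGetStatus")
    for member in status["members"]:
        # stateStr is PRIMARY/SECONDARY/ARBITER or "(not reachable/healthy)";
        # health is 1.0 for reachable members and 0.0 otherwise.
        print(status["date"], member["name"], member["stateStr"], member.get("health", 1))
    time.sleep(10)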

Attached are the log files from all servers.

 



 Comments   
Comment by Vladimir Dimitrov [ 20/Jul/20 ]

Hi Dmitry,

I doubt the issue was somehow magically resolved, but I can't provide you with logs from the past; they were automatically cleared by MongoDB due to its default behavior. If you are unable to see the issue or do anything about it using the logs I have already provided, please feel free to close the issue.

Comment by Dmitry Agranat [ 20/Jul/20 ]

Hi vlad.dimitrov@gmail.com,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,
Dima

Comment by Dmitry Agranat [ 07/Jul/20 ]

Hi vlad.dimitrov@gmail.com,

Since we do not have mongod logs covering the time of this event, we recommend implementing log archiving. Once we are able to review logs that cover the time of the event, we can continue diagnosing this issue.
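For illustration, a minimal sketch (assuming the default systemLog.logRotate: rename setting; the hostname is taken from this ticket) of triggering mongod's built-in log rotation so the active log file is renamed with a timestamp suffix and can be archived instead of being lost on restart:

from pymongo import MongoClient

# Connect directly to the member whose log should be rotated.
client = MongoClient("mongodb://horus.local:27017/?directConnection=true")

# logRotate is a standard admin command; with logRotate: rename, mongod renames
# the current log file with a timestamp suffix and opens a new one.
client.admin.command("logRotate")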

Thanks,
Dima

Comment by Vladimir Dimitrov [ 26/Jun/20 ]

Hi Carl. Looks like MongoDB overwrites the log file every time it is restarted. I don't see any other log files that I can send you.

Comment by Carl Champain (Inactive) [ 25/Jun/20 ]

Hi vlad.dimitrov@gmail.com,

The logs for "Osiris" start at 2020-06-15T14:51:38.334+0300, whereas the logs for "Isis" start at 2020-06-15T21:39:05.013+0300 and the logs for "Horus" start at 2020-06-15T18:53:36.852+0300. Can you please provide the logs for "Isis" and "Horus" starting at around 2020-06-15T14:51:38.334+0300 (same time as "Osiris")?

Thank you,
Carl
 

Comment by Vladimir Dimitrov [ 23/Jun/20 ]

I am not sure what caused the problem or what fixed it. The actual problem was that MongoDB was doing something every minute, and at some whole-minute interval it was killing its own connection to the other replica. This caused a switch in the PRIMARY/SECONDARY state every 1, 2, ... or 10 minutes, which resulted in all client connections being killed every time the switch happened. I have no idea what exactly was causing this problem, but even if the time was not exactly the same on both replica machines, this should not be expected behavior.
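As a hedged aside (not part of the original comment): the 10000ms in "Starting an election, since we've seen no PRIMARY in the past 10000ms" corresponds to the replica set's electionTimeoutMillis, and the timeouts that govern the flapping cadence can be read back from the configuration, for example:

from pymongo import MongoClient

client = MongoClient("mongodb://horus.local:27017/?directConnection=true")
settings = client.admin.command("replSetGetConfig")["config"]["settings"]

# Defaults shown as fallbacks; actual values come from the replica set config.
print("heartbeatIntervalMillis:", settings.get("heartbeatIntervalMillis", 2000))
print("heartbeatTimeoutSecs:   ", settings.get("heartbeatTimeoutSecs", 10))
print("electionTimeoutMillis:  ", settings.get("electionTimeoutMillis", 10000))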

Comment by Carl Champain (Inactive) [ 23/Jun/20 ]

Thank you for the additional information, vlad.dimitrov@gmail.com!

From your previous comment it seems that the issue came from outside of MongoDB. Can you please help clarify what you think the bug is with MongoDB?

Comment by Vladimir Dimitrov [ 22/Jun/20 ]

Hi Carl. Attached are the logs from one of the servers, limited to one day, since even compressed these files come to about 200 MB.

 

What I can add is that I somehow managed to stop this from happening. The database actually runs in a VM on an ESXi host. As I stated in the description, this is a new host that I moved the VM to. For some reason the ESXi host had a completely wrong clock (one day behind), so I changed the NTP servers that ESXi uses, which fixed the time and, apparently, this issue. Note that the VM itself had the correct time. I saw one line repeated hundreds of times:

2020-06-15T19:15:17.276+0300 W ASIO [Replication] Alarm returned early. Expected at: 2020-06-15T19:15:25.030+0300, fired at: 2020-06-15T19:15:17.276+0300

That warning led me to look at the times everywhere. I am not sure how this change fixed the situation, or whether it is really fixed, but it looks stable at the moment. Let me know if you need more information to diagnose the issue. As of this moment I have transitioned the replica set to primary-secondary-secondary, but the logs have not been cleared.
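For reference, a small sketch (not part of the original comment; hostnames come from the log excerpts) that compares each member's reported wall-clock time, one way to spot the kind of clock skew an ESXi/NTP misconfiguration can introduce:

from pymongo import MongoClient

hosts = ["horus.local:27017", "osiris.local:27017", "Isis.local:27017"]
for host in hosts:
    client = MongoClient(f"mongodb://{host}/?directConnection=true",
                         serverSelectionTimeoutMS=3000)
    # serverStatus.localTime is the member's current wall-clock time (UTC).
    print(host, client.admin.command("serverStatus")["localTime"])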

Comment by Carl Champain (Inactive) [ 22/Jun/20 ]

Hi vlad.dimitrov@gmail.com,

Thank you for the report.
Would you please archive (tar or zip) the $dbpath/diagnostic.data directory (the contents are described here) corresponding to the time range in the logs and attach it to this ticket?
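As a rough illustration (the dbPath below is a placeholder, not taken from this ticket), the directory can be zipped with a few lines of Python before attaching:

import shutil

DBPATH = "/var/lib/mongodb"  # assumption: replace with the member's actual storage.dbPath
shutil.make_archive("diagnostic.data-horus-rs0", "zip", f"{DBPATH}/diagnostic.data")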

Kind regards,
Carl 
 
