Details
- Type: Bug
- Status: Closed
- Priority: Major - P3
- Resolution: Incomplete
- Affects Version/s: 4.0.18, 4.0.19
- Fix Version/s: None
- Component/s: None
- Operating System: ALL
Description
We have a Primary-Secondary-Arbiter replica set running version 4.0.18 (upgraded to 4.0.19, but the problem persists) that started exhibiting very strange behavior. The three nodes are called Osiris, Horus and Isis (Arbiter). Every few minutes (2 to 10 minutes, judging by the log) we get this in the log:
2020-06-15T18:56:29.734+0300 I REPL [replexec-6] Member Isis.local:27017 is now in state RS_DOWN
2020-06-15T18:56:29.734+0300 I REPL [replexec-6] Member osiris.local:27017 is now in state RS_DOWN
2020-06-15T18:56:29.735+0300 I REPL [replexec-6] can't see a majority of the set, relinquishing primary
2020-06-15T18:56:29.735+0300 I REPL [replexec-6] Stepping down from primary in response to heartbeat
2020-06-15T18:56:29.735+0300 I REPL [replexec-6] transition to SECONDARY from PRIMARY
2020-06-15T18:56:29.735+0300 I NETWORK [replexec-6] Skip closing connection for connection # 40
2020-06-15T18:56:29.735+0300 I NETWORK [replexec-6] Skip closing connection for connection # 35
2020-06-15T18:56:29.735+0300 I REPL [replexec-0] Member osiris.local:27017 is now in state SECONDARY
2020-06-15T18:56:29.735+0300 I REPL [replexec-1] Member Isis.local:27017 is now in state ARBITER
...
2020-06-15T18:56:30.236+0300 I REPL [replexec-1] Member osiris.local:27017 is now in state PRIMARY
...
2020-06-15T18:56:31.225+0300 I REPL [rsBackgroundSync] sync source candidate: osiris.local:27017
2020-06-15T18:56:31.227+0300 I REPL [rsBackgroundSync] Changed sync source from empty to osiris.local:27017
2020-06-15T18:56:31.227+0300 I ASIO [RS] Connecting to osiris.local:27017
...
2020-06-15T18:57:29.753+0300 I REPL [replexec-0] Starting an election, since we've seen no PRIMARY in the past 10000ms
2020-06-15T18:57:29.753+0300 I REPL [replexec-0] conducting a dry run election to see if we could be elected. current term: 1136
2020-06-15T18:57:29.754+0300 I ASIO [Replication] Connecting to Isis.local:27017
2020-06-15T18:57:29.754+0300 I ASIO [Replication] Connecting to osiris.local:27017
2020-06-15T18:57:29.754+0300 I REPL [replexec-1] VoteRequester(term 1136 dry run) received a no vote from Isis.local:27017 with reason "can see a healthy primary (osiris.local:27017) of equal or greater priority"; response message: { term: 1136, voteGranted: false, reason: "can see a healthy primary (osiris.local:27017) of equal or greater priority", ok: 1.0 }
2020-06-15T18:57:29.754+0300 I REPL [replexec-0] VoteRequester(term 1136 dry run) received a yes vote from osiris.local:27017; response message: { term: 1136, voteGranted: true, reason: "", ok: 1.0, operationTime: Timestamp(1592236637, 1), $clusterTime: { clusterTime: Timestamp(1592236637, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }
2020-06-15T18:57:29.754+0300 I REPL [replexec-0] dry election run succeeded, running for election in term 1137
2020-06-15T18:57:29.755+0300 I REPL [replexec-6] VoteRequester(term 1137) received a no vote from Isis.local:27017 with reason "can see a healthy primary (osiris.local:27017) of equal or greater priority"; response message: { term: 1137, voteGranted: false, reason: "can see a healthy primary (osiris.local:27017) of equal or greater priority", ok: 1.0 }
2020-06-15T18:57:29.759+0300 I REPL [replexec-1] VoteRequester(term 1137) received a yes vote from osiris.local:27017; response message: { term: 1137, voteGranted: true, reason: "", ok: 1.0, operationTime: Timestamp(1592236637, 1), $clusterTime: { clusterTime: Timestamp(1592236637, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }
2020-06-15T18:57:29.759+0300 I REPL [replexec-1] election succeeded, assuming primary role in term 1137
2020-06-15T18:57:29.760+0300 I REPL [replexec-1] transition to PRIMARY from SECONDARY
2020-06-15T18:57:29.760+0300 I REPL [replexec-1] Resetting sync source to empty, which was osiris.local:27017
2020-06-15T18:57:29.760+0300 I REPL [replexec-1] Entering primary catch-up mode.
2020-06-15T18:57:29.761+0300 I REPL [replexec-0] Member osiris.local:27017 is now in state SECONDARY
2020-06-15T18:57:29.761+0300 I REPL [replexec-0] Caught up to the latest optime known via heartbeats after becoming primary. Target optime: { ts: Timestamp(1592236637, 1), t: 1136 }. My Last Applied: { ts: Timestamp(1592236637, 1), t: 1136 }
2020-06-15T18:57:29.761+0300 I REPL [replexec-0] Exited primary catch-up mode.
2020-06-15T18:57:29.761+0300 I REPL [replexec-0] Stopping replication producer
2020-06-15T18:57:29.761+0300 I REPL [ReplBatcher] Oplog buffer has been drained in term 1137
2020-06-15T18:57:29.761+0300 I REPL [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source. Abandoning this batch of oplog entries and re-evaluating our sync source.
2020-06-15T18:57:29.761+0300 I CONNPOOL [RS] Ending connection to host osiris.local:27017 due to bad connection status; 1 connections to that host remain open
2020-06-15T18:57:29.767+0300 I REPL [rsSync-0] transition to primary complete; database writes are now permitted
...
2020-06-15T18:58:29.734+0300 I REPL [replexec-7] Member Isis.local:27017 is now in state RS_DOWN
2020-06-15T18:58:29.735+0300 I REPL [replexec-7] Member osiris.local:27017 is now in state RS_DOWN
2020-06-15T18:58:29.735+0300 I REPL [replexec-7] can't see a majority of the set, relinquishing primary
2020-06-15T18:58:29.735+0300 I REPL [replexec-7] Stepping down from primary in response to heartbeat
2020-06-15T18:58:29.735+0300 I REPL [replexec-7] transition to SECONDARY from PRIMARY
And this repeats forever. It looks like some timer is killing the connections to the other replica set members, and this is what causes the switches between primary and secondary, but we have no way to diagnose it. This setup has been running for about 10 years. The only thing we have changed is moving the Horus virtual machine to different, faster server hardware. The problem causes all clients to lose their connections every few minutes and in some cases crashes some services.
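For reference, the "Starting an election, since we've seen no PRIMARY in the past 10000ms" line matches settings.electionTimeoutMillis, which defaults to 10000 ms, and heartbeats are sent every settings.heartbeatIntervalMillis (2000 ms by default). A minimal mongo shell sketch of how the timing settings and member configuration can be inspected (assuming a connection to Horus; the exact host is just an example):

// Sketch: dump the replica set configuration. The timing knobs live under
// cfg.settings, and cfg.members shows hosts, priorities, votes and arbiterOnly flags.
var cfg = rs.conf();
printjson(cfg.settings);   // electionTimeoutMillis, heartbeatIntervalMillis, heartbeatTimeoutSecs, ...
printjson(cfg.members);    // the Osiris, Horus and Isis entries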
Attached are the log files from all servers.
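If it helps, the live heartbeat view from each node can also be captured while the flapping is happening. A rough sketch, run in the mongo shell on each member (the fields come straight from rs.status(); the local member shows no heartbeat fields):

// Sketch: print each member's health, state, last received heartbeat and ping
// time so the gaps can be lined up with the RS_DOWN entries in the attached logs.
rs.status().members.forEach(function (m) {
    print(m.name + "  health=" + m.health + "  state=" + m.stateStr +
          "  lastHeartbeatRecv=" + m.lastHeartbeatRecv +
          "  pingMs=" + m.pingMs +
          "  msg=" + (m.lastHeartbeatMessage || ""));
});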