Loading...

Type: Bug
Resolution: Incomplete
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 4.0.18, 4.0.19
Component/s: Replication
Labels:
None

Operating System:
ALL
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

We have a Primary-Secondary-Arbiter setup with version 4.0.18 (upgraded to 4.0.19 but still the problem persists) which started experiencing some very strange behavior. The three nodes are called Osiril, Horus and Isis (Arbiter). Every few minutes (from 2 to 10 min looking at the log) we get this in the log:

2020-06-15T18:56:29.734+0300 I REPL [replexec-6] Member Isis.local:27017 is now in state RS_DOWN
2020-06-15T18:56:29.734+0300 I REPL [replexec-6] Member osiris.local:27017 is now in state RS_DOWN
2020-06-15T18:56:29.735+0300 I REPL [replexec-6] can't see a majority of the set, relinquishing primary
2020-06-15T18:56:29.735+0300 I REPL [replexec-6] Stepping down from primary in response to heartbeat
2020-06-15T18:56:29.735+0300 I REPL [replexec-6] transition to SECONDARY from PRIMARY
2020-06-15T18:56:29.735+0300 I NETWORK [replexec-6] Skip closing connection for connection # 40
2020-06-15T18:56:29.735+0300 I NETWORK [replexec-6] Skip closing connection for connection # 35
2020-06-15T18:56:29.735+0300 I REPL [replexec-0] Member osiris.local:27017 is now in state SECONDARY
2020-06-15T18:56:29.735+0300 I REPL [replexec-1] Member Isis.local:27017 is now in state ARBITER

...

2020-06-15T18:56:30.236+0300 I REPL [replexec-1] Member osiris.local:27017 is now in state PRIMARY

...

2020-06-15T18:56:31.225+0300 I REPL [rsBackgroundSync] sync source candidate: osiris.local:27017
2020-06-15T18:56:31.227+0300 I REPL [rsBackgroundSync] Changed sync source from empty to osiris.local:27017
2020-06-15T18:56:31.227+0300 I ASIO [RS] Connecting to osiris.local:27017

...

2020-06-15T18:57:29.753+0300 I REPL [replexec-0] Starting an election, since we've seen no PRIMARY in the past 10000ms
2020-06-15T18:57:29.753+0300 I REPL [replexec-0] conducting a dry run election to see if we could be elected. current term: 1136
2020-06-15T18:57:29.754+0300 I ASIO [Replication] Connecting to Isis.local:27017
2020-06-15T18:57:29.754+0300 I ASIO [Replication] Connecting to osiris.local:27017
2020-06-15T18:57:29.754+0300 I REPL [replexec-1] VoteRequester(term 1136 dry run) received a no vote from Isis.local:27017 with reason "can see a healthy primary (osiris.local:27017) of equal or greater priority"; response message: \{ term: 1136, voteGranted: false, reason: "can see a healthy primary (osiris.local:27017) of equal or greater priority", ok: 1.0 }
2020-06-15T18:57:29.754+0300 I REPL [replexec-0] VoteRequester(term 1136 dry run) received a yes vote from osiris.local:27017; response message: \{ term: 1136, voteGranted: true, reason: "", ok: 1.0, operationTime: Timestamp(1592236637, 1), $clusterTime: { clusterTime: Timestamp(1592236637, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }
2020-06-15T18:57:29.754+0300 I REPL [replexec-0] dry election run succeeded, running for election in term 1137
2020-06-15T18:57:29.755+0300 I REPL [replexec-6] VoteRequester(term 1137) received a no vote from Isis.local:27017 with reason "can see a healthy primary (osiris.local:27017) of equal or greater priority"; response message: \{ term: 1137, voteGranted: false, reason: "can see a healthy primary (osiris.local:27017) of equal or greater priority", ok: 1.0 }
2020-06-15T18:57:29.759+0300 I REPL [replexec-1] VoteRequester(term 1137) received a yes vote from osiris.local:27017; response message: \{ term: 1137, voteGranted: true, reason: "", ok: 1.0, operationTime: Timestamp(1592236637, 1), $clusterTime: { clusterTime: Timestamp(1592236637, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }
2020-06-15T18:57:29.759+0300 I REPL [replexec-1] election succeeded, assuming primary role in term 1137
2020-06-15T18:57:29.760+0300 I REPL [replexec-1] transition to PRIMARY from SECONDARY
2020-06-15T18:57:29.760+0300 I REPL [replexec-1] Resetting sync source to empty, which was osiris.local:27017
2020-06-15T18:57:29.760+0300 I REPL [replexec-1] Entering primary catch-up mode.
2020-06-15T18:57:29.761+0300 I REPL [replexec-0] Member osiris.local:27017 is now in state SECONDARY
2020-06-15T18:57:29.761+0300 I REPL [replexec-0] Caught up to the latest optime known via heartbeats after becoming primary. Target optime: \{ ts: Timestamp(1592236637, 1), t: 1136 }. My Last Applied: \{ ts: Timestamp(1592236637, 1), t: 1136 }
2020-06-15T18:57:29.761+0300 I REPL [replexec-0] Exited primary catch-up mode.
2020-06-15T18:57:29.761+0300 I REPL [replexec-0] Stopping replication producer
2020-06-15T18:57:29.761+0300 I REPL [ReplBatcher] Oplog buffer has been drained in term 1137
2020-06-15T18:57:29.761+0300 I REPL [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source. Abandoning this batch of oplog entries and re-evaluating our sync source.
2020-06-15T18:57:29.761+0300 I CONNPOOL [RS] Ending connection to host osiris.local:27017 due to bad connection status; 1 connections to that host remain open
2020-06-15T18:57:29.767+0300 I REPL [rsSync-0] transition to primary complete; database writes are now permitted

...

2020-06-15T18:58:29.734+0300 I REPL [replexec-7] Member Isis.local:27017 is now in state RS_DOWN
2020-06-15T18:58:29.735+0300 I REPL [replexec-7] Member osiris.local:27017 is now in state RS_DOWN
2020-06-15T18:58:29.735+0300 I REPL [replexec-7] can't see a majority of the set, relinquishing primary
2020-06-15T18:58:29.735+0300 I REPL [replexec-7] Stepping down from primary in response to heartbeat
2020-06-15T18:58:29.735+0300 I REPL [replexec-7] transition to SECONDARY from PRIMARY

And this repeats forever. It looks like some timer is killing the connection with the other replica and this causes the switch between primary and secondary but we have no way to diagnose this. This setup has been running for about 10 years now. The only thing we have changed is to move the virtual machine Horus to another server hardware which is faster. This problem causes all clients to break the connection every few minutes and in some cases crashes some services.

Attached are the log files from all servers.

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

diagnostic.data-horus-rs0.zip
19.40 MB
Jun 22 2020 04:56:28 PM UTC
mongod-horus-rs0.zip
2.11 MB
Jun 18 2020 03:56:41 PM UTC
mongod-isis-rs0.zip
76 kB
Jun 18 2020 03:56:36 PM UTC
mongod-osiris-rs0.zip
3.05 MB
Jun 18 2020 03:56:43 PM UTC

Details

Description

Attachments

Attachments

Activity

People

Dates