[SERVER-40182] All replica set members are secondaries Created: 17/Mar/19  Updated: 19/Mar/19  Resolved: 19/Mar/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Pavel Zeger [X] Assignee: Eric Sedor
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

Hi. We have 7 servers in production across two networks (uk and nl), where one of them is an arbiter and another is hidden. We encountered a problem with elections even though all servers were online (no network issues): all members were in the SECONDARY state and we didn't see any errors in the logs.

I have attached the configuration below. I assume we still had a voting majority in that situation.

What could be the problem?

 

[
{ "_id" : 3, "name" : "uk1:27017", "health" : 1, "state" : 1, "stateStr" : "PRIMARY", "uptime" : 74206, "optime" : Timestamp(1552820007, 13), "optimeDate" : ISODate("2019-03-17T10:53:27Z"), "lastHeartbeat" : ISODate("2019-03-17T10:53:27.621Z"), "lastHeartbeatRecv" : ISODate("2019-03-17T10:53:27.664Z"), "pingMs" : NumberLong(13), "electionTime" : Timestamp(1552738992, 1), "electionDate" : ISODate("2019-03-16T12:23:12Z"), "configVersion" : 498059 },
{ "_id" : 4, "name" : "uk2:27017", "health" : 1, "state" : 2, "stateStr" : "SECONDARY", "uptime" : 74206, "optime" : Timestamp(1552820007, 13), "optimeDate" : ISODate("2019-03-17T10:53:27Z"), "lastHeartbeat" : ISODate("2019-03-17T10:53:27.621Z"), "lastHeartbeatRecv" : ISODate("2019-03-17T10:53:27.278Z"), "pingMs" : NumberLong(13), "syncingTo" : "uk1:27017", "configVersion" : 498059 },
{ "_id" : 5, "name" : "nl1:27017", "health" : 1, "state" : 2, "stateStr" : "SECONDARY", "uptime" : 74206, "optime" : Timestamp(1552820007, 13), "optimeDate" : ISODate("2019-03-17T10:53:27Z"), "lastHeartbeat" : ISODate("2019-03-17T10:53:27.596Z"), "lastHeartbeatRecv" : ISODate("2019-03-17T10:53:27.398Z"), "pingMs" : NumberLong(7), "syncingTo" : "uk1:27017", "configVersion" : 498059 },
{ "_id" : 6, "name" : "nl2:27017", "health" : 1, "state" : 2, "stateStr" : "SECONDARY", "uptime" : 74206, "optime" : Timestamp(1552820007, 13), "optimeDate" : ISODate("2019-03-17T10:53:27Z"), "lastHeartbeat" : ISODate("2019-03-17T10:53:27.596Z"), "lastHeartbeatRecv" : ISODate("2019-03-17T10:53:25.885Z"), "pingMs" : NumberLong(7), "syncingTo" : "uk1:27017", "configVersion" : 498059 },
{ "_id" : 7, "name" : "az1:27017", "health" : 1, "state" : 7, "stateStr" : "ARBITER", "uptime" : 10951400, "syncingTo" : "nl1:27017", "infoMessage" : "syncing from: nl1wv8912:27017", "configVersion" : 498059, "self" : true },
{ "_id" : 8, "name" : "uk2:27017", "health" : 1, "state" : 2, "stateStr" : "SECONDARY", "uptime" : 74206, "optime" : Timestamp(1552820007, 13), "optimeDate" : ISODate("2019-03-17T10:53:27Z"), "lastHeartbeat" : ISODate("2019-03-17T10:53:27.622Z"), "lastHeartbeatRecv" : ISODate("2019-03-17T10:53:27.623Z"), "pingMs" : NumberLong(14), "syncingTo" : "uk2:27017", "configVersion" : 498059 },
{ "_id" : 9, "name" : "nl3:27017", "health" : 1, "state" : 2, "stateStr" : "SECONDARY", "uptime" : 74206, "optime" : Timestamp(1552820007, 13), "optimeDate" : ISODate("2019-03-17T10:53:27Z"), "lastHeartbeat" : ISODate("2019-03-17T10:53:27.596Z"), "lastHeartbeatRecv" : ISODate("2019-03-17T10:53:26.222Z"), "pingMs" : NumberLong(7), "syncingTo" : "nl2:27017", "configVersion" : 498059 }
]

{
"_id" : 3,
"host" : "uk1:27017",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 10,
"tags" : {

},
"slaveDelay" : NumberLong(0),
"votes" : 1
},
{
"_id" : 4,
"host" : "uk2:27017",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 3,
"tags" : {

},
"slaveDelay" : NumberLong(0),
"votes" : 1
},
{
"_id" : 5,
"host" : "nl1:27017",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {

},
"slaveDelay" : NumberLong(0),
"votes" : 1
},
{
"_id" : 6,
"host" : "nl2:27017",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {

},
"slaveDelay" : NumberLong(0),
"votes" : 1
},
{
"_id" : 7,
"host" : "az1:27017",
"arbiterOnly" : true,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {

},
"slaveDelay" : NumberLong(0),
"votes" : 1
},
{
"_id" : 8,
"host" : "uk3:27017",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : true,
"priority" : 0,
"tags" : {

},
"slaveDelay" : NumberLong(0),
"votes" : 1
},
{
"_id" : 9,
"host" : "nl3:27017",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {

},
"slaveDelay" : NumberLong(0),
"votes" : 1
}
],
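For reference, every member in the configuration above has "votes" : 1, so with seven voting members the strict majority is 4 and a majority was reachable while all servers were online. A quick way to confirm this from the mongo shell (a minimal sketch; the variable names are only illustrative and not from the original report):

// Count the voting members in the current replica set config and derive the strict majority.
var cfg = rs.conf();
var votes = cfg.members.reduce(function (sum, m) { return sum + m.votes; }, 0);
var majority = Math.floor(votes / 2) + 1;
print("voting members: " + votes + ", strict majority: " + majority);
// With the configuration above this prints "voting members: 7, strict majority: 4".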



 Comments   
Comment by Eric Sedor [ 19/Mar/19 ]

Thanks. Upgrading to Replication Protocol version 1 (pv1) should help resolve this issue.

After that, we recommend upgrading to MongoDB 3.4+. MongoDB 3.2 reached end of life in September of 2018.
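For completeness, the usual way to make that change is a reconfiguration issued from the current primary; a minimal sketch in the mongo shell (please verify against the pv1 upgrade documentation for your 3.2 deployment before applying it in production):

// Switch the replica set from protocol version 0 to protocol version 1.
// Run while connected to the current primary.
cfg = rs.conf()
cfg.protocolVersion = 1
rs.reconfig(cfg)

After the reconfig, rs.status() should show the members running under pv1, and elections should use terms rather than the pv0 replSetElect voting visible in the attached logs.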

Comment by Pavel Zeger [X] [ 19/Mar/19 ]

3.2

Comment by Eric Sedor [ 18/Mar/19 ]

Hello; can you please clarify the MongoDB version you're running?

Comment by Pavel Zeger [X] [ 17/Mar/19 ]

I also found these lines on the nl1 host:

2019-03-16T09:34:27.010+0000 I REPL [ReplicationExecutor] uk2:27017 is trying to elect itself but uk1:27017 is already primary
2019-03-16T09:34:28.028+0000 I REPL [ReplicationExecutor] uk2:27017 is trying to elect itself but uk1:27017 is already primary
2019-03-16T09:34:31.044+0000 I REPL [ReplicationExecutor] uk2:27017 is trying to elect itself but uk1:27017 is already primary
2019-03-16T09:34:31.891+0000 I REPL [ReplicationExecutor] uk2:27017 is trying to elect itself but uk1:27017 is already primary
2019-03-16T09:34:32.958+0000 I REPL [ReplicationExecutor] uk2:27017 is trying to elect itself but uk1:27017 is already primary
2019-03-16T09:34:33.090+0000 I REPL [ReplicationExecutor] uk2:27017 is trying to elect itself but uk1:27017 is already primary
2019-03-16T09:34:33.450+0000 I REPL [ReplicationExecutor] uk2:27017 is trying to elect itself but uk1:27017 is already primary
2019-03-16T09:34:35.151+0000 I REPL [ReplicationExecutor] uk2:27017 is trying to elect itself but uk1:27017 is already primary
2019-03-16T09:34:35.684+0000 I REPL [ReplicationExecutor] uk2:27017 is trying to elect itself but uk1:27017 is already primary
2019-03-16T09:34:35.744+0000 I ASIO [NetworkInterfaceASIO-0] Ending connection to host uk1:27017 due to bad connection status; 0 connections to that host remain open
2019-03-16T09:34:35.744+0000 I REPL [ReplicationExecutor] Error in heartbeat request to uk1:27017; ExceededTimeLimit: Operation timed out
2019-03-16T09:34:36.056+0000 I REPL [ReplicationExecutor] replSetElect voting yea for uk2:27017 (4)
2019-03-16T09:34:36.296+0000 I NETWORK [conn2719711] end connection XXX:59720 (5 connections now open)
2019-03-16T09:34:36.376+0000 I NETWORK [initandlisten] connection accepted from XXX:62558 #4602673 (6 connections now open)
2019-03-16T09:34:36.439+0000 I NETWORK [conn4602673] end connection XXX:62558 (5 connections now open)
2019-03-16T09:34:36.632+0000 I NETWORK [initandlisten] connection accepted from XXX:62563 #4602674 (6 connections now open)
2019-03-16T09:34:36.948+0000 I REPL [ReplicationExecutor] replSetElect voting yea for uk2:27017 (4)
2019-03-16T09:34:37.744+0000 I ASIO [NetworkInterfaceASIO-0] Connecting to uk1:27017
2019-03-16T09:34:37.799+0000 I ASIO [NetworkInterfaceASIO-0] Successfully connected to uk1:27017, took 55ms (1 connections now open to uk1:27017)
2019-03-16T09:34:37.800+0000 I REPL [ReplicationExecutor] Member uk1:27017 is now in state PRIMARY
2019-03-16T09:34:37.957+0000 I ASIO [NetworkInterfaceASIO-0] Ending connection to host uk1:27017 due to bad connection status; 1 connections to that host remain open
2019-03-16T09:34:37.957+0000 I REPL [ReplicationExecutor] syncing from: uk1:27017
2019-03-16T09:34:38.522+0000 I REPL [rsBackgroundSync] Chose same sync source candidate as last time, uk1:27017. Sleeping for 1 second to avoid immediately choosing a new sync source for the same reason as last time.
2019-03-16T09:34:38.615+0000 I REPL [ReplicationExecutor] Member uk2:27017 is now in state PRIMARY
2019-03-16T09:34:38.615+0000 W REPL [ReplicationExecutor] two remote primaries (transiently)
2019-03-16T09:34:39.021+0000 W REPL [ReplicationExecutor] two remote primaries (transiently)
2019-03-16T09:34:39.522+0000 I ASIO [rsBackgroundSync] dropping unhealthy pooled connection to XXX:27017
2019-03-16T09:34:39.522+0000 I ASIO [rsBackgroundSync] after drop, pool was empty, going to spawn some connections
2019-03-16T09:34:39.522+0000 I ASIO [NetworkInterfaceASIO-0] Connecting to XXX:27017
2019-03-16T09:34:39.525+0000 I ASIO [NetworkInterfaceASIO-0] Successfully connected to XXX:27017, took 3ms (1 connections now open to 10.59.7.3:27017)
2019-03-16T09:34:39.532+0000 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update: network error while attempting to run command 'replSetUpdatePosition' on host 'uk1:27017'
2019-03-16T09:34:39.534+0000 I REPL [SyncSourceFeedback] updateUpstream failed: HostUnreachable: network error while attempting to run command 'replSetUpdatePosition' on host 'XXX:27017' , will retry
2019-03-16T09:34:39.593+0000 W REPL [rsBackgroundSync] Fetcher error querying oplog: InvalidSyncSource: Upstream node rolled back after verifying that it had our MinValid point. Retrying.
2019-03-16T09:34:39.593+0000 I REPL [ReplicationExecutor] syncing from: nl2:27017
2019-03-16T09:34:39.646+0000 I REPL [SyncSourceFeedback] setting syncSourceFeedback to nl2:27017
2019-03-16T09:34:39.646+0000 I ASIO [NetworkInterfaceASIO-0] Connecting to nl2:27017
2019-03-16T09:34:39.667+0000 I ASIO [NetworkInterfaceASIO-0] Successfully connected to nl2:27017, took 21ms (1 connections now open to nl2:27017)
2019-03-16T09:34:39.679+0000 W REPL [ReplicationExecutor] two remote primaries (transiently)
2019-03-16T09:34:39.731+0000 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: -1, timestamp: Mar 16 09:34:29:6d). source's GTE: (term: -1, timestamp: Mar 16 09:34:38:1) hashes: (-8689705086113376198/-6337150662047189268)
2019-03-16T09:34:39.731+0000 I REPL [rsBackgroundSync] rollback 0
2019-03-16T09:34:39.731+0000 I REPL [ReplicationExecutor] transition to ROLLBACK
2019-03-16T09:34:39.732+0000 I NETWORK [conn4569668] end connection XXX:52300 (
