[SERVER-25818] Network timeout on Shard led to two primaries Created: 26/Aug/16  Updated: 26/Aug/16  Resolved: 26/Aug/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Stefan Stark Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File Cluster01_Shard0_Backup.txt     Text File Cluster01_Shard0_ExampleMongodConfig.txt     Text File Cluster01_Shard0_Primary.txt     Text File Cluster01_Shard0_Secondary.txt     Text File Cluster01_Shard0_rsConfig.txt    
Operating System: ALL
Participants:

 Description   

A network error led to continuous elections for a few hours until the shard finally broke. The main problem can be seen in the log excerpt posted in the comments below:



 Comments   
Comment by Kelsey Schubert [ 26/Aug/16 ]

Hi stefan.stark@qplix.com,

Thank you for reporting this issue. After examining the logs, the behavior that you observe is expected given the network conditions. I see that Cluster01_Shard0_Primary repeatedly loses its connection to the rest of the replica set. As a result, it must step down and then call for a priority takeover when it reconnects. To resolve this issue, I would recommend investigating the cause of the network errors. In the interim, you can set the two nodes to the same priority, which will prevent elections from continually occurring.
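The interim fix suggested above can be sketched as a reconfiguration of the replica set. This is a minimal illustration, not the reporter's actual config: the config document below is mocked (hostnames are taken from the attached logs; the `_id` and starting priorities are assumptions), so only the transformation itself is shown. Against a live set you would fetch the real document with `rs.conf()` in the mongo shell.

```javascript
// Mocked stand-in for what rs.conf() might return for this shard
// (set name, member _ids, and starting priorities are assumptions;
// hostnames appear in the attached logs):
const cfg = {
  _id: "Cluster01_Shard0",
  members: [
    { _id: 0, host: "qpx-r1s4.qplix.com:27120", priority: 2 },
    { _id: 1, host: "qpx-r1s5.qplix.com:27120", priority: 1 },
  ],
};

// Give both members the same priority so that a member reconnecting
// after a network blip no longer calls for a priority takeover:
for (const m of cfg.members) {
  m.priority = 1;
}

// Against a live replica set you would now apply this with
// rs.reconfig(cfg) on the current primary.
console.log(cfg.members.map((m) => m.priority)); // [ 1, 1 ]
```

Note this only stops the continual priority takeovers; elections triggered by the heartbeat timeouts themselves will still occur until the underlying network problem is fixed.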

For MongoDB-related support discussion please post on the mongodb-users group or Stack Overflow with the mongodb tag. A question like this involving more discussion would be best posted on the mongodb-users group.

Kind regards,
Thomas

Comment by Ramon Fernandez Marina [ 26/Aug/16 ]

Sorry you've run into this issue stefan.stark@qplix.com, and thanks for uploading the logs – we'll investigate.

Comment by Stefan Stark [ 26/Aug/16 ]

Accidentally saved the ticket too early. More info:

Primary and secondary mongod instances lose sight of each other and both get elected primary. Once the network is restored, they keep stepping down:

2016-08-26T03:57:50.121+0200 I REPL     [ReplicationExecutor] Error in heartbeat request to qpx-r1s4.qplix.com:27120; ExceededTimeLimit: Operation timed out
2016-08-26T03:57:50.121+0200 I REPL     [ReplicationExecutor] Standing for election
2016-08-26T03:57:50.123+0200 I REPL     [ReplicationExecutor] not electing self, qpx-r2s1.qplix.com:27120 would veto with 'qpx-r1s5.qplix.com:27120 is trying to elect itself but qpx-r1s4.qplix.com:27120 is already primary and more up-to-date'
2016-08-26T03:57:50.123+0200 I REPL     [ReplicationExecutor] not electing self, we are not freshest
2016-08-26T03:57:50.337+0200 I REPL     [ReplicationExecutor] could not find member to sync from
2016-08-26T03:57:50.339+0200 I REPL     [ReplicationExecutor] Standing for election
2016-08-26T03:57:50.341+0200 I REPL     [ReplicationExecutor] running for election
2016-08-26T03:57:50.342+0200 I REPL     [ReplicationExecutor] received vote: 1 votes from qpx-r2s1.qplix.com:27120
2016-08-26T03:57:50.342+0200 I REPL     [ReplicationExecutor] election succeeded, assuming primary role
2016-08-26T03:57:50.342+0200 I REPL     [ReplicationExecutor] transition to PRIMARY

OS: Windows Server 2012
All instances run on different Servers
All instances run on mongodb 3.2.8
All instances use the same config template (See attachments)
I added log files for the first 5-10 minutes, although the problem occurred for a few hours. In the end a rollback operation failed and killed the shard.

Generated at Thu Feb 08 04:10:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.