[SERVER-8145] Two primaries for the same replica set Created: 11/Jan/13  Updated: 10/Dec/14  Resolved: 05/Mar/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.3.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jason Zucchetto Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 12.04.01


Attachments: Text File localdb.txt     Text File logs-rsstats.txt    
Issue Links:
Duplicate
duplicates SERVER-9730 Asymmetrical network partition can ca... Closed
duplicates SERVER-8205 able to create a split primary situat... Closed
duplicates SERVER-9765 Two primaries should cause the earlie... Closed
duplicates SERVER-9848 nonprimaries should assist to resolve... Closed
duplicates SERVER-10375 DNS failures can cause a primary-less... Closed
duplicates SERVER-10575 Two Primaries if Replica Set Heartbea... Closed
Operating System: ALL
Participants:

 Comments   
Comment by Nate Carlson [ 12/Feb/13 ]

We hit a similar issue with two nodes becoming primary due to a hostname resolution issue; our layout looked something like:

  • mongo1-dal05.dal05.example.com (Mongo 2.2.1)
  • mongo2-dal05.dal05.example.com (Mongo 2.2.1)
  • mongo3-dal05.dal05.example.com (Mongo 2.2.1)
  • mongo1-sea01.sea01.example.com (Mongo 2.2.0)
  • mongo1-wdc01.wdc01.example.com (Mongo 2.2.0)

'rs.initiate' was run on mongo1-dal05.dal05.example.com, then 'rs.add' was run on each additional host to add it to the replica set. 'rs.initiate' registered mongo1-dal05 under its short hostname ('mongo1-dal05' instead of 'mongo1-dal05.dal05.example.com'), so the other hosts in dal05.example.com could reach it, but machines outside that domain could not. With this configuration, no matter what we tried, one node in the 'dal05.example.com' domain would be PRIMARY, and one of the two nodes outside 'dal05.example.com' would also list as PRIMARY.

We didn't have time to dig into this in detail; we needed to get things back up, so we just started from scratch and ensured the FQDN was used for each member. If needed, we can try to replicate it in a lab environment.
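The short-hostname pitfall described above can be caught before it bites: verify that every member name in the replica set config is fully qualified. A minimal Python sketch of such a check (the hostnames and the minimum-label heuristic are illustrative assumptions, not part of MongoDB itself):

```python
def is_fqdn(hostname: str, min_labels: int = 3) -> bool:
    """Heuristic: treat a name with at least `min_labels` non-empty
    dot-separated labels as fully qualified (e.g. host.domain.tld)."""
    labels = hostname.split(".")
    return len(labels) >= min_labels and all(labels)

# Hypothetical member list as it might appear in a replica set config.
members = [
    "mongo1-dal05.dal05.example.com:27017",
    "mongo1-dal05:27017",  # short name: resolvable only inside dal05.example.com
]

for member in members:
    host = member.rsplit(":", 1)[0]
    if not is_fqdn(host):
        print(f"WARNING: {host} is not fully qualified; "
              f"members outside its domain may be unable to reach it")
```

Running a check like this against the output of rs.conf() before adding remote members would have flagged the mongo1-dal05 entry.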

Comment by Jason Zucchetto [ 11/Jan/13 ]

I found the problem: Machine B had an error in /etc/hosts and couldn't communicate back to Machine A. Communication was only occurring from Machine A -> Machine B, not Machine B -> Machine A.

It's easy to reproduce. I'm not sure how severe an issue this is now (maybe it's a minor bug); however, you can still end up with a two-primary replica set using the steps above.
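The asymmetric failure here (A -> B works, B -> A doesn't) is easy to miss because a check from only one side looks healthy. A small Python sketch of a one-direction TCP reachability probe; to catch this class of problem it has to be run from *both* machines toward each other (host names below are placeholders):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# On Machine B, probe Machine A's mongod port (and vice versa on A):
# can_connect("machine-a.example.com", 27017)
```

In the scenario above, can_connect run from A toward B would return True while the same probe from B toward A would return False, which is exactly the split the heartbeat protocol was tripping over.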

Comment by Jason Zucchetto [ 11/Jan/13 ]

I was able to reproduce this with the following steps:

Machine A:

  • rs.initiate()

Machine B:

  • rs.initiate() (doesn't need to occur at the same time as A)

Machine A:

  • rs.add(<Machine B>)
  • rs.status() -> Two primaries

Machine A was in Sydney again (not sure if the network lag matters), Machine B in Northern Virginia. I made sure to spin up new machines for the test.

Comment by Jason Zucchetto [ 11/Jan/13 ]

I believe so (I was working on this in the background and not paying too much attention). The first machine wasn't responding, so I switched over to the other machine and tried to initiate there.

The machine with two primaries is in Sydney, the machine with one primary is in Northern Virginia (both AWS).

Comment by Scott Hernandez (Inactive) [ 11/Jan/13 ]

Did you run rs.initiate() at the same time on both of the nodes?

Comment by Scott Hernandez (Inactive) [ 11/Jan/13 ]

Please post logs, mongodump of the local database and rs.status() for both.

Comment by Jason Zucchetto [ 11/Jan/13 ]

Machines are still running in AWS; I can provide logins to both.

rs0:PRIMARY> rs.status()
{
	"set" : "rs0",
	"date" : ISODate("2013-01-11T01:09:54Z"),
	"myState" : 1,
	"members" : [
		{
			"_id" : 0,
			"name" : "ip-10-240-29-21:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 2928,
			"optime" : {
				"t" : 1357866346000,
				"i" : 1
			},
			"optimeDate" : ISODate("2013-01-11T01:05:46Z"),
			"self" : true
		},
		{
			"_id" : 1,
			"name" : "ip-10-226-97-195:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 248,
			"optime" : {
				"t" : 1357864849000,
				"i" : 1
			},
			"optimeDate" : ISODate("2013-01-11T00:40:49Z"),
			"lastHeartbeat" : ISODate("2013-01-11T01:09:54Z"),
			"lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
			"pingMs" : 257
		}
	],
	"ok" : 1
}
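The rs.status() output above shows the split directly: both members report state 1 (PRIMARY), and the epoch "lastHeartbeatRecv" for the remote member indicates no heartbeat was ever received from it. A minimal Python sketch of a monitoring-side check over a parsed rs.status() document (the document below is abridged from the output above):

```python
def primaries(status: dict) -> list:
    """Return the names of members reporting replica set state 1 (PRIMARY)."""
    return [m["name"] for m in status.get("members", []) if m.get("state") == 1]

# Abridged rs.status() document, as seen from ip-10-240-29-21.
status = {
    "set": "rs0",
    "members": [
        {"_id": 0, "name": "ip-10-240-29-21:27017", "state": 1},
        {"_id": 1, "name": "ip-10-226-97-195:27017", "state": 1},
    ],
}

found = primaries(status)
if len(found) > 1:
    print("split brain detected:", found)
```

A healthy set should never report more than one member in state 1, so any result longer than one element is worth alerting on.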

Generated at Thu Feb 08 03:16:40 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.