[SERVER-8145] Two primaries for the same replica set Created: 11/Jan/13 Updated: 10/Dec/14 Resolved: 05/Mar/14
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.3.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jason Zucchetto | Assignee: | Unassigned |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Ubuntu 12.04.01 |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Comments |
| Comment by Nate Carlson [ 12/Feb/13 ] |
We hit a similar issue with two nodes becoming primary due to a DNS resolution issue; our layout had several hosts inside the dal05.example.com domain and two nodes outside it.

'rs.initiate' was run on mongo1-dal05.dal05.example.com, then 'rs.add' was run on each additional host to add it to the replica set. When 'rs.initiate' added mongo1-dal05, it registered it under its short hostname ('mongo1-dal05' instead of 'mongo1-dal05.dal05.example.com'), so the other hosts in dal05.example.com could reach it, but the machines outside that domain could not. With this configuration, no matter what we tried, one node in the 'dal05.example.com' domain would be PRIMARY, and one of the two nodes outside 'dal05.example.com' would also list itself as PRIMARY. We didn't have time to dig into this in detail; we needed to get things back up, so we started from scratch and made sure the FQDN was used for every member. If needed, we can try to replicate this in a lab environment.
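As a reference for the workaround described above, a minimal sketch of initiating the set with an explicit config so every member is registered under its FQDN; calling a bare rs.initiate() is what let the short hostname slip in. The replica set name, port, and the mongo2-dal05/mongo1-wdc01 hostnames are illustrative, not from the original report.

```
# Run from mongo1-dal05.dal05.example.com. Passing an explicit config
# to rs.initiate() pins each member to its FQDN instead of letting the
# node self-register under its short hostname. Set name, port, and the
# mongo2-dal05/mongo1-wdc01 hosts are illustrative.
mongo --eval '
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo1-dal05.dal05.example.com:27017" },
    { _id: 1, host: "mongo2-dal05.dal05.example.com:27017" },
    { _id: 2, host: "mongo1-wdc01.wdc01.example.com:27017" }
  ]
})'
```

Checking rs.conf() afterwards confirms whether the host fields carry the FQDNs.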
| Comment by Jason Zucchetto [ 11/Jan/13 ] |
I found the problem: Machine B had an error in /etc/hosts and couldn't communicate back to Machine A. Communication was only occurring from Machine A -> Machine B, not from Machine B -> Machine A. It's easy to reproduce; I'm not sure how severe an issue this is now (maybe it's a minor bug), but you can still end up with a two-primary replica set via the steps above.
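A quick way to confirm this kind of one-way reachability is to ping each node from the other side; the 'machine-a'/'machine-b' hostnames below are stand-ins for the actual hosts:

```
# From Machine A: reaching B worked in this scenario.
mongo --host machine-b --port 27017 --eval 'db.adminCommand({ ping: 1 })'

# From Machine B: reaching A failed, because the broken /etc/hosts
# entry on B prevented it from resolving Machine A.
mongo --host machine-a --port 27017 --eval 'db.adminCommand({ ping: 1 })'
```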
| Comment by Jason Zucchetto [ 11/Jan/13 ] |
I was able to reproduce this with the following steps (a reconstruction of the per-machine commands is sketched after this comment):

Machine A:

Machine B:

Machine A:

Machine A was in Sydney again (not sure whether the network lag matters), Machine B in Northern Virginia. I made sure to spin up new machines for the test.
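The per-machine steps above did not survive the export. Based on the surrounding comments (rs.initiate() was attempted on both nodes, and traffic only flowed A -> B), one plausible reconstruction is sketched below; the replica set name, paths, port, and the 'machine-b' hostname are all assumptions:

```
# Machine A (Sydney): start mongod and initiate the set.
mongod --replSet rs0 --dbpath /data/db --logpath /var/log/mongod.log --fork
mongo --eval 'rs.initiate()'

# Machine B (Northern Virginia): A looks unresponsive from here, so
# initiating locally creates a second, independent one-node set.
mongod --replSet rs0 --dbpath /data/db --logpath /var/log/mongod.log --fork
mongo --eval 'rs.initiate()'

# Machine A again: A -> B traffic works, so the add succeeds, and each
# node keeps reporting itself as PRIMARY.
mongo --eval 'rs.add("machine-b:27017")'
```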
| Comment by Jason Zucchetto [ 11/Jan/13 ] |
I believe so (I was working on this in the background and not paying too much attention). The first machine wasn't responding, so I switched over to the other machine and tried to initiate there. The machine reporting two primaries is in Sydney; the machine reporting one primary is in Northern Virginia (both in AWS).
| Comment by Scott Hernandez (Inactive) [ 11/Jan/13 ] |
Did you run rs.initiate() at the same time on both of the nodes?
| Comment by Scott Hernandez (Inactive) [ 11/Jan/13 ] |
Please post the logs, a mongodump of the local database, and rs.status() output from both nodes.
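For reference, those diagnostics can be collected along these lines (port and output paths are assumptions):

```
# Dump the local database, which holds the replica set config and oplog.
mongodump --port 27017 --db local --out /tmp/local-dump

# Capture the replica set status as seen from each node.
mongo --port 27017 --eval 'printjson(rs.status())' > /tmp/rs_status.json
```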
| Comment by Jason Zucchetto [ 11/Jan/13 ] |
Machines are still running in AWS; I can provide logins to both.