[SERVER-17102] primary fails to rejoin set on restart Created: 28/Jan/15  Updated: 29/Jan/15  Resolved: 28/Jan/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.0.0-rc6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Adam Midvidy Assignee: Scott Hernandez (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

rhel55 32-bit


Attachments: Text File mongo1053.log     Text File mongo1054.log     Text File mongo1055.log    
Backwards Compatibility: Fully Compatible
Operating System: Linux
Steps To Reproduce:

Start a 3-node replica set: two mongods and one arbiter.

One mongod has priority 99, the other has priority 1.1.

Start the replica set; the high-priority node becomes primary as expected.

Restart the primary after doing a few ops. The primary fails to rejoin the set with the error:

2015-01-28T21:32:54.167+0000 W REPL     [ReplicationExecutor] Locally stored replica set configuration does not have a valid entry for the current node; waiting for reconfig or remote heartbeat; Got "NodeNotFound No host described in new configuration 1 for replica set 3fe0bcef-8fbd-425a-a6cf-06a3a098cc70 maps to this node" while validating { _id: "3fe0bcef-8fbd-425a-a6cf-06a3a098cc70", version: 1, members: [ { _id: 0, host: "127.0.0.1:1053", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 99.0, tags: { ordinal: "one", dc: "ny" }, slaveDelay: 0, votes: 1 }, { _id: 1, host: "127.0.0.1:1054", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.1, tags: { ordinal: "two", dc: "pa" }, slaveDelay: 0, votes: 1 }, { _id: 2, host: "127.0.0.1:1055", arbiterOnly: true, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatTimeoutSecs: 10, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 } } }
2015-01-28T21:32:54.167+0000 I REPL     [ReplicationExecutor] new replica set config in use: { _id: "3fe0bcef-8fbd-425a-a6cf-06a3a098cc70", version: 1, members: [ { _id: 0, host: "127.0.0.1:1053", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 99.0, tags: { ordinal: "one", dc: "ny" }, slaveDelay: 0, votes: 1 }, { _id: 1, host: "127.0.0.1:1054", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.1, tags: { ordinal: "two", dc: "pa" }, slaveDelay: 0, votes: 1 }, { _id: 2, host: "127.0.0.1:1055", arbiterOnly: true, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatTimeoutSecs: 10, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 } } }
2015-01-28T21:32:54.167+0000 I REPL     [ReplicationExecutor] transition to REMOVED
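
For reference, a minimal shell sketch of the reproduction, assuming plain mongod invocations rather than the exact mongo-orchestration setup used here; the ports, priorities, and set name are taken from the config in the log above, and the dbpath/logpath values are illustrative:

  # start two data-bearing nodes and one arbiter
  mongod --port 1053 --dbpath /data/rs0 --replSet 3fe0bcef-8fbd-425a-a6cf-06a3a098cc70 --fork --logpath /data/rs0/mongod.log
  mongod --port 1054 --dbpath /data/rs1 --replSet 3fe0bcef-8fbd-425a-a6cf-06a3a098cc70 --fork --logpath /data/rs1/mongod.log
  mongod --port 1055 --dbpath /data/arb --replSet 3fe0bcef-8fbd-425a-a6cf-06a3a098cc70 --fork --logpath /data/arb/mongod.log

  # initiate the set with the priorities from the ticket
  mongo --port 1053 --eval 'rs.initiate({
      _id: "3fe0bcef-8fbd-425a-a6cf-06a3a098cc70",
      members: [
          { _id: 0, host: "127.0.0.1:1053", priority: 99 },
          { _id: 1, host: "127.0.0.1:1054", priority: 1.1 },
          { _id: 2, host: "127.0.0.1:1055", arbiterOnly: true }
      ]
  })'

  # perform a few writes against the primary (port 1053), then restart it and watch it transition to REMOVED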

Participants:

 Description   

Found in the C++ driver test suite on a RHEL 5.5 32-bit host.

server git version: ac9ee2fb80f2afc2737a0d9f346cff8117a82af2



 Comments   
Comment by Adam Midvidy [ 28/Jan/15 ]

Scott, having mongo-orchestration set bindIp=127.0.0.1 seems to resolve the issue on our side. Do you want logs for a successful run?
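
For context, at the mongod level that amounts to something like the following (a sketch only; the exact mongo-orchestration preset isn't shown in this ticket, and the dbpath is illustrative):

  # bind explicitly to the loopback address that also appears in the replica set config
  mongod --bind_ip 127.0.0.1 --port 1053 --dbpath /data/rs0 --replSet 3fe0bcef-8fbd-425a-a6cf-06a3a098cc70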

Comment by Scott Hernandez (Inactive) [ 28/Jan/15 ]

When it is stopped, are there any connections open on port 1053? If you start it with bindIp="127.0.0.1", is it fine?

Can you post the logs for those runs as well? Thanks.
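
For example, something along these lines run on the host before restarting would show any leftover listener or open connections (illustrative; lsof may need to be installed separately on RHEL 5.5):

  netstat -anp | grep 1053
  lsof -i :1053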

Comment by Adam Midvidy [ 28/Jan/15 ]

I stopped the process again using 'kill' and restarted it with the same config. It still transitioned to the "REMOVED" state.

Comment by Scott Hernandez (Inactive) [ 28/Jan/15 ]

Also, you should use the same address, either 127.0.0.1 or localhost, in both the bindIp setting and the replica set config; mismatched addresses can lead to hard-to-detect errors.
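
As a quick sanity check, the host strings stored in the replica set config can be compared against the address mongod actually binds to; a sketch, run against a healthy member:

  mongo --port 1054 --eval 'printjson(rs.conf().members.map(function (m) { return m.host; }))'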

Comment by Scott Hernandez (Inactive) [ 28/Jan/15 ]

The line just above what you included shows that the node could not connect to itself, so it couldn't verify that it belongs in the stored replica set config and moved to the REMOVED state, as is appropriate:

 Failed to connect to 127.0.0.1:1053, reason: errno:111 Connection refused

Are you sure the process stopped correctly before restarting? How was it stopped? If you restart it again, does it work?

Comment by Adam Midvidy [ 28/Jan/15 ]

Added logs.
