[SERVER-17400] ReplSet primary's state sometimes gets stuck following reconfig Created: 26/Feb/15  Updated: 05/Jan/18  Resolved: 26/Feb/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.6.7
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Mathias Stearn
Assignee: Andy Schwerin
Resolution: Won't Fix
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL

 Description   

In some scenarios, a reconfig operation can cause the primary to get "stuck" in an unelectable state. The node recovers following a restart. The characteristic log lines look like the following:

2015-02-18T17:35:57.494-0500 [conn221] replSet info : additive change to configuration
2015-02-18T17:35:57.494-0500 [conn221] replSet replSetReconfig new config saved locally
 
...
 
2015-02-18T17:36:02.656-0500 [rsMgr] replSet error p != rs->self in checkNewState
2015-02-18T17:36:02.656-0500 [rsMgr] replSet CURRENT_HOST:27017
2015-02-18T17:36:02.656-0500 [rsMgr] replSet CURRENT_HOST:27017

The hostnames in the final two lines are the same.
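
For anyone scanning 2.6 logs for this pattern, the following is a minimal sketch in Python that flags the signature shown above: the checkNewState error followed immediately by two [rsMgr] lines naming the same host. The function name find_stuck_signatures and both regular expressions are hypothetical, written against the excerpt in this ticket rather than taken from any MongoDB tooling. A match is only a hint; on its own it does not prove the node is stuck.

```python
import re

# Marker text copied from the log excerpt in this ticket.
ERROR_MARKER = "replSet error p != rs->self in checkNewState"
# Assumed shape of the follow-up lines: "[rsMgr] replSet <host>:<port>".
HOST_LINE = re.compile(r"\[rsMgr\] replSet (\S+:\d+)\s*$")


def find_stuck_signatures(lines):
    """Return (line_index, host) pairs where the error line is followed
    by two host lines that name the same host."""
    hits = []
    for i, line in enumerate(lines):
        if ERROR_MARKER not in line:
            continue
        # Look at the two lines directly after the error line.
        following = [HOST_LINE.search(l) for l in lines[i + 1:i + 3]]
        if len(following) == 2 and all(following):
            first, second = (m.group(1) for m in following)
            if first == second:
                hits.append((i, first))
    return hits


sample = [
    "2015-02-18T17:36:02.656-0500 [rsMgr] replSet error p != rs->self in checkNewState",
    "2015-02-18T17:36:02.656-0500 [rsMgr] replSet CURRENT_HOST:27017",
    "2015-02-18T17:36:02.656-0500 [rsMgr] replSet CURRENT_HOST:27017",
]
print(find_stuck_signatures(sample))
```

The sketch assumes the three lines appear consecutively, as in the excerpt; if other threads interleave log lines between them, the two-line window would need to be widened.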



 Comments   
Comment by Davis Ford [ 09/Dec/16 ]

Andy,

I'm currently trying to add a Mongo 3.0 node to an older production 2.6.5 replica set. I'm seeing a lot of these errors as the new 3.0 node spins up and syncs off one of the current secondaries:

[rsMgr] replSet error p != rs->self in checkNewState

The error is being logged by the secondary that the new 3.0 node (itself also a secondary) is syncing from.

What does this mean and what should I do, if anything? If it is benign, that would be great to hear, but this is a prod system that I'm trying to upgrade and it makes me pretty nervous – googling the errors leads either here or directly to the source. Any words of comfort?

Comment by Andy Schwerin [ 26/Feb/15 ]

This bug does not exist on the 3.0 and master branches, due to extensive refactoring. The code in the 2.6 branch is too racy to make a fix practical. Since this can only happen in a race during reconfig, the workaround of restarting the stuck node will suffice.
