[SERVER-16277] Removing Replica Set Member + Failover = Can't write to replica set with w:2 Created: 21/Nov/14  Updated: 07/Apr/23  Resolved: 18/Mar/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.6.3, 2.6.5
Fix Version/s: 2.6.6

Type: Bug
Priority: Major - P3
Reporter: Jason Ford
Assignee: Ramon Fernandez Marina
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

1. Initialize replica set like this:

{
        "_id" : "test",
        "version" : 1,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "myServer:27001",
                        "priority" : 50
                },
                {
                        "_id" : 1,
                        "host" : "myServer:27002",
                        "priority" : 50
                },
                {
                        "_id" : 2,
                        "host" : "myServer:27003",
                        "priority" : 0
                },
                {
                        "_id" : 3,
                        "host" : "myServer:27004",
                        "priority" : 10
                }
        ]
}

2. Insert a document using w:2 (works):

db.test.insert({x:1}, { writeConcern : { w:2, wtimeout: 15000 }})

3. Reconfigure replica set with this configuration:

{
        "_id" : "test",
        "version" : 2,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "myServer:27001",
                        "priority" : 50
                },
                {
                        "_id" : 1,
                        "host" : "myServer:27002",
                        "priority" : 50
                },
                {
                        "_id" : 2,
                        "host" : "myServer:27003",
                        "priority" : 0
                }
        ]
}

4. Assuming the same server is primary before and after the reconfig, this will work:

db.test.insert({x:2}, { writeConcern : { w:2, wtimeout: 15000 }})

5. Fail over to the server on 27002 (rs.stepDown())

6. This operation times out:

db.test.insert({x:3}, { writeConcern : { w:2, wtimeout: 15000 }})

7. Fail back to 27001

8. This works again:

db.test.insert({x:2}, { writeConcern : { w:2, wtimeout: 15000 }})
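
The configurations in steps 1 and 3 differ only in the removed member and the version number, so the second can be derived from the first. The following is a minimal sketch of that derivation in plain JavaScript. `removeMember` is a hypothetical helper, not part of the mongo shell, and it assumes members are identified by their `host` string.

```javascript
// Hypothetical helper (not a mongo shell built-in): returns a copy of a
// replica set config without the given host and with the version bumped.
function removeMember(cfg, host) {
    var members = cfg.members.filter(function (m) {
        return m.host !== host;
    });
    if (members.length === cfg.members.length) {
        throw new Error("no member with host " + host);
    }
    // Shallow-copy the top-level fields so the input config is not modified.
    var out = {};
    for (var key in cfg) {
        if (Object.prototype.hasOwnProperty.call(cfg, key)) {
            out[key] = cfg[key];
        }
    }
    out.version = cfg.version + 1;
    out.members = members;
    return out;
}

// Configuration from step 1.
var initialConfig = {
    _id: "test",
    version: 1,
    members: [
        { _id: 0, host: "myServer:27001", priority: 50 },
        { _id: 1, host: "myServer:27002", priority: 50 },
        { _id: 2, host: "myServer:27003", priority: 0 },
        { _id: 3, host: "myServer:27004", priority: 10 }
    ]
};

// Same content as the configuration shown in step 3.
var reducedConfig = removeMember(initialConfig, "myServer:27004");
```

In a mongo shell connected to the primary, the result could be applied with `rs.reconfig(removeMember(rs.conf(), "myServer:27004"))`. The built-in `rs.remove("myServer:27004")` makes the same change in one call.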

 Description   

We recently reduced the number of nodes in our replica set from 4 (3 + 1 hidden) to 3. After we removed the fourth node and reconfigured the remaining members, the replica set came back up fine. After a failover, however, the new primary does not accept any writes with a write concern greater than 1; they time out, as in step 6. Failing back to the original primary makes those writes work again. There is a workaround: restarting all mongod processes after the reconfig restores normal behavior. We have consistently reproduced this bug on versions 2.6.3 and 2.6.5. The issue does not appear to be present in 2.8.0-rc0.
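
One way to check whether a given primary is affected is to issue a single w:2 insert and inspect the result. The sketch below assumes the 2.6 shell's `WriteResult` object, with its `nInserted` field and `hasWriteConcernError()` method. `canWriteW2` is a hypothetical name, and the collection is passed in as an argument so that the function does not depend on the current database.

```javascript
// Hypothetical probe: returns true if a w:2 insert is applied and
// acknowledged within the timeout, false if the shell reports a write
// concern error (for example a replication timeout) or nothing was inserted.
function canWriteW2(coll, doc) {
    var res = coll.insert(doc, { writeConcern: { w: 2, wtimeout: 15000 } });
    return res.nInserted === 1 && !res.hasWriteConcernError();
}
```

For example, `canWriteW2(db.test, {x: 3})` run against the new primary after the failover would be expected to return false on an affected version, and true again after failing back or restarting the mongod processes.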



 Comments   
Comment by Ramon Fernandez Marina [ 18/Mar/15 ]

fordjp, thanks for the detailed reproducer. This problem was fixed in 2.6.6, most probably as a side-effect of SERVER-15849. If you haven't upgraded yet please consider doing so.

Thanks,
Ramón.

Comment by Ramon Fernandez Marina [ 18/Mar/15 ]

fordjp, apologies for the long delay in getting back to you. I can reproduce the behavior you describe on 2.6.5 and I'm investigating.
