[SERVER-37162] Bad host string given to rs.add() on Replicated Config servers may take down entire cluster Created: 17/Sep/18  Updated: 19/Sep/18  Resolved: 18/Sep/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Dave Muysson Assignee: Nick Brewer
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-37190 Don't allow adding replica set connec... Closed
Operating System: ALL
Steps To Reproduce:

In a sharded environment running with replicated config servers, perform an rs.add() that includes the replica set name in front of the host's DNS name and port.

 

i.e. rs.add("replsetname/host:port") instead of rs.add("host:port")
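
For illustration, here is roughly what the bad form leaves behind in the replica set config (hostnames follow the report; this sketch is not taken from the actual incident):

// Incorrect: the set name is embedded, and the whole string is stored
// verbatim as the new member's "host" value.
rs.add("csReplSet/dev-config-3.domain.com:27019")

// Correct: a bare host:port only.
rs.add("dev-config-3.domain.com:27019")

// After the incorrect add, the bad value is visible in the config, e.g.:
// rs.conf().members[3].host  ->  "csReplSet/dev-config-3.domain.com:27019"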

Participants:

 Description   

We have a sharded development cluster running 3.2.18 that we are moving from SCCC to Replicated config servers.

The cluster consists of 3 config servers, 18 mongod shards, and 5 mongos instances.

At one point in the migration, when we were already using the replicated config server configuration, we accidentally issued a bad rs.add() command from the mongo shell against the primary config server:

 

> rs.add("csReplSet/dev-config-2.domain.com:27019")

We should not have included the initial "csReplSet/" within the string (human error).

However, what happened next was concerning.

While the config servers themselves were fine (rs.status() showed it could not reach the new host), every mongos and mongod host logged a backtrace, dumped core, and terminated.

 

Here is what the config server's rs.status() reported for the added member:

 "_id" : 3,
            "name" : "csReplSet/dev-config-3.domain.com:27019",
            "health" : 0,
            "state" : 8,
            "stateStr" : "(not reachable/healthy)",
            "uptime" : 0,
            "optime" : {
                "ts" : Timestamp(0, 0),
                "t" : NumberLong(-1)
            },

 

Here's the log output from one of the shard mongod processes just before the backtrace.

2018-09-14T15:38:55.179+0000 I NETWORK  [ReplicaSetMonitorWatcher] changing hosts to csReplSet/dev-config-0.domain.com:27019,dev-config-4.domain.com:27019,csReplSet/dev-config-3.domain.com:27019 from csReplSet/dev-config-0.domain.com:27019,dev-config-4.domain.com:27019
2018-09-14T15:38:55.179+0000 I -        [ReplicaSetMonitorWatcher] Invariant failure setName == connString.getSetName() src/mongo/s/config.cpp 770
2018-09-14T15:38:55.179+0000 I -        [ReplicaSetMonitorWatcher]

It looks like the mongos and mongod hosts tried to update their config server host list to include the new member, but did not validate the host string before trying to use it?
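
For what it's worth, even a purely client-side guard would have caught this before it reached the config, since a member "host" must be a bare host:port with no set-name prefix. The helper below is only a sketch (safeAddMember is not a built-in shell function):

// Illustrative only - not part of the mongo shell. Refuse anything that looks
// like a replica set connection string ("setName/host:port") before calling rs.add().
function safeAddMember(host) {
    if (typeof host !== "string" || host.indexOf("/") !== -1) {
        throw Error("member host must be a bare host:port, got: " + host);
    }
    return rs.add(host);
}

safeAddMember("dev-config-3.domain.com:27019")            // proceeds as normal
safeAddMember("csReplSet/dev-config-3.domain.com:27019")  // throws instead of poisoning the config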

 

We recovered our development environment from backup and will be testing our process again. While I don't have the full set of log files to provide here, we could try this again if you need more details.

Reproducing it shouldn't be hard, though: just add a bad host string to the config server replica set!



 Comments   
Comment by Nick Brewer [ 18/Sep/18 ]

dave.muysson@360pi.com We've determined that the best way to prevent this is to strictly disallow including connection strings in a replica set config. We've opened a separate ticket to track this work, which you can follow here: SERVER-37190

Since we're now tracking this elsewhere, I'm going to go ahead and close this ticket. Thanks again for your detailed report, and please let us know if you have any questions.

-Nick

Comment by Spencer Brody (Inactive) [ 18/Sep/18 ]

This does seem like a real bug. I think the proper fix is that we shouldn't allow a replica set connection string for a 'host' field in a replica set config. I filed SERVER-37190 for that change.
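
In practical terms the rule is just that a member 'host' value must never contain a '/'. As a rough shell-side sketch of the check (the real validation in SERVER-37190 belongs in the server's config parsing):

// Sketch only: flag any member whose "host" embeds a set name, i.e. contains "/".
rs.conf().members.filter(function (m) { return m.host.indexOf("/") !== -1; })
// -> in this incident: [ { "_id" : 3, "host" : "csReplSet/dev-config-3.domain.com:27019", ... } ]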

Comment by Dave Muysson [ 18/Sep/18 ]

Thanks Nick - Very happy to hear you were able to reproduce it locally!

If there's anything we can do to help out on our end, just let us know.

-Dave

Comment by Nick Brewer [ 17/Sep/18 ]

dave.muysson@360pi.com Thanks for your report. I've managed to reproduce this, and we're currently investigating.

-Nick
