Core Server / SERVER-37162

Bad host string given to rs.add() on Replicated Config servers may take down entire cluster

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Replication
    • Labels: None
    • ALL

      In a sharded cluster whose config servers run as a replica set, run rs.add() with the replica set name prepended to the host DNS name and port,

      i.e. rs.add("replsetname/host:port") instead of rs.add("host:port").


      We have a sharded development cluster running 3.2.18 that we are migrating from SCCC config servers to replica set config servers.

      The cluster consists of 3 config servers, 18 mongod shards, and 5 mongos instances.

      At one point in the migration, after the cluster had already switched to the replica set config servers, we accidentally issued a bad rs.add() command from the mongo shell against the primary config server:

      > rs.add("csReplSet/dev-config-2.domain.com:27019")

      We should not have included the initial "csReplSet/" within the string (human error).
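A guard along the following lines could catch this kind of typo before the command is sent. It is only a sketch: `checkMemberHost` is a hypothetical helper, not part of the mongo shell, and its rule (no "/", no ",", a trailing numeric port) is an assumption about what a plain member address should look like. It does not handle bracketed IPv6 addresses.

```javascript
// Hypothetical pre-flight check for a replica set member address.
// Assumption: a member is a bare "host:port" with no "setName/" prefix.
function checkMemberHost(hostStr) {
  if (typeof hostStr !== "string" || hostStr.length === 0) {
    return { ok: false, reason: "host must be a non-empty string" };
  }
  if (hostStr.indexOf("/") !== -1) {
    return { ok: false, reason: "unexpected '/' (replica set name prefix?)" };
  }
  if (hostStr.indexOf(",") !== -1) {
    return { ok: false, reason: "expected a single host, not a list" };
  }
  // Everything before the last colon is the host, the digits after it the port.
  var m = /^([^:\s]+):(\d+)$/.exec(hostStr);
  if (!m) {
    return { ok: false, reason: "expected host:port" };
  }
  var port = Number(m[2]);
  if (port < 1 || port > 65535) {
    return { ok: false, reason: "port out of range" };
  }
  return { ok: true, host: m[1], port: port };
}

// The string from the report is rejected; the intended one passes.
console.log(checkMemberHost("csReplSet/dev-config-2.domain.com:27019").ok); // false
console.log(checkMemberHost("dev-config-2.domain.com:27019").ok);           // true
```

In the mongo shell, one might call rs.add(h) only when checkMemberHost(h).ok is true.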

      However, what happened next was concerning.

      While the config servers themselves stayed up (rs.status() showed only that the new host was unreachable), every mongos and mongod process produced a backtrace / core dump and terminated.

      Here is what rs.status() on the config server reported for the added host:

      "_id" : 3,
      "name" : "csReplSet/dev-config-3.domain.com:27019",
      "health" : 0,
      "state" : 8,
      "stateStr" : "(not reachable/healthy)",
      "uptime" : 0,
      "optime" : {
          "ts" : Timestamp(0, 0),
          "t" : NumberLong(-1)
      },
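A member like the one above can be spotted programmatically by scanning the status document for names that contain a "/". The following is a minimal sketch, not an official tool: `findSuspectMembers` is a hypothetical helper, and the sample document is made-up data shaped like the fragment above (the first member and its state are invented for illustration).

```javascript
// Hypothetical helper: list members whose "name" contains a "/",
// which a plain host:port address should never have.
function findSuspectMembers(status) {
  var members = (status && status.members) || [];
  return members
    .filter(function (m) {
      return typeof m.name === "string" && m.name.indexOf("/") !== -1;
    })
    .map(function (m) {
      return { _id: m._id, name: m.name, stateStr: m.stateStr };
    });
}

// Made-up sample shaped like rs.status() output.
var sampleStatus = {
  set: "csReplSet",
  members: [
    { _id: 0, name: "dev-config-0.domain.com:27019", health: 1, stateStr: "PRIMARY" },
    { _id: 3, name: "csReplSet/dev-config-3.domain.com:27019", health: 0,
      stateStr: "(not reachable/healthy)" }
  ]
};

var suspects = findSuspectMembers(sampleStatus);
console.log(suspects.length);  // 1
console.log(suspects[0].name); // csReplSet/dev-config-3.domain.com:27019
```

In the mongo shell the same function could be given the result of rs.status() directly.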

       

      Here's the log output from a mongod (shard) just before the backtrace.

      2018-09-14T15:38:55.179+0000 I NETWORK  [ReplicaSetMonitorWatcher] changing hosts to csReplSet/dev-config-0.domain.com:27019,dev-config-4.domain.com:27019,csReplSet/dev-config-3.domain.com:27019 from csReplSet/dev-config-0.domain.com:27019,dev-config-4.domain.com:27019
      2018-09-14T15:38:55.179+0000 I -        [ReplicaSetMonitorWatcher] Invariant failure setName == connString.getSetName() src/mongo/s/config.cpp 770
      2018-09-14T15:38:55.179+0000 I -        [ReplicaSetMonitorWatcher]

      It looks like the mongos and mongod processes tried to add the new host to their config server list without first validating the host string.
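To make the malformed entry in the "changing hosts to" log line easier to see, the logged string can be split into a set name and a host list. The sketch below is a hypothetical reading aid, not the server's own parsing code in src/mongo/s/config.cpp; the splitting rule (set name before the first "/", then comma-separated hosts) is an assumption about the format.

```javascript
// Hypothetical helper: split "setName/host1,host2,..." and flag any host
// entry that still contains a "/" after the set name has been removed.
function splitSetConnString(connStr) {
  var slash = connStr.indexOf("/");
  var setName = slash === -1 ? "" : connStr.slice(0, slash);
  var rest = slash === -1 ? connStr : connStr.slice(slash + 1);
  var hosts = rest.split(",");
  var malformed = hosts.filter(function (h) {
    return h.indexOf("/") !== -1;
  });
  return { setName: setName, hosts: hosts, malformed: malformed };
}

// The "changing hosts to" value from the log line above.
var logged = "csReplSet/dev-config-0.domain.com:27019," +
             "dev-config-4.domain.com:27019," +
             "csReplSet/dev-config-3.domain.com:27019";

var parsed = splitSetConnString(logged);
console.log(parsed.setName);   // csReplSet
console.log(parsed.malformed); // [ 'csReplSet/dev-config-3.domain.com:27019' ]
```

Under this reading, the third entry carries its own "csReplSet/" prefix, which matches the bad string passed to rs.add().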

       

      We recovered our development environment from backup and are going to test our process again. I don't have the full set of log files to provide here, but we could try this again if you need more details.

      Reproducing it shouldn't be hard, though: just add a bad host to the config server replica set!

            Assignee: Nick Brewer (nick.brewer)
            Reporter: Dave Muysson (dave.muysson@360pi.com)
            Votes: 0
            Watchers: 7
