Core Server / SERVER-6178

Cannot use mongos if subset of config servers can't read from or write to disk

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 2.0.7, 2.2.0-rc0
    • Affects Version/s: 2.0.1, 2.0.2, 2.0.3, 2.0.4, 2.0.5, 2.0.6
    • Component/s: Sharding
    • Labels: None
    • Environment: Any
    • Operating System: Linux

      This bug does not normally affect the mongo system we have set up. However, when AWS lost power to one of our EBS volumes, it became very apparent that we could not start any more mongos processes, so our production system came down.

      Basics

      While it is not easy to get AWS to lose power to EBS volumes, it is very easy to reproduce this bug using NFS and iptables. We'll have one NFS server and one NFS client. The client will run all of the mongod and mongos instances; the NFS server will host a single share that the client will use as the data directory for one of the config servers.
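
      Throughout the steps below, <IP of NFS server> and <IP of NFS client> are
      placeholders. If you prefer shell variables, a minimal sketch (the
      addresses are assumptions about your network, not part of this report):

      NFS_SERVER_IP=10.0.0.10   # hypothetical: the machine exporting the share
      NFS_CLIENT_IP=10.0.0.11   # hypothetical: the machine running all mongod/mongos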

      NFS Server Setup

      sudo apt-get install nfs-kernel-server
      sudo mkdir -p /srv/nfs/mongo
      sudo vi /etc/exports

      # /etc/exports
      /srv/nfs/mongo <IP of NFS client>/32(rw,sync,no_subtree_check,no_root_squash)

      sudo /etc/init.d/nfs-kernel-server restart
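
      Before moving on, it may help to confirm the share is actually exported;
      exportfs -v lists the active exports and their options:

      sudo exportfs -v   # should show /srv/nfs/mongo with the options above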

      NFS Client Setup

      sudo apt-get install nfs-common
      sudo mkdir -p /nfs/mongo
      sudo vi /etc/fstab

      <IP of NFS server>:/srv/nfs/mongo /nfs/mongo nfs4 _netdev,auto 0 0

      sudo mount /nfs/mongo
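
      To verify the mount works end to end, write a file through it and check
      that it shows up (the file name here is arbitrary):

      mount | grep /nfs/mongo            # confirm the nfs4 mount is active
      sudo touch /nfs/mongo/write-test   # confirm writes reach the NFS server
      sudo rm /nfs/mongo/write-test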

      Mongo Setup (on same server as NFS Client)

      sudo mkdir -p /db/a1
      sudo mkdir /db/a2
      sudo mkdir /db/a3
      sudo mkdir /db/b1
      sudo mkdir /db/b2
      sudo mkdir /db/b3
      sudo mkdir /db/c1
      sudo mkdir /nfs/mongo/c2          # the symlink target must exist before mongod starts
      sudo ln -s /nfs/mongo/c2 /db/c2   # config server c2 keeps its data on the NFS share
      sudo mkdir /db/c3

      sudo mkdir /var/run/mongo
      sudo mkdir /db/logs

      /usr/bin/mongod --configsvr --smallfiles --fork --port 27050 --dbpath /db/c1 --logpath /db/logs/c1.log --logappend --pidfilepath /var/run/mongo/c1.pid --maxConns 1024
      /usr/bin/mongod --configsvr --smallfiles --fork --port 27051 --dbpath /db/c2 --logpath /db/logs/c2.log --logappend --pidfilepath /var/run/mongo/c2.pid --maxConns 1024
      /usr/bin/mongod --configsvr --smallfiles --fork --port 27052 --dbpath /db/c3 --logpath /db/logs/c3.log --logappend --pidfilepath /var/run/mongo/c3.pid --maxConns 1024

      /usr/bin/mongod --shardsvr --smallfiles --fork --port 27150 --dbpath /db/a1 --logpath /db/logs/a1.log --logappend --pidfilepath /var/run/mongo/a1.pid --maxConns 1024 --replSet a
      /usr/bin/mongod --shardsvr --smallfiles --fork --port 27151 --dbpath /db/a2 --logpath /db/logs/a2.log --logappend --pidfilepath /var/run/mongo/a2.pid --maxConns 1024 --replSet a
      /usr/bin/mongod --shardsvr --smallfiles --fork --port 27152 --dbpath /db/a3 --logpath /db/logs/a3.log --logappend --pidfilepath /var/run/mongo/a3.pid --maxConns 1024 --replSet a

      /usr/bin/mongod --shardsvr --smallfiles --fork --port 27250 --dbpath /db/b1 --logpath /db/logs/b1.log --logappend --pidfilepath /var/run/mongo/b1.pid --maxConns 1024 --replSet b
      /usr/bin/mongod --shardsvr --smallfiles --fork --port 27251 --dbpath /db/b2 --logpath /db/logs/b2.log --logappend --pidfilepath /var/run/mongo/b2.pid --maxConns 1024 --replSet b
      /usr/bin/mongod --shardsvr --smallfiles --fork --port 27252 --dbpath /db/b3 --logpath /db/logs/b3.log --logappend --pidfilepath /var/run/mongo/b3.pid --maxConns 1024 --replSet b

      sleep 10
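
      Before initiating the replica sets, it can help to confirm that all nine
      mongod processes are answering; a quick loop using the standard ping
      command (ports as configured above):

      for p in 27050 27051 27052 27150 27151 27152 27250 27251 27252; do
          echo "db.runCommand({ping: 1}).ok" | mongo --quiet localhost:$p
      done   # each iteration should print 1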

      echo "rs.initiate({_id: 'a', members: [{_id: 0, host: 'localhost:27150', priority: 2},{_id: 1, host: 'localhost:27151', priority: 1},{_id: 2, host: 'localhost:27152', priority: 0}]})" | mongo localhost:27150
      echo "rs.initiate({_id: 'b', members: [{_id: 0, host: 'localhost:27250', priority: 2},{_id: 1, host: 'localhost:27251', priority: 1},{_id: 2, host: 'localhost:27252', priority: 0}]})" | mongo localhost:27250

      sleep 30
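
      Rather than trusting a fixed sleep, you can check that each set has
      elected a primary; rs.status() reports per-member state:

      echo "rs.status().members.forEach(function(m) { print(m.name + ' ' + m.stateStr); })" | mongo --quiet localhost:27150
      echo "rs.status().members.forEach(function(m) { print(m.name + ' ' + m.stateStr); })" | mongo --quiet localhost:27250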

      Note that "mongo admin" connects to localhost:27017, which is the mongos
      started in the next step, so a mongos must already be running when these
      commands are issued.

      echo "db.runCommand({addshard: 'a/localhost:27150'})" | mongo admin
      echo "db.runCommand({addshard: 'b/localhost:27250'})" | mongo admin
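
      To confirm both shards registered, listShards is a standard admin command:

      echo "db.runCommand({listShards: 1})" | mongo admin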

      In a different terminal (one that can be tied up):

      /usr/bin/mongos --configdb localhost:27050,localhost:27051,localhost:27052 --fork --logpath /var/log/mongos.log --logappend --port 27017 --maxConns 1024

      Notice that mongos starts normally.
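
      To double-check the cluster through the new mongos, the shell's
      db.printShardingStatus() helper prints the shards and sharded databases:

      echo "db.printShardingStatus()" | mongo localhost:27017/admin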

      Baseline

      Connect, using mongo, to the mongos process. Insert some items. Find some items. Do whatever. Notice it all works as expected.
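
      For example, a minimal write-and-read round trip through the mongos (the
      database and collection names here are arbitrary):

      echo "db.items.insert({x: 1}); db.items.find().forEach(printjson);" | mongo localhost:27017/test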

      Kill the storage associated with one of the mongod config servers. On the NFS Server:

      sudo iptables -I INPUT -s <IP of NFS client>/32 -j DROP

      Connect, reconnect, etc. using the mongos process. Notice it all still works as expected.
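
      You can confirm the DROP rule is in place, and later remove it to restore
      the volume, with standard iptables invocations (run on the NFS server):

      sudo iptables -L INPUT -n --line-numbers   # the DROP rule should be listed first
      # to undo later: sudo iptables -D INPUT -s <IP of NFS client>/32 -j DROP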

      Bug Manifestation

      Kill the mongos process (it was started with --fork, so Ctrl-C won't reach it; kill it by pid instead, e.g. kill $(pidof mongos)). After it's down, start it up again using the same command as before.

      /usr/bin/mongos --configdb localhost:27050,localhost:27051,localhost:27052 --fork --logpath /var/log/mongos.log --logappend --port 27017 --maxConns 1024

      Notice that mongos hangs for about a minute and then dies.
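
      The reason for the exit is recorded in the mongos log; with --logappend
      set, the most recent attempt is at the end of the file:

      tail -n 50 /var/log/mongos.log   # the last lines show why startup failed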

      Expected Outcome

      Even though mongos connects successfully to the config server whose data store is down, it should time out its operations against that server and treat it as a downed server; this should result in a successful start of mongos.

            Assignee: Greg Studer (greg_10gen)
            Reporter: Matthew Barlocker (matthew@lucidchart.com)
            Votes: 0
            Watchers: 2
