Core Server
  1. Core Server
  2. SERVER-2988

mongos fails to start initially when any config server is unavailable

    Details

    • Type: Bug Bug
    • Status: Closed Closed
    • Priority: Major - P3 Major - P3
    • Resolution: Fixed
    • Affects Version/s: 1.8.0
    • Fix Version/s: 2.0.5, 2.1.1
    • Component/s: Sharding
    • Labels:
      None
    • Environment:
      OSX, but reported by a customer probably not on OSX
    • Backport:
      Done
    • Operating System:
      ALL
    • # Replies:
      9
    • Last comment by Customer:
      false

      Description

      When 1 or 2 configdbs present in the mongos command line aren't available, mongos fails to start.

      Here are some reproduction scripts for a single host, and then the log and failure message when 1 or 2 configsrvs are down.

      1. Setup
        mkdir /tmp/logs
        mkdir /tmp/config
        mkdir /tmp/config2
        mkdir /tmp/config3

      killall mongod mongos

      1. Start a configsvr
        ./mongod --port 28000 --logpath /tmp/logs/config.log --dbpath /tmp/config --directoryperdb --configsvr --quiet --logappend --fork
      2. Give the configsvr time to get listening.
        sleep 2
      3. Start a mongos
        ./mongos --port 27000 --logpath /tmp/logs/router.log --configdb "localhost:28000,localhost:28001,localhost:28002" --quiet --logappend --fork
      4. Observe that mongos failed to start
        sleep 2; pgrep mongos || echo "no mongos"
      1. Try again with 2 config servers.
        killall mongod mongos
        ./mongod --port 28000 --logpath /tmp/logs/config.log --dbpath /tmp/config --directoryperdb --configsvr --quiet --logappend --fork
        ./mongod --port 28001 --logpath /tmp/logs/config2.log --dbpath /tmp/config2 --directoryperdb --configsvr --quiet --logappend --fork
      2. Give the configsvrs time to get listening.
        sleep 2
        ./mongos --port 27000 --logpath /tmp/logs/router.log --configdb "localhost:28000,localhost:28001,localhost:28002" --quiet --logappend --fork
        sleep 2; pgrep mongos || echo "no mongos"
      1. Try again with 3 config servers
        killall mongod mongos
        ./mongod --port 28000 --logpath /tmp/logs/config.log --dbpath /tmp/config --directoryperdb --configsvr --quiet --logappend --fork
        ./mongod --port 28001 --logpath /tmp/logs/config2.log --dbpath /tmp/config2 --directoryperdb --configsvr --quiet --logappend --fork
        ./mongod --port 28002 --logpath /tmp/logs/config3.log --dbpath /tmp/config3 --directoryperdb --configsvr --quiet --logappend --fork
      2. Give the configsvrs time to get listening.
        sleep 2
        ./mongos --port 27000 --logpath /tmp/logs/router.log --configdb "localhost:28000,localhost:28001,localhost:28002" --quiet --logappend --fork
        sleep 2; pgrep mongos || echo "no mongos"
              • SERVER RESTARTED *****

      Fri Apr 22 18:19:24 ./mongos db version v1.9.0-pre-, pdfile version 4.5 starting (--help for usage)
      Fri Apr 22 18:19:24 git version: df30db031cd1531705f23fa6ed75a88e2116568f
      Fri Apr 22 18:19:24 build sys info: Darwin Richard-Kreuters-MacBook-Pro.localdomain 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386 BOOST_LIB_VERSION=1_41
      Fri Apr 22 18:19:24 warning: couldn't check on config server:localhost:28001 ok for now : 11002 socket exception [6] server [localhost:28001] mongos connectionpool error: couldn't connect to server localhost:28001
      Fri Apr 22 18:19:24 warning: couldn't check on config server:localhost:28002 ok for now : 11002 socket exception [6] server [localhost:28002] mongos connectionpool error: couldn't connect to server localhost:28002
      Fri Apr 22 18:19:24 warning: only 1 config server reachable, continuing
      Fri Apr 22 18:19:24 SyncClusterConnection connecting to [localhost:28000]
      Fri Apr 22 18:19:24 SyncClusterConnection connecting to [localhost:28001]
      Fri Apr 22 18:19:24 SyncClusterConnection connect fail to: localhost:28001 errmsg: couldn't connect to server localhost:28001
      Fri Apr 22 18:19:24 SyncClusterConnection connecting to [localhost:28002]
      Fri Apr 22 18:19:24 SyncClusterConnection connect fail to: localhost:28002 errmsg: couldn't connect to server localhost:28002
      Fri Apr 22 18:19:24 trying reconnect to localhost:28001
      Fri Apr 22 18:19:24 reconnect localhost:28001 failed couldn't connect to server localhost:28001
      Fri Apr 22 18:19:24 trying reconnect to localhost:28002
      Fri Apr 22 18:19:24 reconnect localhost:28002 failed couldn't connect to server localhost:28002
      Fri Apr 22 18:19:24 ~ScopedDbConnection: _conn != null
      uncaught exception in mongos main:
      8003 SyncClusterConnection::insert prepare failed: 9001 socket exception [6] localhost:28001:{}9001 socket exception [6] localhost:28002:{}

              • SERVER RESTARTED *****

      Fri Apr 22 18:19:28 ./mongos db version v1.9.0-pre-, pdfile version 4.5 starting (--help for usage)
      Fri Apr 22 18:19:28 git version: df30db031cd1531705f23fa6ed75a88e2116568f
      Fri Apr 22 18:19:28 build sys info: Darwin Richard-Kreuters-MacBook-Pro.localdomain 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386 BOOST_LIB_VERSION=1_41
      Fri Apr 22 18:19:28 warning: couldn't check on config server:localhost:28002 ok for now : 11002 socket exception [6] server [localhost:28002] mongos connectionpool error: couldn't connect to server localhost:28002
      Fri Apr 22 18:19:28 SyncClusterConnection connecting to [localhost:28000]
      Fri Apr 22 18:19:28 SyncClusterConnection connecting to [localhost:28001]
      Fri Apr 22 18:19:28 SyncClusterConnection connecting to [localhost:28002]
      Fri Apr 22 18:19:28 SyncClusterConnection connect fail to: localhost:28002 errmsg: couldn't connect to server localhost:28002
      Fri Apr 22 18:19:28 trying reconnect to localhost:28002
      Fri Apr 22 18:19:28 reconnect localhost:28002 failed couldn't connect to server localhost:28002
      Fri Apr 22 18:19:28 ~ScopedDbConnection: _conn != null
      uncaught exception in mongos main:
      8003 SyncClusterConnection::insert prepare failed: 9001 socket exception [6] localhost:28002:{}

              • SERVER RESTARTED *****

      Fri Apr 22 18:19:32 ./mongos db version v1.9.0-pre-, pdfile version 4.5 starting (--help for usage)
      Fri Apr 22 18:19:32 git version: df30db031cd1531705f23fa6ed75a88e2116568f
      Fri Apr 22 18:19:32 build sys info: Darwin Richard-Kreuters-MacBook-Pro.localdomain 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386 BOOST_LIB_VERSION=1_41
      Fri Apr 22 18:19:32 SyncClusterConnection connecting to [localhost:28000]
      Fri Apr 22 18:19:32 SyncClusterConnection connecting to [localhost:28001]
      Fri Apr 22 18:19:32 SyncClusterConnection connecting to [localhost:28002]
      Fri Apr 22 18:19:32 [Balancer] about to contact config servers and shards
      Fri Apr 22 18:19:32 [mongosMain] waiting for connections on port 27000
      Fri Apr 22 18:19:32 [websvr] web admin interface listening on port 28000
      Fri Apr 22 18:19:32 [websvr] listen(): bind() failed errno:48 Address already in use for socket: 0.0.0.0:28000
      Fri Apr 22 18:19:32 [websvr] addr already in use
      Fri Apr 22 18:19:32 [Balancer] config servers and shards contacted successfully
      Fri Apr 22 18:19:32 [Balancer] balancer id: Richard-Kreuters-MacBook-Pro.localdomain:27000 started at Apr 22 18:19:32
      Fri Apr 22 18:19:32 [Balancer] created new distributed lock for balancer on localhost:28000,localhost:28001,localhost:28002 ( lock timeout : 900000, legacy timeout : 0, ping interval : 30000, process : 0, legacy : 0 )
      Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28000]
      Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28001]
      Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28002]
      Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28000]
      Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28001]
      Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28002]
      Fri Apr 22 18:19:32 [LockPinger] creating distributed lock ping thread for localhost:28000,localhost:28001,localhost:28002 and process Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807 (sleeping for 30000ms)
      Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28000]
      Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28001]
      Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28002]
      Fri Apr 22 18:19:32 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' acquired, now :

      { _id: "balancer", process: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807", state: 1, ts: ObjectId('4db1fef4c266c1ba533df144'), when: new Date(1303510772809), who: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807:Balanc...", why: "doing balance round" }

      Fri Apr 22 18:19:42 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' unlocked.
      Fri Apr 22 18:19:42 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' acquired, now :

      { _id: "balancer", process: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807", state: 1, ts: ObjectId('4db1fefec266c1ba533df145'), when: new Date(1303510782935), who: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807:Balanc...", why: "doing balance round" }

      Fri Apr 22 18:19:53 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' unlocked.
      Fri Apr 22 18:19:53 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' acquired, now :

      { _id: "balancer", process: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807", state: 1, ts: ObjectId('4db1ff09c266c1ba533df146'), when: new Date(1303510793041), who: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807:Balanc...", why: "doing balance round" }

      Fri Apr 22 18:20:03 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' unlocked.
      Fri Apr 22 18:20:03 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' acquired, now :

      { _id: "balancer", process: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807", state: 1, ts: ObjectId('4db1ff13c266c1ba533df147'), when: new Date(1303510803142), who: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807:Balanc...", why: "doing balance round" }

        Issue Links

          Activity

          Hide
          Greg Studer
          added a comment -

          So this is actually 2 separate issues it appears - the very first time you initialize a cluster, you have to have all three config servers writable to create the config.version collection, which is what the above test reproduces. Opened DOCS-153 to add this to the docs, improved messaging should be added as well.

          Show
          Greg Studer
          added a comment - So this is actually 2 separate issues it appears - the very first time you initialize a cluster, you have to have all three config servers writable to create the config.version collection, which is what the above test reproduces. Opened DOCS-153 to add this to the docs, improved messaging should be added as well.
          Hide
          Greg Studer
          added a comment -
          Show
          Greg Studer
          added a comment - http://codereview.10gen.com/7490071/
          Hide
          Greg Studer
          added a comment -

          Patch above should fix this issue - but related issue SERVER-5240 is not yet reproduced.

          Show
          Greg Studer
          added a comment - Patch above should fix this issue - but related issue SERVER-5240 is not yet reproduced.
          Hide
          Greg Studer
          added a comment -

          Additional try/catch needed in reloadSettings when trying to default-regenerate config.settings chunksize, otherwise aborts mongos startup.

          Show
          Greg Studer
          added a comment - Additional try/catch needed in reloadSettings when trying to default-regenerate config.settings chunksize, otherwise aborts mongos startup.
          Hide
          auto
          added a comment -

          Author:

          {u'login': u'gregstuder', u'name': u'Greg Studer', u'email': u'greg@10gen.com'}

          Message: SERVER-2988 give error message when initial version can't be written on mongos startup
          Branch: master
          https://github.com/mongodb/mongo/commit/3693972015a4e61efa8543eb7d532b9bbf4d44de

          Show
          auto
          added a comment - Author: {u'login': u'gregstuder', u'name': u'Greg Studer', u'email': u'greg@10gen.com'} Message: SERVER-2988 give error message when initial version can't be written on mongos startup Branch: master https://github.com/mongodb/mongo/commit/3693972015a4e61efa8543eb7d532b9bbf4d44de
          Hide
          auto
          added a comment -

          Author:

          {u'login': u'gregstuder', u'name': u'gregs', u'email': u'greg@10gen.com'}

          Message: SERVER-2988 catch exceptions loading chunksize
          Branch: master
          https://github.com/mongodb/mongo/commit/8784441ef2a234025a57213335fbd5c1fea64564

          Show
          auto
          added a comment - Author: {u'login': u'gregstuder', u'name': u'gregs', u'email': u'greg@10gen.com'} Message: SERVER-2988 catch exceptions loading chunksize Branch: master https://github.com/mongodb/mongo/commit/8784441ef2a234025a57213335fbd5c1fea64564
          Hide
          Greg Studer
          added a comment -

          Fixed, but additional linked issue may remain.

          Show
          Greg Studer
          added a comment - Fixed, but additional linked issue may remain.
          Hide
          auto
          added a comment -

          Author:

          {u'login': u'gregstuder', u'name': u'Greg Studer', u'email': u'greg@10gen.com'}

          Message: SERVER-2988 give error message when initial version can't be written on mongos startup
          Branch: v2.0
          https://github.com/mongodb/mongo/commit/e69445666bd8acfb1fac1e826e610fac845019a7

          Show
          auto
          added a comment - Author: {u'login': u'gregstuder', u'name': u'Greg Studer', u'email': u'greg@10gen.com'} Message: SERVER-2988 give error message when initial version can't be written on mongos startup Branch: v2.0 https://github.com/mongodb/mongo/commit/e69445666bd8acfb1fac1e826e610fac845019a7
          Hide
          auto
          added a comment -

          Author:

          {u'login': u'gregstuder', u'name': u'gregs', u'email': u'greg@10gen.com'}

          Message: SERVER-2988 catch exceptions loading chunksize

          Conflicts:

          s/config.cpp
          Branch: v2.0
          https://github.com/mongodb/mongo/commit/ba8d0f572c9108c0c2f279a393a854c59d475008

          Show
          auto
          added a comment - Author: {u'login': u'gregstuder', u'name': u'gregs', u'email': u'greg@10gen.com'} Message: SERVER-2988 catch exceptions loading chunksize Conflicts: s/config.cpp Branch: v2.0 https://github.com/mongodb/mongo/commit/ba8d0f572c9108c0c2f279a393a854c59d475008

            People

            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:
                Days since reply:
                2 years, 1 week, 1 day ago
                Date of 1st Reply: