Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-2988

mongos fails to start initially when any config server is unavailable

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • 2.0.5, 2.1.1
    • Affects Version/s: 1.8.0
    • Component/s: Sharding
    • Labels:
      None
    • Environment:
      OSX, but reported by a customer probably not on OSX
    • ALL

      When 1 or 2 configdbs present in the mongos command line aren't available, mongos fails to start.

      Here are some reproduction scripts for a single host, and then the log and failure message when 1 or 2 configsrvs are down.

      1. Setup
        mkdir /tmp/logs
        mkdir /tmp/config
        mkdir /tmp/config2
        mkdir /tmp/config3

      killall mongod mongos

      1. Start a configsvr
        ./mongod --port 28000 --logpath /tmp/logs/config.log --dbpath /tmp/config --directoryperdb --configsvr --quiet --logappend --fork
      2. Give the configsvr time to get listening.
        sleep 2
      3. Start a mongos
        ./mongos --port 27000 --logpath /tmp/logs/router.log --configdb "localhost:28000,localhost:28001,localhost:28002" --quiet --logappend --fork
      4. Observe that mongos failed to start
        sleep 2; pgrep mongos || echo "no mongos"
      1. Try again with 2 config servers.
        killall mongod mongos
        ./mongod --port 28000 --logpath /tmp/logs/config.log --dbpath /tmp/config --directoryperdb --configsvr --quiet --logappend --fork
        ./mongod --port 28001 --logpath /tmp/logs/config2.log --dbpath /tmp/config2 --directoryperdb --configsvr --quiet --logappend --fork
      2. Give the configsvrs time to get listening.
        sleep 2
        ./mongos --port 27000 --logpath /tmp/logs/router.log --configdb "localhost:28000,localhost:28001,localhost:28002" --quiet --logappend --fork
        sleep 2; pgrep mongos || echo "no mongos"
      1. Try again with 3 config servers
        killall mongod mongos
        ./mongod --port 28000 --logpath /tmp/logs/config.log --dbpath /tmp/config --directoryperdb --configsvr --quiet --logappend --fork
        ./mongod --port 28001 --logpath /tmp/logs/config2.log --dbpath /tmp/config2 --directoryperdb --configsvr --quiet --logappend --fork
        ./mongod --port 28002 --logpath /tmp/logs/config3.log --dbpath /tmp/config3 --directoryperdb --configsvr --quiet --logappend --fork
      2. Give the configsvrs time to get listening.
        sleep 2
        ./mongos --port 27000 --logpath /tmp/logs/router.log --configdb "localhost:28000,localhost:28001,localhost:28002" --quiet --logappend --fork
        sleep 2; pgrep mongos || echo "no mongos"
              • SERVER RESTARTED *****

      Fri Apr 22 18:19:24 ./mongos db version v1.9.0-pre-, pdfile version 4.5 starting (--help for usage)
      Fri Apr 22 18:19:24 git version: df30db031cd1531705f23fa6ed75a88e2116568f
      Fri Apr 22 18:19:24 build sys info: Darwin Richard-Kreuters-MacBook-Pro.localdomain 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386 BOOST_LIB_VERSION=1_41
      Fri Apr 22 18:19:24 warning: couldn't check on config server:localhost:28001 ok for now : 11002 socket exception [6] server [localhost:28001] mongos connectionpool error: couldn't connect to server localhost:28001
      Fri Apr 22 18:19:24 warning: couldn't check on config server:localhost:28002 ok for now : 11002 socket exception [6] server [localhost:28002] mongos connectionpool error: couldn't connect to server localhost:28002
      Fri Apr 22 18:19:24 warning: only 1 config server reachable, continuing
      Fri Apr 22 18:19:24 SyncClusterConnection connecting to [localhost:28000]
      Fri Apr 22 18:19:24 SyncClusterConnection connecting to [localhost:28001]
      Fri Apr 22 18:19:24 SyncClusterConnection connect fail to: localhost:28001 errmsg: couldn't connect to server localhost:28001
      Fri Apr 22 18:19:24 SyncClusterConnection connecting to [localhost:28002]
      Fri Apr 22 18:19:24 SyncClusterConnection connect fail to: localhost:28002 errmsg: couldn't connect to server localhost:28002
      Fri Apr 22 18:19:24 trying reconnect to localhost:28001
      Fri Apr 22 18:19:24 reconnect localhost:28001 failed couldn't connect to server localhost:28001
      Fri Apr 22 18:19:24 trying reconnect to localhost:28002
      Fri Apr 22 18:19:24 reconnect localhost:28002 failed couldn't connect to server localhost:28002
      Fri Apr 22 18:19:24 ~ScopedDbConnection: _conn != null
      uncaught exception in mongos main:
      8003 SyncClusterConnection::insert prepare failed: 9001 socket exception [6] localhost:28001:{}9001 socket exception [6] localhost:28002:{}

              • SERVER RESTARTED *****

      Fri Apr 22 18:19:28 ./mongos db version v1.9.0-pre-, pdfile version 4.5 starting (--help for usage)
      Fri Apr 22 18:19:28 git version: df30db031cd1531705f23fa6ed75a88e2116568f
      Fri Apr 22 18:19:28 build sys info: Darwin Richard-Kreuters-MacBook-Pro.localdomain 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386 BOOST_LIB_VERSION=1_41
      Fri Apr 22 18:19:28 warning: couldn't check on config server:localhost:28002 ok for now : 11002 socket exception [6] server [localhost:28002] mongos connectionpool error: couldn't connect to server localhost:28002
      Fri Apr 22 18:19:28 SyncClusterConnection connecting to [localhost:28000]
      Fri Apr 22 18:19:28 SyncClusterConnection connecting to [localhost:28001]
      Fri Apr 22 18:19:28 SyncClusterConnection connecting to [localhost:28002]
      Fri Apr 22 18:19:28 SyncClusterConnection connect fail to: localhost:28002 errmsg: couldn't connect to server localhost:28002
      Fri Apr 22 18:19:28 trying reconnect to localhost:28002
      Fri Apr 22 18:19:28 reconnect localhost:28002 failed couldn't connect to server localhost:28002
      Fri Apr 22 18:19:28 ~ScopedDbConnection: _conn != null
      uncaught exception in mongos main:
      8003 SyncClusterConnection::insert prepare failed: 9001 socket exception [6] localhost:28002:{}

              • SERVER RESTARTED *****

      Fri Apr 22 18:19:32 ./mongos db version v1.9.0-pre-, pdfile version 4.5 starting (--help for usage)
      Fri Apr 22 18:19:32 git version: df30db031cd1531705f23fa6ed75a88e2116568f
      Fri Apr 22 18:19:32 build sys info: Darwin Richard-Kreuters-MacBook-Pro.localdomain 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386 BOOST_LIB_VERSION=1_41
      Fri Apr 22 18:19:32 SyncClusterConnection connecting to [localhost:28000]
      Fri Apr 22 18:19:32 SyncClusterConnection connecting to [localhost:28001]
      Fri Apr 22 18:19:32 SyncClusterConnection connecting to [localhost:28002]
      Fri Apr 22 18:19:32 [Balancer] about to contact config servers and shards
      Fri Apr 22 18:19:32 [mongosMain] waiting for connections on port 27000
      Fri Apr 22 18:19:32 [websvr] web admin interface listening on port 28000
      Fri Apr 22 18:19:32 [websvr] listen(): bind() failed errno:48 Address already in use for socket: 0.0.0.0:28000
      Fri Apr 22 18:19:32 [websvr] addr already in use
      Fri Apr 22 18:19:32 [Balancer] config servers and shards contacted successfully
      Fri Apr 22 18:19:32 [Balancer] balancer id: Richard-Kreuters-MacBook-Pro.localdomain:27000 started at Apr 22 18:19:32
      Fri Apr 22 18:19:32 [Balancer] created new distributed lock for balancer on localhost:28000,localhost:28001,localhost:28002 ( lock timeout : 900000, legacy timeout : 0, ping interval : 30000, process : 0, legacy : 0 )
      Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28000]
      Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28001]
      Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28002]
      Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28000]
      Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28001]
      Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28002]
      Fri Apr 22 18:19:32 [LockPinger] creating distributed lock ping thread for localhost:28000,localhost:28001,localhost:28002 and process Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807 (sleeping for 30000ms)
      Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28000]
      Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28001]
      Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28002]
      Fri Apr 22 18:19:32 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' acquired, now :

      { _id: "balancer", process: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807", state: 1, ts: ObjectId('4db1fef4c266c1ba533df144'), when: new Date(1303510772809), who: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807:Balanc...", why: "doing balance round" }

      Fri Apr 22 18:19:42 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' unlocked.
      Fri Apr 22 18:19:42 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' acquired, now :

      { _id: "balancer", process: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807", state: 1, ts: ObjectId('4db1fefec266c1ba533df145'), when: new Date(1303510782935), who: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807:Balanc...", why: "doing balance round" }

      Fri Apr 22 18:19:53 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' unlocked.
      Fri Apr 22 18:19:53 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' acquired, now :

      { _id: "balancer", process: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807", state: 1, ts: ObjectId('4db1ff09c266c1ba533df146'), when: new Date(1303510793041), who: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807:Balanc...", why: "doing balance round" }

      Fri Apr 22 18:20:03 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' unlocked.
      Fri Apr 22 18:20:03 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' acquired, now :

      { _id: "balancer", process: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807", state: 1, ts: ObjectId('4db1ff13c266c1ba533df147'), when: new Date(1303510803142), who: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807:Balanc...", why: "doing balance round" }

            Assignee:
            greg_10gen Greg Studer
            Reporter:
            richard.kreuter Richard Kreuter (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

              Created:
              Updated:
              Resolved: