[SERVER-2988] mongos fails to start initially when any config server is unavailable Created: 22/Apr/11  Updated: 12/Jul/16  Resolved: 29/Mar/12

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 1.8.0
Fix Version/s: 2.0.5, 2.1.1

Type: Bug Priority: Major - P3
Reporter: Richard Kreuter (Inactive) Assignee: Greg Studer
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OSX, but reported by a customer probably not on OSX


Issue Links:
Depends
depends on DOCS-153 docs should mention special-case of m... Closed
Related
related to SERVER-1936 can't start mongos if only a subset o... Closed
is related to SERVER-5240 mongos fails to connect to active clu... Closed
Operating System: ALL
Participants:

 Description   

When 1 or 2 configdbs present in the mongos command line aren't available, mongos fails to start.

Here are some reproduction scripts for a single host, and then the log and failure message when 1 or 2 configsrvs are down.

  1. Setup
    mkdir /tmp/logs
    mkdir /tmp/config
    mkdir /tmp/config2
    mkdir /tmp/config3

killall mongod mongos

  1. Start a configsvr
    ./mongod --port 28000 --logpath /tmp/logs/config.log --dbpath /tmp/config --directoryperdb --configsvr --quiet --logappend --fork
  2. Give the configsvr time to get listening.
    sleep 2
  3. Start a mongos
    ./mongos --port 27000 --logpath /tmp/logs/router.log --configdb "localhost:28000,localhost:28001,localhost:28002" --quiet --logappend --fork
  4. Observe that mongos failed to start
    sleep 2; pgrep mongos || echo "no mongos"
  1. Try again with 2 config servers.
    killall mongod mongos
    ./mongod --port 28000 --logpath /tmp/logs/config.log --dbpath /tmp/config --directoryperdb --configsvr --quiet --logappend --fork
    ./mongod --port 28001 --logpath /tmp/logs/config2.log --dbpath /tmp/config2 --directoryperdb --configsvr --quiet --logappend --fork
  2. Give the configsvrs time to get listening.
    sleep 2
    ./mongos --port 27000 --logpath /tmp/logs/router.log --configdb "localhost:28000,localhost:28001,localhost:28002" --quiet --logappend --fork
    sleep 2; pgrep mongos || echo "no mongos"
  1. Try again with 3 config servers
    killall mongod mongos
    ./mongod --port 28000 --logpath /tmp/logs/config.log --dbpath /tmp/config --directoryperdb --configsvr --quiet --logappend --fork
    ./mongod --port 28001 --logpath /tmp/logs/config2.log --dbpath /tmp/config2 --directoryperdb --configsvr --quiet --logappend --fork
    ./mongod --port 28002 --logpath /tmp/logs/config3.log --dbpath /tmp/config3 --directoryperdb --configsvr --quiet --logappend --fork
  2. Give the configsvrs time to get listening.
    sleep 2
    ./mongos --port 27000 --logpath /tmp/logs/router.log --configdb "localhost:28000,localhost:28001,localhost:28002" --quiet --logappend --fork
    sleep 2; pgrep mongos || echo "no mongos"
          • SERVER RESTARTED *****

Fri Apr 22 18:19:24 ./mongos db version v1.9.0-pre-, pdfile version 4.5 starting (--help for usage)
Fri Apr 22 18:19:24 git version: df30db031cd1531705f23fa6ed75a88e2116568f
Fri Apr 22 18:19:24 build sys info: Darwin Richard-Kreuters-MacBook-Pro.localdomain 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386 BOOST_LIB_VERSION=1_41
Fri Apr 22 18:19:24 warning: couldn't check on config server:localhost:28001 ok for now : 11002 socket exception [6] server [localhost:28001] mongos connectionpool error: couldn't connect to server localhost:28001
Fri Apr 22 18:19:24 warning: couldn't check on config server:localhost:28002 ok for now : 11002 socket exception [6] server [localhost:28002] mongos connectionpool error: couldn't connect to server localhost:28002
Fri Apr 22 18:19:24 warning: only 1 config server reachable, continuing
Fri Apr 22 18:19:24 SyncClusterConnection connecting to [localhost:28000]
Fri Apr 22 18:19:24 SyncClusterConnection connecting to [localhost:28001]
Fri Apr 22 18:19:24 SyncClusterConnection connect fail to: localhost:28001 errmsg: couldn't connect to server localhost:28001
Fri Apr 22 18:19:24 SyncClusterConnection connecting to [localhost:28002]
Fri Apr 22 18:19:24 SyncClusterConnection connect fail to: localhost:28002 errmsg: couldn't connect to server localhost:28002
Fri Apr 22 18:19:24 trying reconnect to localhost:28001
Fri Apr 22 18:19:24 reconnect localhost:28001 failed couldn't connect to server localhost:28001
Fri Apr 22 18:19:24 trying reconnect to localhost:28002
Fri Apr 22 18:19:24 reconnect localhost:28002 failed couldn't connect to server localhost:28002
Fri Apr 22 18:19:24 ~ScopedDbConnection: _conn != null
uncaught exception in mongos main:
8003 SyncClusterConnection::insert prepare failed: 9001 socket exception [6] localhost:28001:{}9001 socket exception [6] localhost:28002:{}

          • SERVER RESTARTED *****

Fri Apr 22 18:19:28 ./mongos db version v1.9.0-pre-, pdfile version 4.5 starting (--help for usage)
Fri Apr 22 18:19:28 git version: df30db031cd1531705f23fa6ed75a88e2116568f
Fri Apr 22 18:19:28 build sys info: Darwin Richard-Kreuters-MacBook-Pro.localdomain 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386 BOOST_LIB_VERSION=1_41
Fri Apr 22 18:19:28 warning: couldn't check on config server:localhost:28002 ok for now : 11002 socket exception [6] server [localhost:28002] mongos connectionpool error: couldn't connect to server localhost:28002
Fri Apr 22 18:19:28 SyncClusterConnection connecting to [localhost:28000]
Fri Apr 22 18:19:28 SyncClusterConnection connecting to [localhost:28001]
Fri Apr 22 18:19:28 SyncClusterConnection connecting to [localhost:28002]
Fri Apr 22 18:19:28 SyncClusterConnection connect fail to: localhost:28002 errmsg: couldn't connect to server localhost:28002
Fri Apr 22 18:19:28 trying reconnect to localhost:28002
Fri Apr 22 18:19:28 reconnect localhost:28002 failed couldn't connect to server localhost:28002
Fri Apr 22 18:19:28 ~ScopedDbConnection: _conn != null
uncaught exception in mongos main:
8003 SyncClusterConnection::insert prepare failed: 9001 socket exception [6] localhost:28002:{}

          • SERVER RESTARTED *****

Fri Apr 22 18:19:32 ./mongos db version v1.9.0-pre-, pdfile version 4.5 starting (--help for usage)
Fri Apr 22 18:19:32 git version: df30db031cd1531705f23fa6ed75a88e2116568f
Fri Apr 22 18:19:32 build sys info: Darwin Richard-Kreuters-MacBook-Pro.localdomain 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386 BOOST_LIB_VERSION=1_41
Fri Apr 22 18:19:32 SyncClusterConnection connecting to [localhost:28000]
Fri Apr 22 18:19:32 SyncClusterConnection connecting to [localhost:28001]
Fri Apr 22 18:19:32 SyncClusterConnection connecting to [localhost:28002]
Fri Apr 22 18:19:32 [Balancer] about to contact config servers and shards
Fri Apr 22 18:19:32 [mongosMain] waiting for connections on port 27000
Fri Apr 22 18:19:32 [websvr] web admin interface listening on port 28000
Fri Apr 22 18:19:32 [websvr] listen(): bind() failed errno:48 Address already in use for socket: 0.0.0.0:28000
Fri Apr 22 18:19:32 [websvr] addr already in use
Fri Apr 22 18:19:32 [Balancer] config servers and shards contacted successfully
Fri Apr 22 18:19:32 [Balancer] balancer id: Richard-Kreuters-MacBook-Pro.localdomain:27000 started at Apr 22 18:19:32
Fri Apr 22 18:19:32 [Balancer] created new distributed lock for balancer on localhost:28000,localhost:28001,localhost:28002 ( lock timeout : 900000, legacy timeout : 0, ping interval : 30000, process : 0, legacy : 0 )
Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28000]
Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28001]
Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28002]
Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28000]
Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28001]
Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28002]
Fri Apr 22 18:19:32 [LockPinger] creating distributed lock ping thread for localhost:28000,localhost:28001,localhost:28002 and process Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807 (sleeping for 30000ms)
Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28000]
Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28001]
Fri Apr 22 18:19:32 [Balancer] SyncClusterConnection connecting to [localhost:28002]
Fri Apr 22 18:19:32 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' acquired, now :

{ _id: "balancer", process: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807", state: 1, ts: ObjectId('4db1fef4c266c1ba533df144'), when: new Date(1303510772809), who: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807:Balanc...", why: "doing balance round" }

Fri Apr 22 18:19:42 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' unlocked.
Fri Apr 22 18:19:42 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' acquired, now :

{ _id: "balancer", process: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807", state: 1, ts: ObjectId('4db1fefec266c1ba533df145'), when: new Date(1303510782935), who: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807:Balanc...", why: "doing balance round" }

Fri Apr 22 18:19:53 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' unlocked.
Fri Apr 22 18:19:53 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' acquired, now :

{ _id: "balancer", process: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807", state: 1, ts: ObjectId('4db1ff09c266c1ba533df146'), when: new Date(1303510793041), who: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807:Balanc...", why: "doing balance round" }

Fri Apr 22 18:20:03 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' unlocked.
Fri Apr 22 18:20:03 [Balancer] distributed lock 'balancer/Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807' acquired, now :

{ _id: "balancer", process: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807", state: 1, ts: ObjectId('4db1ff13c266c1ba533df147'), when: new Date(1303510803142), who: "Richard-Kreuters-MacBook-Pro.localdomain:27000:1303510772:16807:Balanc...", why: "doing balance round" }

 Comments   
Comment by auto [ 18/Apr/12 ]

Author:

{u'login': u'gregstuder', u'name': u'gregs', u'email': u'greg@10gen.com'}

Message: SERVER-2988 catch exceptions loading chunksize

Conflicts:

s/config.cpp
Branch: v2.0
https://github.com/mongodb/mongo/commit/ba8d0f572c9108c0c2f279a393a854c59d475008

Comment by auto [ 18/Apr/12 ]

Author:

{u'login': u'gregstuder', u'name': u'Greg Studer', u'email': u'greg@10gen.com'}

Message: SERVER-2988 give error message when initial version can't be written on mongos startup
Branch: v2.0
https://github.com/mongodb/mongo/commit/e69445666bd8acfb1fac1e826e610fac845019a7

Comment by Greg Studer [ 29/Mar/12 ]

Fixed, but additional linked issue may remain.

Comment by auto [ 29/Mar/12 ]

Author:

{u'login': u'gregstuder', u'name': u'gregs', u'email': u'greg@10gen.com'}

Message: SERVER-2988 catch exceptions loading chunksize
Branch: master
https://github.com/mongodb/mongo/commit/8784441ef2a234025a57213335fbd5c1fea64564

Comment by auto [ 29/Mar/12 ]

Author:

{u'login': u'gregstuder', u'name': u'Greg Studer', u'email': u'greg@10gen.com'}

Message: SERVER-2988 give error message when initial version can't be written on mongos startup
Branch: master
https://github.com/mongodb/mongo/commit/3693972015a4e61efa8543eb7d532b9bbf4d44de

Comment by Greg Studer [ 07/Mar/12 ]

Additional try/catch needed in reloadSettings when trying to default-regenerate config.settings chunksize, otherwise aborts mongos startup.

Comment by Greg Studer [ 07/Mar/12 ]

Patch above should fix this issue - but related issue SERVER-5240 is not yet reproduced.

Comment by Greg Studer [ 07/Mar/12 ]

http://codereview.10gen.com/7490071/

Comment by Greg Studer [ 07/Mar/12 ]

So this is actually 2 separate issues it appears - the very first time you initialize a cluster, you have to have all three config servers writable to create the config.version collection, which is what the above test reproduces. Opened DOCS-153 to add this to the docs, improved messaging should be added as well.

Generated at Thu Feb 08 03:01:45 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.