[SERVER-21215] Shards should retry contacting config servers indefinitely Created: 30/Oct/15  Updated: 21/Jul/17  Resolved: 18/Nov/15

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.0-rc1
Fix Version/s: 3.2.0-rc4

Type: Bug Priority: Major - P3
Reporter: John Page Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Documented
is documented by DOCS-8954 Shards should retry contacting config... Closed
Backwards Compatibility: Minor Change
Operating System: ALL
Sprint: Sharding C (11/20/15)
Participants:

 Description   

Root cause found - see comment 4

For testing I set up a 3.2-rc1 cluster with 32 shards. WT, Zlib - EVERYTHING on LOCALHOST

I used a SINGLE , non replica set config server which rc1 allows me to do ( I raised a previous ticket that it should)

I was able to successfully load 800GB of sharded data using a Java based loader.

After restarting the cluster I get.

MongoDB Enterprise mongos> show dbs
2015-10-30T10:30:24.140+0000 E QUERY    [thread1] Error: listDatabases failed:{ "code" : 6, "ok" : 0, "errmsg" : "Connection refused" } :
_getErrorWithCode@src/mongo/shell/utils.js:23:13
Mongo.prototype.getDBs@src/mongo/shell/mongo.js:53:1
shellHelper.show@src/mongo/shell/utils.js:697:19
shellHelper@src/mongo/shell/utils.js:591:15
@(shellhelp2):1:1
 
MongoDB Enterprise mongos> 

sh.status appears to work OK

Show collections gives

MongoDB Enterprise mongos> show collections
2015-10-30T10:31:16.473+0000 E QUERY    [thread1] Error: listCollections failed: {
	"code" : 13328,
	"ok" : 0,
	"errmsg" : "connection pool: connect failed localhost:27102 : couldn't connect to server localhost:27102, connection attempt failed"
} :
_getErrorWithCode@src/mongo/shell/utils.js:23:13
DB.prototype._getCollectionInfosCommand@src/mongo/shell/db.js:746:1
DB.prototype.getCollectionInfos@src/mongo/shell/db.js:758:15
DB.prototype.getCollectionNames@src/mongo/shell/db.js:769:12
shellHelper.show@src/mongo/shell/utils.js:692:9
shellHelper@src/mongo/shell/utils.js:591:15
@(shellhelp2):1:1

All mongod logs show

2015-10-30T10:23:04.744+0000 I CONTROL  [initandlisten] options: { net: { port: 27101 }, processManagement: { fork: true }, storage: { dbPath: "/data/shard1", journal: { enabled: false }, wiredTiger: { collectionConfig: { blockCompressor: "zlib" }, engineConfig: { cacheSizeGB: 1 } } }, systemLog: { destination: "file", path: "/data/log1.log" } }
2015-10-30T10:23:04.746+0000 I FTDC     [initandlisten] Starting full-time diagnostic data capture with directory '/data/shard1/diagnostic.data'
2015-10-30T10:23:04.746+0000 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2015-10-30T10:23:04.746+0000 I SHARDING [initandlisten] Sharding state recovery process found document { _id: "minOpTimeRecovery", configsvrConnectionString: "localhost:27019", shardName: "shard0000", minOpTime: { ts: Timestamp 0|0, t: -1 }, minOpTimeUpdaters: 0 }
2015-10-30T10:23:04.746+0000 I SHARDING [initandlisten] first cluster operation detected, adding sharding hook to enable versioning and authentication to remote servers
2015-10-30T10:23:04.747+0000 I SHARDING [initandlisten] Updating config server connection string to: localhost:27019
2015-10-30T10:23:04.748+0000 W NETWORK  [initandlisten] Failed to connect to 127.0.0.1:27019, reason: errno:111 Connection refused
2015-10-30T10:23:04.748+0000 I STORAGE  [initandlisten] exception in initAndListen: 13328 connection pool: connect failed localhost:27019 : couldn't connect to server localhost:27019, connection attempt failed, terminating
2015-10-30T10:23:04.748+0000 I FTDC     [initandlisten] Stopping full-time diagnostic data capture

I can connect to the config server with mongo on localhost:27019



 Comments   
Comment by Andy Schwerin [ 20/Nov/15 ]

Documentation should note that in 3.2, mongos during startup will continue trying to contact config servers until one becomes available, instead of shutting down if it cannot find a config server to talk to.

Comment by Githook User [ 18/Nov/15 ]

Author:

{u'username': u'stbrody', u'name': u'Spencer T Brody', u'email': u'spencer@mongodb.com'}

Message: SERVER-21215 If the config servers are all down, retry sharding state initialization until they come back up
Branch: master
https://github.com/mongodb/mongo/commit/6ff8f1cfffec49ccccb3d505c1e9b804058f124f

Comment by Ramon Fernandez Marina [ 30/Oct/15 ]

Discussed live with Andy and Kal, who explained there are three bugs involved here. Since SERVER-21222 and SERVER-21223 are opened for two of those bugs I'm adjusting this ticket to take care of the infinite retry.

Comment by John Page [ 30/Oct/15 ]

Also mongod doesn't quit - just sits there running but refusing connections which would then not show up in any startup scripts.

Comment by John Page [ 30/Oct/15 ]

SOLVED - ISH

It appears that in 3.2 if you have a mongod which is a shard - then you cannot start it unless the config servers are up and running.

Previous to 3.2 you could startup the mongod servers first, then the config server then the mongos. You can still do this when first creating the cluster.

This is a significant behaviour change which could cause custoemr clusters that are not always on to fail to start - or a cluster to fail to start after a full reboot.

Therefore I'm still going to claim it's a bug.

Comment by John Page [ 30/Oct/15 ]

Full log

[ec2-user@ip-172-31-15-135 data]$ more /data/log2.log 
2015-10-30T10:23:04.789+0000 I -        [initandlisten] Detected data files in /data/shard2 created by the 'wiredTiger' storage engine, so setting the active storage engine to
 'wiredTiger'.
2015-10-30T10:23:04.790+0000 W -        [initandlisten] Detected unclean shutdown - /data/shard2/mongod.lock is not empty.
2015-10-30T10:23:04.790+0000 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2015-10-30T10:23:04.790+0000 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=1G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(f
ast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),,log=(enabl
ed=false),
2015-10-30T10:23:04.842+0000 I CONTROL  [initandlisten] MongoDB starting : pid=38072 port=27102 dbpath=/data/shard2 64-bit host=ip-172-31-15-135
2015-10-30T10:23:04.842+0000 I CONTROL  [initandlisten] 
2015-10-30T10:23:04.842+0000 I CONTROL  [initandlisten] ** WARNING: You are running on a NUMA machine.
2015-10-30T10:23:04.842+0000 I CONTROL  [initandlisten] **          We suggest launching mongod like this to avoid performance problems:
2015-10-30T10:23:04.842+0000 I CONTROL  [initandlisten] **              numactl --interleave=all mongod [other options]
2015-10-30T10:23:04.842+0000 I CONTROL  [initandlisten] 
2015-10-30T10:23:04.842+0000 I CONTROL  [initandlisten] db version v3.2.0-rc1
2015-10-30T10:23:04.842+0000 I CONTROL  [initandlisten] git version: beabb900fa05c3b090fc62e887d41d9c43562b3f
2015-10-30T10:23:04.842+0000 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.0-fips 29 Mar 2010
2015-10-30T10:23:04.842+0000 I CONTROL  [initandlisten] allocator: tcmalloc
2015-10-30T10:23:04.842+0000 I CONTROL  [initandlisten] modules: enterprise 
2015-10-30T10:23:04.842+0000 I CONTROL  [initandlisten] build environment:
2015-10-30T10:23:04.842+0000 I CONTROL  [initandlisten]     distmod: amzn64
2015-10-30T10:23:04.842+0000 I CONTROL  [initandlisten]     distarch: x86_64
2015-10-30T10:23:04.842+0000 I CONTROL  [initandlisten]     target_arch: x86_64
2015-10-30T10:23:04.842+0000 I CONTROL  [initandlisten] options: { net: { port: 27102 }, processManagement: { fork: true }, storage: { dbPath: "/data/shard2", journal: { enabl
ed: false }, wiredTiger: { collectionConfig: { blockCompressor: "zlib" }, engineConfig: { cacheSizeGB: 1 } } }, systemLog: { destination: "file", path: "/data/log2.log" } }
2015-10-30T10:23:04.845+0000 I FTDC     [initandlisten] Starting full-time diagnostic data capture with directory '/data/shard2/diagnostic.data'
2015-10-30T10:23:04.845+0000 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2015-10-30T10:23:04.845+0000 I SHARDING [initandlisten] Sharding state recovery process found document { _id: "minOpTimeRecovery", configsvrConnectionString: "localhost:27019"
, shardName: "shard0001", minOpTime: { ts: Timestamp 0|0, t: -1 }, minOpTimeUpdaters: 0 }
2015-10-30T10:23:04.845+0000 I SHARDING [initandlisten] first cluster operation detected, adding sharding hook to enable versioning and authentication to remote servers
2015-10-30T10:23:04.845+0000 I SHARDING [initandlisten] Updating config server connection string to: localhost:27019
2015-10-30T10:23:04.846+0000 W NETWORK  [initandlisten] Failed to connect to 127.0.0.1:27019, reason: errno:111 Connection refused
2015-10-30T10:23:04.846+0000 I STORAGE  [initandlisten] exception in initAndListen: 13328 connection pool: connect failed localhost:27019 : couldn't connect to server localhos
t:27019, connection attempt failed, terminating
2015-10-30T10:23:04.846+0000 I FTDC     [initandlisten] Stopping full-time diagnostic data capture

Comment by John Page [ 30/Oct/15 ]

MongoD is running on 27102

[ec2-user@ip-172-31-15-135 data]$ ps axuww | grep mongod | grep 27102
ec2-user  38072  0.0  0.0 392040 48132 ?        Sl   10:23   0:00 mongod --wiredTigerCollectionBlockCompressor=zlib --dbpath=/data/shard2 --port=27102 --nojournal --logpath=/data/log2.log --wiredTigerCacheSizeGB=1 --fork

mongo shell CANNOT connect to it though

Generated at Thu Feb 08 03:56:40 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.