[SERVER-23862] mongos looses /fails to get updated list from config server repeatedly Created: 21/Apr/16  Updated: 22/Apr/16  Resolved: 22/Apr/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.4
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: pavan Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

On AWS mongos seems to fail connecting to mongo config server after shard servers are added.
The setup is a test environment in aws

Config Server and the Mongos are running on two different machines in the same network ( different subnets )

Config Server # rsconf1/172.31..26.252 --> only 1 node in the replica set
Mongs service # 172.31.24.126 --> pointing to config server above

Shard1 # rs1/172.31.43.146
Shard2 # rs2/172.31.43.147
Shard # rs3/172.31.45.148

Only Shard1 add via mongos.

Mongs successfully connects to config server and then immediately fail to retrieve the list of shards

2016-04-21T21:57:05.979+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to 172.31.26.252:27017
2016-04-21T21:57:06.188+0000 I SHARDING [Balancer] caught exception while doing balance: could not get updated shard list from config server due to Operation timed out
2016-04-21T21:57:06.189+0000 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "ip-172-31-24-126-2016-04-21T21:57:06.189+0000-57194cb2534cef5db40b623e", server: "ip-172-31-24-126", clientAddr: "", time: new Date(1461275826189), what: "balancer.round", ns: "", details: { executionTimeMillis: 30003, errorOccured: true, errmsg: "could not get updated shard list from config server due to Operation timed out" } }

---------------------------- $$$$$$$$$$$$$$$$$$$$$$$$ -------------------------

the operation times out on the config server , the same query passes when run directly connected to mongo config server ..

2016-04-21T21:59:08.690+0000 I COMMAND  [conn29] command config.$cmd command: find { find: "shards", readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1461274573000|2, t: 5 } }, maxTimeMS: 30000 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:92 locks:{} protocol:op_command 30401ms
2016-04-21T21:59:08.690+0000 I NETWORK  [conn29] end connection 172.31.24.126:44130 (4 connections now open)
2016-04-21T21:59:38.084+0000 I NETWORK  [initandlisten] connection accepted from 172.31.24.126:44133 #32 (5 connections now open)

----------------------------------- $$$$$$$$$$$$$$$$ -----------------------------

Note: mongo config server is running on an EBS backed storage. Even then it should take 30 seconds to return the sharded list, the same query when run on the terminal locally connected returns in fraction of a second.



 Comments   
Comment by Ramon Fernandez Marina [ 22/Apr/16 ]

ppeddada, please note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag, where your question will reach a larger audience. A question like this involving more discussion would be best posted on the mongodb-user group.

You should have three nodes in your config server replica set; if you run into a bug in such configuration please let us know in this ticket or feel free to open a new one.

See also our Technical Support page for additional support resources.

Regards,
Ramón.

Comment by pavan [ 22/Apr/16 ]

Where is mongos service getting the stored value from ?, it seems the last time the stored one is different from what has been supplied at mongos start

Comment by pavan [ 22/Apr/16 ]

For development and tests we are not running a full mongo config replica set, it seems mongos is doing a read concern of 'Majority', didn't see anywhere it can be set to 'Primary' - looks like this is more of a driver level or mongo shell level preference. So instead of running a replica set with just one node, tried it with no replicas - simple standalone one node mongo config server, the server starts fine, while connecting from mongos it first warns of connecting to a non replica config server and continues to throw error saying ...

---------------------------------------------------------------------------------

root@ip-172-31-24-126:/home/deploy# mongos --configdb '172.31.5.248:27017'
2016-04-22T00:08:56.603+0000 W SHARDING [main] Running a sharded cluster with fewer than 3 config servers should only be done for testing purposes and is not recommended for production.
2016-04-22T00:08:56.607+0000 I CONTROL  [main] ** WARNING: You are running this process as the root user, which is not recommended.
2016-04-22T00:08:56.607+0000 I CONTROL  [main] 
2016-04-22T00:08:56.607+0000 I SHARDING [mongosMain] MongoS version 3.2.4 starting: pid=2481 port=27017 64-bit host=ip-172-31-24-126 (--help for usage)
2016-04-22T00:08:56.607+0000 I CONTROL  [mongosMain] db version v3.2.4
2016-04-22T00:08:56.607+0000 I CONTROL  [mongosMain] git version: e2ee9ffcf9f5a94fad76802e28cc978718bb7a30
2016-04-22T00:08:56.607+0000 I CONTROL  [mongosMain] OpenSSL version: OpenSSL 1.0.1f 6 Jan 2014
2016-04-22T00:08:56.607+0000 I CONTROL  [mongosMain] allocator: tcmalloc
2016-04-22T00:08:56.607+0000 I CONTROL  [mongosMain] modules: none
2016-04-22T00:08:56.607+0000 I CONTROL  [mongosMain] build environment:
2016-04-22T00:08:56.607+0000 I CONTROL  [mongosMain]     distmod: ubuntu1404
2016-04-22T00:08:56.607+0000 I CONTROL  [mongosMain]     distarch: x86_64
2016-04-22T00:08:56.607+0000 I CONTROL  [mongosMain]     target_arch: x86_64
2016-04-22T00:08:56.607+0000 I CONTROL  [mongosMain] options: { sharding: { configDB: "172.31.5.248:27017" } }
2016-04-22T00:08:56.607+0000 I SHARDING [mongosMain] Updating config server connection string to: 172.31.5.248:27017
2016-04-22T00:08:56.612+0000 I SHARDING [LockPinger] creating distributed lock ping thread for 172.31.5.248:27017 and process ip-172-31-24-126:27017:1461283736:1648390757 (sleeping for 30000ms)
2016-04-22T00:08:56.614+0000 I NETWORK  [mongosMain] Starting new replica set monitor for rs1/172.31.43.146:27017
2016-04-22T00:08:56.614+0000 I NETWORK  [ReplicaSetMonitorWatcher] starting
2016-04-22T00:08:56.616+0000 I SHARDING [LockPinger] cluster 172.31.5.248:27017 pinged successfully at 2016-04-22T00:08:56.614+0000 by distributed lock pinger '172.31.5.248:27017/ip-172-31-24-126:27017:1461283736:1648390757', sleeping for 30000ms
2016-04-22T00:08:56.632+0000 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2016-04-22T00:08:56.632+0000 I SHARDING [Balancer] about to contact config servers and shards
2016-04-22T00:08:56.639+0000 W SHARDING [Balancer] could not initialize balancer, please check that all shards and config servers are up: mongos specified a different config database string : stored : rsconf1/172.31.26.252:27017 vs given : 172.31.5.248:27017
2016-04-22T00:08:56.640+0000 I SHARDING [Balancer] will retry to initialize balancer in one minute
2016-04-22T00:08:56.655+0000 I NETWORK  [mongosMain] waiting for connections on port 27017
2016-04-22T00:08:59.310+0000 I NETWORK  [mongosMain] connection accepted from 172.31.34.108:48383 #1 (1 connection now open)
2016-04-22T00:09:26.618+0000 I SHARDING [LockPinger] cluster 172.31.5.248:27017 pinged successfully at 2016-04-22T00:09:26.616+0000 by distributed lock pinger '172.31.5.248:27017/ip-172-31-24-126:27017:1461283736:1648390757', sleeping for 30000ms
2016-04-22T00:09:56.621+0000 I SHARDING [LockPinger] cluster 172.31.5.248:27017 pinged successfully at 2016-04-22T00:09:56.618+0000 by distributed lock pinger '172.31.5.248:27017/ip-172-31-24-126:27017:1461283736:1648390757', sleeping for 30000ms
2016-04-22T00:09:56.640+0000 I SHARDING [Balancer] about to contact config servers and shards
2016-04-22T00:09:56.644+0000 W SHARDING [Balancer] could not initialize balancer, please check that all shards and config servers are up: mongos specified a different config database string : stored : rsconf1/172.31.26.252:27017 vs given : 172.31.5.248:27017
2016-04-22T00:09:56.644+0000 I SHARDING [Balancer] will retry to initialize balancer in one minute 

Comment by pavan [ 21/Apr/16 ]

local retrieval of shard info on config server – no timeout or whatsoever - so where is the timeout coming from when the same request is being queried from mongos service ...
-------------------------------------------------------------------------

root@ip-172-31-26-252:/var/log/mongodb/log# mongo --host localhost:27017
MongoDB shell version: 3.2.4
connecting to: localhost:27017/test
Server has startup warnings: 
2016-04-21T22:07:06.111+0000 I CONTROL  [initandlisten] 
2016-04-21T22:07:06.111+0000 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2016-04-21T22:07:06.111+0000 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2016-04-21T22:07:06.111+0000 I CONTROL  [initandlisten] 
2016-04-21T22:07:06.111+0000 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2016-04-21T22:07:06.111+0000 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2016-04-21T22:07:06.111+0000 I CONTROL  [initandlisten] 
rsconf1:PRIMARY> use config
switched to db config
rsconf1:PRIMARY> db.shards.find({})
{ "_id" : "rs1", "host" : "rs1/172.31.43.146:27017" }
rsconf1:PRIMARY> 

Comment by pavan [ 21/Apr/16 ]

As soon as the first shard is added, Balancer catches an exception

Mongo Config server can see the shard server

curl 172.31.45.148:27017
It looks like you are trying to access MongoDB over HTTP on the native driver port.

Mongos Service can see the config server as evident from the last line below, but still mongos service fails to retrieve shard list / mongo config server has a corresponding timeout.

------------------------------------------------------------------------------------------------

2016-04-21T22:11:46.286+0000 I NETWORK  [conn2] Starting new replica set monitor for rs1/172.31.43.146:27017
2016-04-21T22:11:46.293+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-TaskExecutor-0] Successfully connected to 172.31.43.146:27017
2016-04-21T22:11:46.297+0000 I SHARDING [conn2] going to add shard: { _id: "rs1", host: "rs1/172.31.43.146:27017" }
2016-04-21T22:11:46.307+0000 I SHARDING [conn2] about to log metadata event into changelog: { _id: "ip-172-31-24-126-2016-04-21T22:11:46.307+0000-571950225c279de1396f88ac", server: "ip-172-31-24-126", clientAddr: "172.31.34.108:48069", time: new Date(1461276706307), what: "addShard", ns: "", details: { name: "rs1", host: "rs1/172.31.43.146:27017" } }
2016-04-21T22:11:53.835+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to 172.31.43.146:27017
2016-04-21T22:11:53.838+0000 I SHARDING [Balancer] distributed lock 'balancer' acquired for 'doing balance round', ts : 571950295c279de1396f88ad
2016-04-21T22:12:23.838+0000 W SHARDING [Balancer] Failed to retrieve the set of collections during balancing round ExceededTimeLimit: Operation timed out
2016-04-21T22:12:23.838+0000 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "ip-172-31-24-126-2016-04-21T22:12:23.838+0000-571950475c279de1396f88ae", server: "ip-172-31-24-126", clientAddr: "", time: new Date(1461276743838), what: "balancer.round", ns: "", details: { executionTimeMillis: 30011, errorOccured: false, candidateChunks: 0, chunksMoved: 0 } }
2016-04-21T22:12:23.842+0000 I SHARDING [Balancer] distributed lock with ts: 571950295c279de1396f88ad' unlocked.
2016-04-21T22:13:03.456+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to 172.31.26.252:27017
2016-04-21T22:13:03.846+0000 I SHARDING [Balancer] caught exception while doing balance: could not get updated shard list from config server due to Operation timed out
2016-04-21T22:13:03.846+0000 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "ip-172-31-24-126-2016-04-21T22:13:03.846+0000-5719506f5c279de1396f88af", server: "ip-172-31-24-126", clientAddr: "", time: new Date(1461276783846), what: "balancer.round", ns: "", details: { executionTimeMillis: 30002, errorOccured: true, errmsg: "could not get updated shard list from config server due to Operation timed out" } }
2016-04-21T22:13:33.459+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to 172.31.26.252:27017
2016-04-21T22:13:43.850+0000 I SHARDING [Balancer] caught exception while doing balance: could not get updated shard list from config server due to Operation timed out
2016-04-21T22:13:43.851+0000 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "ip-172-31-24-126-2016-04-21T22:13:43.851+0000-571950975c279de1396f88b0", server: "ip-172-31-24-126", clientAddr: "", time: new Date(1461276823851), what: "balancer.round", ns: "", details: { executionTimeMillis: 30002, errorOccured: true, errmsg: "could not get updated shard list from config server due to Operation timed out" } }
2016-04-21T22:14:03.462+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to 172.31.26.252:27017

Comment by pavan [ 21/Apr/16 ]

with no shards the balancer has no exceptions,i.e. is able to successfully connect to config

-------------------------------------------------------------------

root@ip-172-31-24-126:/home/deploy# mongos --configdb 'rsconf1/172.31.26.252:27017'
2016-04-21T22:08:03.407+0000 W SHARDING [main] Running a sharded cluster with fewer than 3 config servers should only be done for testing purposes and is not recommended for production.
2016-04-21T22:08:03.411+0000 I CONTROL  [main] ** WARNING: You are running this process as the root user, which is not recommended.
2016-04-21T22:08:03.411+0000 I CONTROL  [main] 
2016-04-21T22:08:03.411+0000 I SHARDING [mongosMain] MongoS version 3.2.4 starting: pid=2126 port=27017 64-bit host=ip-172-31-24-126 (--help for usage)
2016-04-21T22:08:03.411+0000 I CONTROL  [mongosMain] db version v3.2.4
2016-04-21T22:08:03.411+0000 I CONTROL  [mongosMain] git version: e2ee9ffcf9f5a94fad76802e28cc978718bb7a30
2016-04-21T22:08:03.411+0000 I CONTROL  [mongosMain] OpenSSL version: OpenSSL 1.0.1f 6 Jan 2014
2016-04-21T22:08:03.411+0000 I CONTROL  [mongosMain] allocator: tcmalloc
2016-04-21T22:08:03.411+0000 I CONTROL  [mongosMain] modules: none
2016-04-21T22:08:03.411+0000 I CONTROL  [mongosMain] build environment:
2016-04-21T22:08:03.411+0000 I CONTROL  [mongosMain]     distmod: ubuntu1404
2016-04-21T22:08:03.411+0000 I CONTROL  [mongosMain]     distarch: x86_64
2016-04-21T22:08:03.411+0000 I CONTROL  [mongosMain]     target_arch: x86_64
2016-04-21T22:08:03.411+0000 I CONTROL  [mongosMain] options: { sharding: { configDB: "rsconf1/172.31.26.252:27017" } }
2016-04-21T22:08:03.411+0000 I SHARDING [mongosMain] Updating config server connection string to: rsconf1/172.31.26.252:27017
2016-04-21T22:08:03.411+0000 I NETWORK  [mongosMain] Starting new replica set monitor for rsconf1/172.31.26.252:27017
2016-04-21T22:08:03.411+0000 I NETWORK  [ReplicaSetMonitorWatcher] starting
2016-04-21T22:08:03.413+0000 I SHARDING [thread1] creating distributed lock ping thread for process ip-172-31-24-126:27017:1461276483:75212657 (sleeping for 30000ms)
2016-04-21T22:08:03.416+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to 172.31.26.252:27017
2016-04-21T22:08:03.416+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to 172.31.26.252:27017
2016-04-21T22:08:03.427+0000 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: LockStateChangeFailed: findAndModify query predicate didn't match any lock document
2016-04-21T22:08:03.553+0000 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2016-04-21T22:08:03.553+0000 I SHARDING [Balancer] about to contact config servers and shards
2016-04-21T22:08:03.555+0000 I SHARDING [Balancer] config servers and shards contacted successfully
2016-04-21T22:08:03.555+0000 I SHARDING [Balancer] balancer id: ip-172-31-24-126:27017 started
2016-04-21T22:08:03.575+0000 I NETWORK  [mongosMain] waiting for connections on port 27017
2016-04-21T22:08:03.581+0000 I SHARDING [Balancer] distributed lock 'balancer' acquired for 'doing balance round', ts : 57194f435c279de1396f887e
2016-04-21T22:08:03.590+0000 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "ip-172-31-24-126-2016-04-21T22:08:03.590+0000-57194f435c279de1396f887f", server: "ip-172-31-24-126", clientAddr: "", time: new Date(1461276483590), what: "balancer.round", ns: "", details: { executionTimeMillis: 26, errorOccured: false, candidateChunks: 0, chunksMoved: 0 } }
2016-04-21T22:08:03.593+0000 I SHARDING [Balancer] distributed lock with ts: 57194f435c279de1396f887e' unlocked.
2016-04-21T22:08:05.892+0000 I NETWORK  [mongosMain] connection accepted from 172.31.34.108:48068 #1 (1 connection now open)
2016-04-21T22:08:13.605+0000 I SHARDING [Balancer] distributed lock 'balancer' acquired for 'doing balance round', ts : 57194f4d5c279de1396f8880
2016-04-21T22:08:13.606+0000 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "ip-172-31-24-126-2016-04-21T22:08:13.606+0000-57194f4d5c279de1396f8881", server: "ip-172-31-24-126", clientAddr: "", time: new Date(1461276493606), what: "balancer.round", ns: "", details: { executionTimeMillis: 5, errorOccured: false, candidateChunks: 0, chunksMoved: 0 } }
2016-04-21T22:08:13.608+0000 I SHARDING [Balancer] distributed lock with ts: 57194f4d5c279de1396f8880' unlocked.
2016-04-21T22:08:23.615+0000 I SHARDING [Balancer] distributed lock 'balancer' acquired for 'doing balance round', ts : 57194f575c279de1396f8882
2016-04-21T22:08:23.616+0000 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "ip-172-31-24-126-2016-04-21T22:08:23.616+0000-57194f575c279de1396f8883", server: "ip-172-31-24-126", clientAddr: "", time: new Date(1461276503616), what: "balancer.round", ns: "", details: { executionTimeMillis: 6, errorOccured: false, candidateChunks: 0, chunksMoved: 0 } }
2016-04-21T22:08:23.618+0000 I SHARDING [Balancer] distributed lock with ts: 57194f575c279de1396f8882' unlocked.
2016-04-21T22:08:33.624+0000 I SHARDING [Balancer] distributed lock 'balancer' acquired for 'doing balance round', ts : 57194f615c279de1396f8884
2016-04-21T22:08:33.624+0000 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "ip-172-31-24-126-2016-04-21T22:08:33.624+0000-57194f615c279de1396f8885", server: "ip-172-31-24-126", clientAddr: "", time: new Date(1461276513624), what: "balancer.round", ns: "", details: { executionTimeMillis: 5, errorOccured: false, candidateChunks: 0, chunksMoved: 0 } }
2016-04-21T22:08:33.627+0000 I SHARDING [Balancer] distributed lock with ts: 57194f615c279de1396f8884' unlocked.
2016-04-21T22:08:43.633+0000 I SHARDING [Balancer] distributed lock 'balancer' acquired for 'doing balance round', ts : 57194f6b5c279de1396f8886
2016-04-21T22:08:43.634+0000 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "ip-172-31-24-126-2016-04-21T22:08:43.634+0000-57194f6b5c279de1396f8887", server: "ip-172-31-24-126", clientAddr: "", time: new Date(1461276523634), what: "balancer.round", ns: "", details: { executionTimeMillis: 4, errorOccured: false, candidateChunks: 0, chunksMoved: 0 } }
2016-04-21T22:08:43.636+0000 I SHARDING [Balancer] distributed lock with ts: 57194f6b5c279de1396f8886' unlocked.
2016-04-21T22:08:53.642+0000 I SHARDING [Balancer] distributed lock 'balancer' acquired for 'doing balance round', ts : 57194f755c279de1396f8888
2016-04-21T22:08:53.643+0000 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "ip-172-31-24-126-2016-04-21T22:08:53.643+0000-57194f755c279de1396f8889", server: "ip-172-31-24-126", clientAddr: "", time: new Date(1461276533643), what: "balancer.round", ns: "", details: { executionTimeMillis: 4, errorOccured: false, candidateChunks: 0, chunksMoved: 0 } }
2016-04-21T22:08:53.645+0000 I SHARDING [Balancer] distributed lock with ts: 57194f755c279de1396f8888' unlocked.
2016-04-21T22:09:03.650+0000 I SHARDING [Balancer] distributed lock 'balancer' acquired for 'doing balance round', ts : 57194f7f5c279de1396f888a
2016-04-21T22:09:03.651+0000 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "ip-172-31-24-126-2016-04-21T22:09:03.651+0000-57194f7f5c279de1396f888b", server: "ip-172-31-24-126", clientAddr: "", time: new Date(1461276543651), what: "balancer.round", ns: "", details: { executionTimeMillis: 4, errorOccured: false, candidateChunks: 0, chunksMoved: 0 } }
2016-04-21T22:09:03.653+0000 I SHARDING [Balancer] distributed lock with ts: 57194f7f5c279de1396f888a' unlocked.
2016-04-21T22:09:13.658+0000 I SHARDING [Balancer] distributed lock 'balancer' acquired for 'doing balance round', ts : 57194f895c279de1396f888c
2016-04-21T22:09:13.659+0000 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "ip-172-31-24-126-2016-04-21T22:09:13.659+0000-57194f895c279de1396f888d", server: "ip-172-31-24-126", clientAddr: "", time: new Date(1461276553659), what: "balancer.round", ns: "", details: { executionTimeMillis: 5, errorOccured: false, candidateChunks: 0, chunksMoved: 0 } }
2016-04-21T22:09:13.661+0000 I SHARDING [Balancer] distributed lock with ts: 57194f895c279de1396f888c' unlocked.

Generated at Thu Feb 08 04:04:41 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.