-
Type:
Bug
-
Resolution: Done
-
Priority:
Major - P3
-
None
-
Affects Version/s: 3.4.17
-
Component/s: Sharding
-
None
-
ALL
-
-
None
-
None
-
None
-
None
-
None
-
None
-
None
Hi,
A week ago, we upgraded our Shared Cluster, from MongoDB 3.4.16 to 3.4.17. The first step was migrating all mongos to 3.4.17, and later the replicasets and config servers.
Our infraestructure is:
- 3 Datacenters (2 in the same country, with a another one in another, the ping between the 2 datacenters in the same country is less than 1ms, but the remote one is 100ms).
- All datacenters are connected with an stable VPN.
- We have a replicaset with 8 servers, 4 replicasets, 4 config servers and 4 arbiters for the replicasets (not for config servers).
- For connecting to the cluster, we have 120 servers, with a mongos installed on them.
The problem cames that all was working correctly before this upgrade, but after it, the remote mongos servers (with the ping of 100ms), apparently aren't able to connect to config servers.
I'm attaching the mongos log with network verbosity to 5 and diagnostic.data .
But basically when we execute a query, we get the next error:
mongos> db.getCollection('hadCompressed').find( { "chkin": "2018-10-11", "n": 1, "gid": 33794, "occ": "1::2-0/", "rtype": 1, "nid": 30200 } ); Error: error: { "ok" : 0, "errmsg" : "Failed to load chunks due to Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: \"chunks\" }", "code" : 50, "codeName" : "ExceededTimeLimit" }
In the log we get the next:
2018-10-10T10:10:24.040+0200 D ASIO [NetworkInterfaceASIO-ShardRegistry-0] Request 110115 timed out, adjusted timeout after getting connection from pool was 30000ms, op was id: 7409, states: [ UNINITIALIZED, IN_PROGRESS ], start_time: 2018-10-10T10:09:54.040+0200, request: RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } 2018-10-10T10:10:24.040+0200 D ASIO [NetworkInterfaceASIO-ShardRegistry-0] Operation timing out; original request was: RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } 2018-10-10T10:10:24.040+0200 D ASIO [NetworkInterfaceASIO-ShardRegistry-0] Failed to execute command: RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } reason: NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } 2018-10-10T10:10:24.040+0200 D ASIO [NetworkInterfaceASIO-ShardRegistry-0] Request 110115 finished with response: NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } 2018-10-10T10:10:24.040+0200 I ASIO [NetworkInterfaceASIO-ShardRegistry-0] Ending connection to host mmhad03a:26040 due to bad connection status; 3 connections to that host remain open 2018-10-10T10:10:24.040+0200 I NETWORK [CatalogCacheLoader-0] Marking host mmhad03a:26040 as failed :: caused by :: ExceededTimeLimit: Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } 2018-10-10T10:10:24.040+0200 I SHARDING [CatalogCacheLoader-0] Operation timed out :: caused by :: ExceededTimeLimit: Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } 2018-10-10T10:10:24.040+0200 D - [CatalogCacheLoader-0] User Assertion: 50:Failed to load chunks due to Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } src/mongo/s/config_server_catalog_cache_loader.cpp 112 2018-10-10T10:10:24.040+0200 D - [CatalogCacheLoader-0] User Assertion: 50:Failed to load chunks due to Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } src/mongo/s/catalog_cache.cpp 82 2018-10-10T10:10:24.040+0200 I SHARDING [CatalogCacheLoader-0] Refresh for collection had.hadCompressed took 30409 ms and failed :: caused by :: ExceededTimeLimit: Failed to load chunks due to Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } 2018-10-10T10:10:24.041+0200 D EXECUTOR [CatalogCacheLoader-0] Reaping this thread; next thread reaped no earlier than 2018-10-10T10:10:54.041+0200 2018-10-10T10:10:24.041+0200 D EXECUTOR [CatalogCacheLoader-0] shutting down thread in pool CatalogCacheLoader
This is error is not happening on the same or nearest datacenter in which the cluster is deployed.
This is our config servers configuration:
security: keyFile: /etc/.mongoshardkey storage: dbPath: /mongodb-data/mongodb-cfg journal: enabled: true systemLog: # verbosity: 5 destination: file logAppend: true path: /var/log/mongodb/mongod-cfg.log net: port: 26040 bindIp: 0.0.0.0 compression: compressors: snappy replication: replSetName: cfg0 sharding: clusterRole: "configsvr"
This is our mongos configuration (the server have 32 cores and 74 threads, we tried other ShardingTaskExecutor values, with no luck (and before upgrading was working perfect):
systemLog:
verbosity: 2
destination: file
logAppend: true
path: /usr/local/mongodb/log/mongos.log
component:
network:
verbosity: 5
net:
port: 27017
bindIp: 0.0.0.0
compression:
compressors: snappy
setParameter:
ShardingTaskExecutorPoolMaxSize: 8
ShardingTaskExecutorPoolMinSize: 4
ShardingTaskExecutorPoolHostTimeoutMS: 30000
sharding:
configDB: cfg0/mmhad01a:26040,mmhad02a:26040,mmhad03a:26040,mmhad04a:26040
security:
keyFile: /etc/.mongoshardkey
Thanks in advance