Details
-
Bug
-
Resolution: Done
-
Major - P3
-
None
-
3.4.17
-
None
-
ALL
-
Description
Hi,
A week ago, we upgraded our Shared Cluster, from MongoDB 3.4.16 to 3.4.17. The first step was migrating all mongos to 3.4.17, and later the replicasets and config servers.
Our infraestructure is:
- 3 Datacenters (2 in the same country, with a another one in another, the ping between the 2 datacenters in the same country is less than 1ms, but the remote one is 100ms).
- All datacenters are connected with an stable VPN.
- We have a replicaset with 8 servers, 4 replicasets, 4 config servers and 4 arbiters for the replicasets (not for config servers).
- For connecting to the cluster, we have 120 servers, with a mongos installed on them.
The problem cames that all was working correctly before this upgrade, but after it, the remote mongos servers (with the ping of 100ms), apparently aren't able to connect to config servers.
I'm attaching the mongos log with network verbosity to 5 and diagnostic.data .
But basically when we execute a query, we get the next error:
mongos> db.getCollection('hadCompressed').find( { "chkin": "2018-10-11", "n": 1, "gid": 33794, "occ": "1::2-0/", "rtype": 1, "nid": 30200 } ); |
|
|
Error: error: { "ok" : 0, "errmsg" : "Failed to load chunks due to Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: \"chunks\" }", "code" : 50, "codeName" : "ExceededTimeLimit" } |
In the log we get the next:
2018-10-10T10:10:24.040+0200 D ASIO [NetworkInterfaceASIO-ShardRegistry-0] Request 110115 timed out, adjusted timeout after getting connection from pool was 30000ms, op was id: 7409, states: [ UNINITIALIZED, IN_PROGRESS ], start_time: 2018-10-10T10:09:54.040+0200, request: RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } |
2018-10-10T10:10:24.040+0200 D ASIO [NetworkInterfaceASIO-ShardRegistry-0] Operation timing out; original request was: RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } |
2018-10-10T10:10:24.040+0200 D ASIO [NetworkInterfaceASIO-ShardRegistry-0] Failed to execute command: RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } reason: NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } |
2018-10-10T10:10:24.040+0200 D ASIO [NetworkInterfaceASIO-ShardRegistry-0] Request 110115 finished with response: NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } |
2018-10-10T10:10:24.040+0200 I ASIO [NetworkInterfaceASIO-ShardRegistry-0] Ending connection to host mmhad03a:26040 due to bad connection status; 3 connections to that host remain open |
2018-10-10T10:10:24.040+0200 I NETWORK [CatalogCacheLoader-0] Marking host mmhad03a:26040 as failed :: caused by :: ExceededTimeLimit: Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } |
2018-10-10T10:10:24.040+0200 I SHARDING [CatalogCacheLoader-0] Operation timed out :: caused by :: ExceededTimeLimit: Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } |
2018-10-10T10:10:24.040+0200 D - [CatalogCacheLoader-0] User Assertion: 50:Failed to load chunks due to Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } src/mongo/s/config_server_catalog_cache_loader.cpp 112 |
2018-10-10T10:10:24.040+0200 D - [CatalogCacheLoader-0] User Assertion: 50:Failed to load chunks due to Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } src/mongo/s/catalog_cache.cpp 82 |
2018-10-10T10:10:24.040+0200 I SHARDING [CatalogCacheLoader-0] Refresh for collection had.hadCompressed took 30409 ms and failed :: caused by :: ExceededTimeLimit: Failed to load chunks due to Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } |
2018-10-10T10:10:24.041+0200 D EXECUTOR [CatalogCacheLoader-0] Reaping this thread; next thread reaped no earlier than 2018-10-10T10:10:54.041+0200 |
2018-10-10T10:10:24.041+0200 D EXECUTOR [CatalogCacheLoader-0] shutting down thread in pool CatalogCacheLoader |
This is error is not happening on the same or nearest datacenter in which the cluster is deployed.
This is our config servers configuration:
security:
|
keyFile: /etc/.mongoshardkey
|
|
|
storage:
|
dbPath: /mongodb-data/mongodb-cfg
|
journal:
|
enabled: true |
|
|
systemLog:
|
# verbosity: 5 |
destination: file
|
logAppend: true |
path: /var/log/mongodb/mongod-cfg.log
|
|
|
net:
|
port: 26040 |
bindIp: 0.0.0.0 |
compression:
|
compressors: snappy
|
|
|
replication:
|
replSetName: cfg0
|
|
|
sharding:
|
clusterRole: "configsvr" |
This is our mongos configuration (the server have 32 cores and 74 threads, we tried other ShardingTaskExecutor values, with no luck (and before upgrading was working perfect):
systemLog:
|
verbosity: 2 |
destination: file
|
logAppend: true |
path: /usr/local/mongodb/log/mongos.log
|
component:
|
network:
|
verbosity: 5 |
|
|
net:
|
port: 27017 |
bindIp: 0.0.0.0 |
compression:
|
compressors: snappy
|
|
|
setParameter:
|
ShardingTaskExecutorPoolMaxSize: 8 |
ShardingTaskExecutorPoolMinSize: 4 |
ShardingTaskExecutorPoolHostTimeoutMS: 30000 |
|
|
sharding:
|
configDB: cfg0/mmhad01a:26040,mmhad02a:26040,mmhad03a:26040,mmhad04a:26040 |
|
|
security:
|
keyFile: /etc/.mongoshardkey
|
Thanks in advance