Loading...

Type: Bug
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 3.4.17
Component/s: Sharding
Labels:
None

Operating System:
ALL
Steps To Reproduce:

Hide

Upgrade from mongoDB cluster from 3.4.16 to mongos 3.4.17, and remote servers with a ping >99ms stopped connecting correctly to config servers.

Show
Upgrade from mongoDB cluster from 3.4.16 to mongos 3.4.17, and remote servers with a ping >99ms stopped connecting correctly to config servers.
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Hi,

A week ago, we upgraded our Shared Cluster, from MongoDB 3.4.16 to 3.4.17. The first step was migrating all mongos to 3.4.17, and later the replicasets and config servers.

Our infraestructure is:

3 Datacenters (2 in the same country, with a another one in another, the ping between the 2 datacenters in the same country is less than 1ms, but the remote one is 100ms).
All datacenters are connected with an stable VPN.
We have a replicaset with 8 servers, 4 replicasets, 4 config servers and 4 arbiters for the replicasets (not for config servers).
For connecting to the cluster, we have 120 servers, with a mongos installed on them.

The problem cames that all was working correctly before this upgrade, but after it, the remote mongos servers (with the ping of 100ms), apparently aren't able to connect to config servers.

I'm attaching the mongos log with network verbosity to 5 and diagnostic.data .

But basically when we execute a query, we get the next error:

mongos> db.getCollection('hadCompressed').find( { "chkin": "2018-10-11", "n": 1, "gid": 33794, "occ": "1::2-0/", "rtype": 1, "nid": 30200 } ); 

Error: error: { "ok" : 0, "errmsg" : "Failed to load chunks due to Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: \"chunks\" }", "code" : 50, "codeName" : "ExceededTimeLimit" }

In the log we get the next:

2018-10-10T10:10:24.040+0200 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Request 110115 timed out, adjusted timeout after getting connection from pool was 30000ms, op was id: 7409, states: [ UNINITIALIZED, IN_PROGRESS ], start_time: 2018-10-10T10:09:54.040+0200, request: RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
2018-10-10T10:10:24.040+0200 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Operation timing out; original request was: RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
2018-10-10T10:10:24.040+0200 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Failed to execute command: RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } reason: NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
2018-10-10T10:10:24.040+0200 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Request 110115 finished with response: NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
2018-10-10T10:10:24.040+0200 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Ending connection to host mmhad03a:26040 due to bad connection status; 3 connections to that host remain open
2018-10-10T10:10:24.040+0200 I NETWORK  [CatalogCacheLoader-0] Marking host mmhad03a:26040 as failed :: caused by :: ExceededTimeLimit: Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
2018-10-10T10:10:24.040+0200 I SHARDING [CatalogCacheLoader-0] Operation timed out  :: caused by :: ExceededTimeLimit: Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
2018-10-10T10:10:24.040+0200 D -        [CatalogCacheLoader-0] User Assertion: 50:Failed to load chunks due to Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } src/mongo/s/config_server_catalog_cache_loader.cpp 112
2018-10-10T10:10:24.040+0200 D -        [CatalogCacheLoader-0] User Assertion: 50:Failed to load chunks due to Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } src/mongo/s/catalog_cache.cpp 82
2018-10-10T10:10:24.040+0200 I SHARDING [CatalogCacheLoader-0] Refresh for collection had.hadCompressed took 30409 ms and failed :: caused by :: ExceededTimeLimit: Failed to load chunks due to Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
2018-10-10T10:10:24.041+0200 D EXECUTOR [CatalogCacheLoader-0] Reaping this thread; next thread reaped no earlier than 2018-10-10T10:10:54.041+0200
2018-10-10T10:10:24.041+0200 D EXECUTOR [CatalogCacheLoader-0] shutting down thread in pool CatalogCacheLoader

This is error is not happening on the same or nearest datacenter in which the cluster is deployed.

This is our config servers configuration:

security:
   keyFile: /etc/.mongoshardkey

storage:
  dbPath: /mongodb-data/mongodb-cfg
  journal:
    enabled: true

systemLog:
#  verbosity: 5
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod-cfg.log

net:
  port: 26040
  bindIp: 0.0.0.0
  compression:
      compressors: snappy

replication:
  replSetName: cfg0

sharding:
  clusterRole: "configsvr"

This is our mongos configuration (the server have 32 cores and 74 threads, we tried other ShardingTaskExecutor values, with no luck (and before upgrading was working perfect):

systemLog:
  verbosity: 2 
  destination: file
  logAppend: true
  path: /usr/local/mongodb/log/mongos.log
  component:
   network:
    verbosity: 5

net:
  port: 27017
  bindIp: 0.0.0.0
  compression:
      compressors: snappy

setParameter:
   ShardingTaskExecutorPoolMaxSize: 8
   ShardingTaskExecutorPoolMinSize: 4
   ShardingTaskExecutorPoolHostTimeoutMS: 30000

sharding:
   configDB: cfg0/mmhad01a:26040,mmhad02a:26040,mmhad03a:26040,mmhad04a:26040

security:
   keyFile: /etc/.mongoshardkey

Thanks in advance

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

diagnosticdata.tar.gz
28.45 MB
Oct 10 2018 08:49:48 AM UTC
mongos.log.tar.gz
130.98 MB
Oct 10 2018 08:52:08 AM UTC

Details

Description

Attachments

Attachments

Activity

People

Dates