[SERVER-37540] Operation timed out between config servers and mongos after upgrading from 3.4.16 to 3.4.17 Created: 10/Oct/18  Updated: 26/Oct/18  Resolved: 26/Oct/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.4.17
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Roberto Rodriguez Assignee: Danny Hatcher (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File diagnosticdata.tar.gz     File mongos.log.tar.gz    
Operating System: ALL
Steps To Reproduce:

Upgrade the mongos processes of a MongoDB 3.4.16 cluster to 3.4.17; afterwards, remote mongos servers with a ping of >99 ms stop connecting correctly to the config servers.

Participants:

 Description   

Hi,

A week ago, we upgraded our sharded cluster from MongoDB 3.4.16 to 3.4.17. The first step was upgrading all mongos processes to 3.4.17, followed later by the replica sets and config servers.

Our infrastructure is:

  • 3 datacenters (2 in the same country and a third one in another country; the ping between the 2 datacenters in the same country is less than 1 ms, while the ping to the remote one is about 100 ms).
  • All datacenters are connected through a stable VPN.
  • The cluster has 4 replica sets with 8 data-bearing servers in total, 4 config servers, and 4 arbiters for the shard replica sets (none for the config servers).
  • To connect to the cluster, we have 120 servers, each with a mongos installed on it.

Everything was working correctly before this upgrade, but after it the remote mongos servers (the ones with ~100 ms ping) apparently aren't able to connect to the config servers.

I'm attaching the mongos log with network verbosity set to 5, along with diagnostic.data.

Basically, when we execute a query, we get the following error:

mongos> db.getCollection('hadCompressed').find( { "chkin": "2018-10-11", "n": 1, "gid": 33794, "occ": "1::2-0/", "rtype": 1, "nid": 30200 } ); 
 
Error: error: { "ok" : 0, "errmsg" : "Failed to load chunks due to Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: \"chunks\" }", "code" : 50, "codeName" : "ExceededTimeLimit" }

In the log we see the following:

2018-10-10T10:10:24.040+0200 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Request 110115 timed out, adjusted timeout after getting connection from pool was 30000ms, op was id: 7409, states: [ UNINITIALIZED, IN_PROGRESS ], start_time: 2018-10-10T10:09:54.040+0200, request: RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
2018-10-10T10:10:24.040+0200 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Operation timing out; original request was: RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
2018-10-10T10:10:24.040+0200 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Failed to execute command: RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } reason: NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
2018-10-10T10:10:24.040+0200 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Request 110115 finished with response: NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
2018-10-10T10:10:24.040+0200 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Ending connection to host mmhad03a:26040 due to bad connection status; 3 connections to that host remain open
2018-10-10T10:10:24.040+0200 I NETWORK  [CatalogCacheLoader-0] Marking host mmhad03a:26040 as failed :: caused by :: ExceededTimeLimit: Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
2018-10-10T10:10:24.040+0200 I SHARDING [CatalogCacheLoader-0] Operation timed out  :: caused by :: ExceededTimeLimit: Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
2018-10-10T10:10:24.040+0200 D -        [CatalogCacheLoader-0] User Assertion: 50:Failed to load chunks due to Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } src/mongo/s/config_server_catalog_cache_loader.cpp 112
2018-10-10T10:10:24.040+0200 D -        [CatalogCacheLoader-0] User Assertion: 50:Failed to load chunks due to Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } src/mongo/s/catalog_cache.cpp 82
2018-10-10T10:10:24.040+0200 I SHARDING [CatalogCacheLoader-0] Refresh for collection had.hadCompressed took 30409 ms and failed :: caused by :: ExceededTimeLimit: Failed to load chunks due to Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
2018-10-10T10:10:24.041+0200 D EXECUTOR [CatalogCacheLoader-0] Reaping this thread; next thread reaped no earlier than 2018-10-10T10:10:54.041+0200
2018-10-10T10:10:24.041+0200 D EXECUTOR [CatalogCacheLoader-0] shutting down thread in pool CatalogCacheLoader

This error is not happening on mongos servers in the same or nearby datacenter where the cluster is deployed.
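
For reference, this is the kind of check that can be run from one of the remote mongos hosts against the config server named in the log above, to see how long the same config.chunks read takes over the 100 ms link (a minimal sketch; it assumes a shell opened directly against mmhad03a:26040 with cluster credentials, and the namespace is the one from the failing refresh):

// Opened with something like:
//   mongo --host mmhad03a --port 26040 -u <clusterAdmin> -p --authenticationDatabase admin
var start = new Date();
// Roughly the same data the CatalogCacheLoader pulls: the chunk metadata for the collection.
var count = db.getSiblingDB("config").chunks
              .find({ ns: "had.hadCompressed" })
              .itcount();   // itcount() drives the getMore batches over the slow link
print("chunks for had.hadCompressed: " + count + ", elapsed ms: " + (new Date() - start));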

This is our config server configuration:

security:
   keyFile: /etc/.mongoshardkey
 
storage:
  dbPath: /mongodb-data/mongodb-cfg
  journal:
    enabled: true
 
systemLog:
#  verbosity: 5
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod-cfg.log
 
net:
  port: 26040
  bindIp: 0.0.0.0
  compression:
      compressors: snappy
 
replication:
  replSetName: cfg0
 
sharding:
  clusterRole: "configsvr"

This is our mongos configuration (the server has 32 cores and 74 threads; we tried other ShardingTaskExecutor values with no luck, and before the upgrade everything was working perfectly):

systemLog:
  verbosity: 2 
  destination: file
  logAppend: true
  path: /usr/local/mongodb/log/mongos.log
  component:
   network:
    verbosity: 5
 
net:
  port: 27017
  bindIp: 0.0.0.0
  compression:
      compressors: snappy
 
setParameter:
   ShardingTaskExecutorPoolMaxSize: 8
   ShardingTaskExecutorPoolMinSize: 4
   ShardingTaskExecutorPoolHostTimeoutMS: 30000
 
sharding:
   configDB: cfg0/mmhad01a:26040,mmhad02a:26040,mmhad03a:26040,mmhad04a:26040
 
security:
   keyFile: /etc/.mongoshardkey
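
For reference, the effective pool values on a running mongos can be checked with getParameter (a minimal sketch; it just assumes a shell connected to the mongos on port 27017 with enough privileges to run adminCommand):

// Run against the mongos; the values should match the setParameter section above.
db.adminCommand({
    getParameter: 1,
    ShardingTaskExecutorPoolMaxSize: 1,
    ShardingTaskExecutorPoolMinSize: 1,
    ShardingTaskExecutorPoolHostTimeoutMS: 1
})
// Expected (given the config above):
// { "ShardingTaskExecutorPoolMaxSize" : 8, "ShardingTaskExecutorPoolMinSize" : 4,
//   "ShardingTaskExecutorPoolHostTimeoutMS" : 30000, "ok" : 1 }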

Thanks in advance



 Comments   
Comment by Danny Hatcher (Inactive) [ 26/Oct/18 ]

Roberto,

It is possible that the mongos processes got "locked onto" a "dead" config server during the upgrade process. Generally, we recommend upgrading the mongos processes last when bringing a sharded cluster to a newer version. This ensures that the mongos processes have an up-to-date view of the cluster post-upgrade.
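
If a mongos does appear to be holding on to stale config server state after an upgrade, restarting that mongos, or flushing its cached routing metadata, is usually enough to force a fresh view (a minimal sketch; flushRouterConfig is a standard mongos admin command, but whether it would have helped in this particular case is an assumption):

// Run on the affected mongos; its cached routing table is discarded and
// rebuilt from the config servers on the next operation that needs it.
db.adminCommand({ flushRouterConfig: 1 })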

I'm going to close this ticket for now but if you do encounter this problem again, please let us know.

Thank you,

Danny

Comment by Roberto Rodriguez [ 26/Oct/18 ]

Hi,

The logs rotated, so I don't have those logs at the moment. I solved it by installing two config servers in the datacenter that had the problems. It's still strange, though, that these problems only appeared after upgrading the mongos version.
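
For anyone who runs into the same situation, adding config server replica set members in the remote datacenter looks roughly like this (a minimal sketch; remote-cfg01 is a placeholder hostname, the new mongod must already be running with clusterRole: configsvr, and the command is run on the cfg0 primary):

// On the cfg0 primary, add the new config server in the remote datacenter;
// it will catch up via initial sync before becoming a usable secondary.
rs.add("remote-cfg01:26040")
// Confirm the resulting membership.
rs.conf().members.forEach(function (m) { print(m._id + "  " + m.host); })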

Thanks

Comment by Danny Hatcher (Inactive) [ 24/Oct/18 ]

Hello Roberto,

I apologize for the delay in response. Thank you for the detail in the initial description; it is much appreciated. It looks like there was an issue at connection establishment time, but the errors present are on the getMore, not the initial find. Do you happen to have logs from the primary config server at the time of the logs above? If not, and the problem is still occurring, could you attach updated logs from the problem mongos and the primary config server?

Thanks,

Danny
