Operation timed out between config servers and mongos after upgrading from 3.4.16 to 3.4.17


    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 3.4.17
    • Component/s: Sharding
    • Upgraded our MongoDB cluster from 3.4.16 to 3.4.17; after the upgrade, remote mongos servers with a ping >99 ms stopped connecting correctly to the config servers.

      Hi,

      A week ago we upgraded our sharded cluster from MongoDB 3.4.16 to 3.4.17. The first step was upgrading all the mongos instances to 3.4.17, followed by the replica sets and the config servers.

      Our infrastructure is:

      • 3 datacenters (2 in the same country and a third one in another country; the ping between the 2 datacenters in the same country is less than 1 ms, while the ping to the remote one is about 100 ms).
      • All datacenters are connected through a stable VPN.
      • We have 8 servers forming 4 replica sets, plus 4 config servers and 4 arbiters for the replica sets (not for the config servers).
      • For connecting to the cluster, we have 120 servers, each with a mongos installed on it.

      Everything was working correctly before this upgrade, but after it the remote mongos servers (the ones with the ~100 ms ping) apparently are not able to connect to the config servers.
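      As a basic check (purely illustrative; mmhad03a:26040 is the config server named in the error below), we can connect from one of the remote mongos hosts directly to a config server and measure a simple round trip from the mongo shell:

      // From a remote mongos host: mongo --host mmhad03a --port 26040
      // Rough round-trip check against the config server (diagnostic only):
      var start = new Date();
      db.adminCommand({ ping: 1 });
      print("round trip (ms): " + (new Date() - start));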

      I'm attaching the mongos log with network verbosity set to 5, and the diagnostic.data.

      Basically, when we execute a query we get the following error:

      mongos> db.getCollection('hadCompressed').find( { "chkin": "2018-10-11", "n": 1, "gid": 33794, "occ": "1::2-0/", "rtype": 1, "nid": 30200 } ); 
      
      Error: error: { "ok" : 0, "errmsg" : "Failed to load chunks due to Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: \"chunks\" }", "code" : 50, "codeName" : "ExceededTimeLimit" }
      

      In the log we see the following:

      2018-10-10T10:10:24.040+0200 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Request 110115 timed out, adjusted timeout after getting connection from pool was 30000ms, op was id: 7409, states: [ UNINITIALIZED, IN_PROGRESS ], start_time: 2018-10-10T10:09:54.040+0200, request: RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
      2018-10-10T10:10:24.040+0200 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Operation timing out; original request was: RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
      2018-10-10T10:10:24.040+0200 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Failed to execute command: RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } reason: NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
      2018-10-10T10:10:24.040+0200 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Request 110115 finished with response: NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
      2018-10-10T10:10:24.040+0200 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Ending connection to host mmhad03a:26040 due to bad connection status; 3 connections to that host remain open
      2018-10-10T10:10:24.040+0200 I NETWORK  [CatalogCacheLoader-0] Marking host mmhad03a:26040 as failed :: caused by :: ExceededTimeLimit: Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
      2018-10-10T10:10:24.040+0200 I SHARDING [CatalogCacheLoader-0] Operation timed out  :: caused by :: ExceededTimeLimit: Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
      2018-10-10T10:10:24.040+0200 D -        [CatalogCacheLoader-0] User Assertion: 50:Failed to load chunks due to Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } src/mongo/s/config_server_catalog_cache_loader.cpp 112
      2018-10-10T10:10:24.040+0200 D -        [CatalogCacheLoader-0] User Assertion: 50:Failed to load chunks due to Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" } src/mongo/s/catalog_cache.cpp 82
      2018-10-10T10:10:24.040+0200 I SHARDING [CatalogCacheLoader-0] Refresh for collection had.hadCompressed took 30409 ms and failed :: caused by :: ExceededTimeLimit: Failed to load chunks due to Operation timed out, request was RemoteCommand 110115 -- target:mmhad03a:26040 db:config expDate:2018-10-10T10:10:24.040+0200 cmd:{ getMore: 54542999558, collection: "chunks" }
      2018-10-10T10:10:24.041+0200 D EXECUTOR [CatalogCacheLoader-0] Reaping this thread; next thread reaped no earlier than 2018-10-10T10:10:54.041+0200
      2018-10-10T10:10:24.041+0200 D EXECUTOR [CatalogCacheLoader-0] shutting down thread in pool CatalogCacheLoader
      

      This error is not happening on mongos servers in the same or the nearest datacenter where the cluster is deployed.
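      The request that times out is a getMore against the config.chunks collection, so the mongos seems to time out while loading the chunk metadata for had.hadCompressed (the namespace from the log above) over the 100 ms link. As a rough, purely diagnostic check of how much metadata is involved, something like the following could be run against a config server:

      // On a config server (e.g. mongo --host mmhad03a --port 26040, after authenticating):
      var cfg = db.getSiblingDB("config");
      // Number of chunk documents the mongos has to page through for this collection:
      print("chunks for had.hadCompressed: " + cfg.chunks.count({ ns: "had.hadCompressed" }));
      // Overall size of the chunk metadata collection, in bytes:
      print("config.chunks size (bytes): " + cfg.chunks.stats().size);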

      This is our config servers configuration:

      security:
         keyFile: /etc/.mongoshardkey
      
      storage:
        dbPath: /mongodb-data/mongodb-cfg
        journal:
          enabled: true
      
      systemLog:
      #  verbosity: 5
        destination: file
        logAppend: true
        path: /var/log/mongodb/mongod-cfg.log
      
      net:
        port: 26040
        bindIp: 0.0.0.0
        compression:
            compressors: snappy
      
      replication:
        replSetName: cfg0
      
      sharding:
        clusterRole: "configsvr"
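      For completeness, the health of the cfg0 replica set can be verified from any config server with the standard rs.status() helper (this is only a sanity check, not a fix; the member names come from the configDB string in the mongos configuration below):

      // On any config server, print each member's state and measured ping:
      rs.status().members.forEach(function (m) {
          print(m.name + "  " + m.stateStr + "  pingMs=" + (m.pingMs === undefined ? "n/a" : m.pingMs));
      });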
      

      This is our mongos configuration (the server has 32 cores and 74 threads). We tried other ShardingTaskExecutor values with no luck (see the check after the configuration below), and before the upgrade it was working perfectly:

      systemLog:
        verbosity: 2 
        destination: file
        logAppend: true
        path: /usr/local/mongodb/log/mongos.log
        component:
         network:
          verbosity: 5
      
      net:
        port: 27017
        bindIp: 0.0.0.0
        compression:
            compressors: snappy
      
      setParameter:
         ShardingTaskExecutorPoolMaxSize: 8
         ShardingTaskExecutorPoolMinSize: 4
         ShardingTaskExecutorPoolHostTimeoutMS: 30000
      
      sharding:
         configDB: cfg0/mmhad01a:26040,mmhad02a:26040,mmhad03a:26040,mmhad04a:26040
      
      security:
         keyFile: /etc/.mongoshardkey
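
      For reference, the values these parameters actually have on a running mongos can be confirmed with the standard getParameter admin command (the parameter names are the same ones used in the configuration above):

      // Run against the mongos (port 27017) to confirm the pool settings in effect:
      db.adminCommand({
          getParameter: 1,
          ShardingTaskExecutorPoolMaxSize: 1,
          ShardingTaskExecutorPoolMinSize: 1,
          ShardingTaskExecutorPoolHostTimeoutMS: 1
      });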
      

      Thanks in advance

        1. diagnosticdata.tar.gz (28.45 MB, Roberto Rodriguez)
        2. mongos.log.tar.gz (130.98 MB, Roberto Rodriguez)

            Assignee:
            Danny Hatcher (Inactive)
            Reporter:
            Roberto Rodriguez
            Votes:
            0
            Watchers:
            9
