Core Server / SERVER-26722

router blocks and throws ExceededTimeLimit


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Duplicate
    • Affects Version/s: 3.2.10
    • Fix Version/s: None
    • Component/s: Querying, Sharding
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Sprint: Sharding 2016-11-21, Sharding 2016-12-12

      Description

      We updated one of our sharded clusters from v3.0.12 to v3.2.10. Since then, the cluster has not been operational because many operations get blocked by the router. The corresponding log messages look like this:

      2016-10-20T11:00:22.902+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to s559:27017 - ExceededTimeLimit: Operation timed out
      2016-10-20T11:00:22.918+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-007.ipx:27017 - ExceededTimeLimit: Operation timed out
      2016-10-20T11:00:22.920+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-024.ipx:27017 - ExceededTimeLimit: Operation timed out
      2016-10-20T11:00:22.921+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-007.ipx:27017 - ExceededTimeLimit: Operation timed out
      2016-10-20T11:00:22.921+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-024.ipx:27017 - ExceededTimeLimit: Operation timed out
      

      We can reproduce the issue at any time simply by executing a findOne through the router repeatedly:

      for (x = 0; x < 1000; x++) {
          db.offer.find({ "_id": NumberLong("5672494983") }).forEach(function(u) { printjson(u); });
          print(x);
      }
      

      It blocks after only a few findOnes.
      If we execute the same code directly on the shard where the document is located, there is no blocking at all.

      We found that the MongoDB router at v3.0.12 does not have this problem. This is why we downgraded all our routers to v3.0.12, even though the rest of the cluster (the mongod processes) is still running v3.2.10.

      Please see the attached log file from the router, as well as the three monitoring screenshots of the router's TCP sockets. As the screenshots show, tcp_tw (the count of sockets in TCP TIME_WAIT) is much higher under v3.2.10 than under v3.0.12.
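      The tcp_tw metric above can be sampled without any monitoring agent by reading the kernel's socket table. A minimal sketch, not part of the original report, assuming a Linux host where /proc/net/tcp is available and using the kernel's hex state code 06 for TIME_WAIT:

      ```python
      import os

      def count_time_wait(proc_path="/proc/net/tcp"):
          """Count TCP sockets currently in TIME_WAIT (Linux procfs only)."""
          if not os.path.exists(proc_path):
              return 0  # not on Linux, or procfs unavailable
          count = 0
          with open(proc_path) as f:
              next(f)  # skip the header line
              for line in f:
                  fields = line.split()
                  # fields[3] holds the connection state as a two-digit hex code;
                  # "06" is TIME_WAIT in the kernel's TCP state enumeration
                  if len(fields) > 3 and fields[3] == "06":
                      count += 1
          return count

      if __name__ == "__main__":
          print(count_time_wait())
      ```

      Polling this in a loop while running the findOne repro would show whether TIME_WAIT sockets accumulate on the router, matching the screenshots.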

        Attachments

        1. fr-11_tcpwait.jpg
          173 kB
          kay.agahd
        2. fr-11_tcpwaitOnly.jpg
          169 kB
          kay.agahd
        3. offerstore-en-router-03_afterDowngradeToV3.0.10.png
          201 kB
          kay.agahd
        4. offerstore-en-router-03_afterUpgradeToV3.2.10.png
          201 kB
          kay.agahd
        5. offerstore-en-router-03_beforeUpgrade.png
          202 kB
          kay.agahd
        6. offerstore-en-router-03_testWithV3.2.10.png
          232 kB
          kay.agahd
        7. offerstore-en-router-03.ipx.2016-10-20.log.tgz
          9.46 MB
          kay.agahd
        8. offerstore-en-router-03.ipx.2016-10-26.log.tgz
          3.74 MB
          kay.agahd


              People

              Votes: 2
              Watchers: 17
