Core Server / SERVER-26722

router blocks and throws ExceededTimeLimit


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Duplicate
    • Affects Version/s: 3.2.10
    • Fix Version/s: None
    • Component/s: Querying, Sharding
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Sprint:
      Sharding 2016-11-21, Sharding 2016-12-12

      Description

      We have upgraded one of our sharded clusters from v3.0.12 to v3.2.10. Since then the cluster has not been operational, because many operations get blocked by the router. The corresponding log messages look like this:

      2016-10-20T11:00:22.902+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to s559:27017 - ExceededTimeLimit: Operation timed out
      2016-10-20T11:00:22.918+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-007.ipx:27017 - ExceededTimeLimit: Operation timed out
      2016-10-20T11:00:22.920+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-024.ipx:27017 - ExceededTimeLimit: Operation timed out
      2016-10-20T11:00:22.921+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-007.ipx:27017 - ExceededTimeLimit: Operation timed out
      2016-10-20T11:00:22.921+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-024.ipx:27017 - ExceededTimeLimit: Operation timed out
      

      We can reproduce the issue at any time simply by executing a findOne through the router repeatedly:

      for (x = 0; x < 1000; x++) { db.offer.find({"_id": NumberLong("5672494983")}).forEach(function(u) { printjson(u) }); print(x) }
      

      It blocks after only a few findOnes.
      If we execute the same code directly on the shard where the document is located, there is no blocking at all.

      We found that the v3.0.12 mongos router does not have this problem. This is why we downgraded all our routers to v3.0.12, even though the rest of the cluster (the mongod processes) is running v3.2.10.

      Please see the attached log file from the router.
      Please see also the three monitoring screenshots of the router's TCP sockets. As you can see, tcp_tw (TCP TIME_WAIT) is much higher for v3.2.10 than for v3.0.12.
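      For reference, the TIME_WAIT count shown in those screenshots can be spot-checked on the router host. This is a generic Linux sketch (not the monitoring tool that produced the screenshots); it reads /proc/net/tcp directly, so it does not depend on ss or netstat being installed:

      ```shell
      #!/bin/sh
      # Count TCP sockets in TIME_WAIT by reading /proc/net/tcp{,6} directly.
      # Column 4 ("st") holds the socket state in hex; 06 is TIME_WAIT.
      count=$(awk 'FNR > 1 && $4 == "06"' /proc/net/tcp /proc/net/tcp6 2>/dev/null | wc -l)
      echo "TIME_WAIT sockets: $count"
      ```

      Running this in a loop while executing the findOne repro above should show the TIME_WAIT count climbing on a v3.2.10 router, matching the screenshots.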

        Attachments

        1. fr-11_tcpwait.jpg (173 kB)
        2. fr-11_tcpwaitOnly.jpg (169 kB)
        3. offerstore-en-router-03_afterDowngradeToV3.0.10.png (201 kB)
        4. offerstore-en-router-03_afterUpgradeToV3.2.10.png (201 kB)
        5. offerstore-en-router-03_beforeUpgrade.png (202 kB)
        6. offerstore-en-router-03_testWithV3.2.10.png (232 kB)
        7. offerstore-en-router-03.ipx.2016-10-20.log.tgz (9.46 MB)
        8. offerstore-en-router-03.ipx.2016-10-26.log.tgz (3.74 MB)
        9. v3.2.11 - 201.zip (102.86 MB)
        10. v3.2.11 - 202.zip (16.96 MB)
        11. v3.2.11 - 203.zip (34.03 MB)

          People

          • Votes: 2
          • Watchers: 17
