  1. Core Server
  2. SERVER-28629

router blocks and throws ExceededTimeLimit

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Duplicate
    • Affects Version/s: 3.2.12
    • Fix Version/s: None
    • Component/s: Networking, Sharding
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Steps To Reproduce:
      We can reproduce the issue at any time just by executing a findOne through the router several times:

      for (var x = 0; x < 1000; x++) { db.offer.find({"_id" : NumberLong("5672494983")}).forEach(function (u) { printjson(u); }); print(x); }
      

      It already blocks after a few findOne calls.
      If we execute the same code directly on the shard where the document is located, there is no blocking at all.
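To pin down which iterations stall, the loop above can be wrapped in a small timing helper. This is only a sketch: `timeLookups`, its parameters, and the one-second threshold in the usage comment are hypothetical names and values chosen for illustration, not part of the original report.

```javascript
// Hypothetical helper: runs `lookup` repeatedly and records the slow iterations.
// `lookup` is any function performing one query; `slowMs` is an assumed threshold.
function timeLookups(lookup, iterations, slowMs) {
  var slow = [];
  for (var i = 0; i < iterations; i++) {
    var start = Date.now();
    lookup();
    var elapsed = Date.now() - start;
    if (elapsed >= slowMs) {
      slow.push({ iteration: i, ms: elapsed });
    }
  }
  return slow;
}

// Possible use in the mongo shell, connected to the router:
// printjson(timeLookups(function () {
//   return db.offer.findOne({ _id: NumberLong("5672494983") });
// }, 1000, 1000));
```

Running the same call against the router and then directly against the shard would show whether the slow iterations occur only on the routed path.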

      Description

      We are using a new sharded cluster running v3.2.12. Our cluster is not operational because many operations get blocked by the router. The corresponding log messages look like this:

      2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-070.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-073.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-067.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-070.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-067.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-073.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.768+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-067.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.768+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-073.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.768+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-066.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.768+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-070.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      
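To see how the timeouts are spread across shard hosts, log lines like the ones above can be tallied per host. The following is a minimal sketch that assumes the message format shown in this excerpt; `countConnectTimeouts` is a hypothetical helper name.

```javascript
// Hypothetical helper: counts "Failed to connect ... ExceededTimeLimit" lines per host.
function countConnectTimeouts(logLines) {
  var pattern = /Failed to connect to (\S+) - ExceededTimeLimit/;
  var counts = {};
  logLines.forEach(function (line) {
    var match = pattern.exec(line);
    if (match) {
      counts[match[1]] = (counts[match[1]] || 0) + 1;
    }
  });
  return counts;
}

// Three of the log lines quoted above, used as sample input.
var sampleLines = [
  "2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-070.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out",
  "2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-073.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out",
  "2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-070.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out"
];
var timeoutCounts = countConnectTimeouts(sampleLines);
```

Applied to all ten lines quoted above, the tally is three failures each for mongo-067, mongo-070 and mongo-073, and one for mongo-066, all within roughly one millisecond.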

      We observe this behaviour regardless of whether the query uses the shard key or not. In all cases the queried field is indexed.

      A downgrade of the routers to v3.0.12 is not possible because our config servers are running as a replica set instead of a mirrored set.
      An upgrade of the routers to v3.4.3 is not possible because "Version 3.4 mongos instances cannot connect to earlier versions of mongod instances."
      https://docs.mongodb.com/manual/release-notes/3.4-compatibility/

      Please also see the 2 monitoring screenshots of the router's TCP sockets. As you can see, tcp_tw (tcp_timeWait) is very high.
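The TIME_WAIT count can also be checked directly on a router host, independently of the monitoring tool. The sketch below assumes a Linux host and counts sockets whose state column in `/proc/net/tcp` is `06` (TIME_WAIT). `countTimeWait` is a hypothetical helper name, and how the `tcp_tw` metric in the screenshots is actually collected is not stated in this ticket.

```javascript
// Hypothetical helper: counts TIME_WAIT sockets in the text of /proc/net/tcp.
// Assumes the Linux layout, where the 4th whitespace-separated column ("st")
// holds the socket state as a hex code and 06 means TIME_WAIT.
function countTimeWait(procNetTcpText) {
  var TIME_WAIT = "06";
  return procNetTcpText.split("\n").filter(function (line) {
    var fields = line.trim().split(/\s+/);
    return fields.length > 3 && fields[3] === TIME_WAIT;
  }).length;
}

// Possible use with Node.js on the router host (IPv4 sockets only;
// IPv6 sockets are listed separately in /proc/net/tcp6):
// console.log(countTimeWait(require("fs").readFileSync("/proc/net/tcp", "utf8")));
```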

      This ticket is related to SERVER-26722, which has been closed as "resolved and fixed in 3.2.12", but since we still have this issue, we've created this new ticket for it.

        Attachments

        1. figure_1.png (63 kB)
        2. fr-11_tcpwait.jpg (173 kB)
        3. fr-11_tcpwaitOnly.jpg (169 kB)
        4. tcp_timewait_3.2.12vs3.2.8.jpg (174 kB)
        5. tcp-tw_v3.0.12.jpg (123 kB)
        6. v3.2.12_latencies.jpg (236 kB)
        7. v3.2.12_tcp_tw.jpg (214 kB)
        8. v3.2.8_latencies.jpg (237 kB)
        9. v3.2.8_tcp_tw.jpg (194 kB)

              People

              • Votes: 4
              • Watchers: 20
