Core Server / SERVER-26987

Sharded find operations stall for several seconds during chunk migration


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 3.2.11
    • Fix Version/s: 3.2.12
    • Component/s: Sharding
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Sprint:
      Sharding 2017-01-02

      Description

      Note: this is different from SERVER-22611.

      WiredTiger nodes running 3.2.10. During chunk migration we see severe drops in mongos query throughput (from 500 req/sec to 0), rendering the cluster unusable for several seconds at a time. This happens while the chunk is being read from the source node, exactly every 16 MB of reads (we see a surge of disk reads at that point). Using a tester that queries both mongos and the source mongod while a freeze is in progress, we found that only mongos is frozen; the source mongod remains responsive (a minimal sketch of such a probe follows the output below). To rule out any hardware bottleneck, we used machines with ample CPU and disk IOPS headroom for the mongos, config server, and mongod nodes.

      This is what a drop looks like (each line represents one second):

      444 Requests, 0 Failed (0 Timeouts)
      574 Requests, 0 Failed (0 Timeouts)
      516 Requests, 0 Failed (0 Timeouts)
      512 Requests, 0 Failed (0 Timeouts)
      463 Requests, 0 Failed (0 Timeouts)
      212 Requests, 0 Failed (0 Timeouts)
      28 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      203 Requests, 0 Failed (0 Timeouts)
      66 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      221 Requests, 0 Failed (0 Timeouts)
      69 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      320 Requests, 0 Failed (0 Timeouts)
      7 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      16 Requests, 0 Failed (0 Timeouts)
      98 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      313 Requests, 0 Failed (0 Timeouts)
      783 Requests, 0 Failed (0 Timeouts)
      355 Requests, 0 Failed (0 Timeouts)
      682 Requests, 0 Failed (0 Timeouts)
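
      For reference, a minimal sketch (not part of the original tester) of the kind of probe described above, assuming pymongo, a hypothetical sharded collection test.coll, and hypothetical host addresses. It issues one find per second through mongos and one directly against the source shard, and reports which side answers within a short timeout:

      # Probe sketch: one point query per second against mongos and against the
      # source mongod, reporting latency or failure for each side.
      import time
      from pymongo import MongoClient
      from pymongo.errors import PyMongoError

      MONGOS_URI = "mongodb://mongos-host:27017"        # hypothetical address
      SHARD_URI = "mongodb://source-shard-host:27018"   # hypothetical address

      mongos = MongoClient(MONGOS_URI, serverSelectionTimeoutMS=2000, socketTimeoutMS=2000)
      shard = MongoClient(SHARD_URI, serverSelectionTimeoutMS=2000, socketTimeoutMS=2000)

      def probe(client, label):
          # Run a single point query and report latency or the error raised.
          start = time.time()
          try:
              client.test.coll.find_one({"_id": 1})
              return "%s ok (%.0f ms)" % (label, (time.time() - start) * 1000)
          except PyMongoError as exc:
              return "%s FAILED: %s" % (label, exc)

      while True:
          print(probe(mongos, "mongos"), "|", probe(shard, "mongod"))
          time.sleep(1)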
      

      Only the collection being balanced is affected during the freeze, and we see nothing unusual in the logs of either mongos or the source mongod when it happens. Secondary throttling does not help.
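
      For context, the secondary throttle we tried is configured through the balancer document in config.settings. A hedged sketch of toggling it from pymongo (the mongos URI is an assumption; config.settings is the cluster metadata location):

      from pymongo import MongoClient

      client = MongoClient("mongodb://mongos-host:27017")  # hypothetical address
      settings = client.config.settings  # balancer settings document lives here

      # Ask migrations to wait for a secondary to acknowledge each batch:
      settings.update_one({"_id": "balancer"},
                          {"$set": {"_secondaryThrottle": True}},
                          upsert=True)

      # ...or disable the throttle again:
      settings.update_one({"_id": "balancer"},
                          {"$set": {"_secondaryThrottle": False}},
                          upsert=True)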

      This renders our cluster unusable during balancing, so for now our balancer is stopped. Any advice on the source of this issue and how we can address it would be extremely helpful. Thanks.
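
      For completeness, a sketch of the workaround of keeping the balancer stopped. This mirrors what sh.stopBalancer()/sh.startBalancer() do on 3.2 by flipping the stopped flag in config.settings, though unlike sh.stopBalancer() it does not wait for an in-flight migration to finish (the mongos URI is an assumption):

      from pymongo import MongoClient

      client = MongoClient("mongodb://mongos-host:27017")  # hypothetical address
      settings = client.config.settings

      def set_balancer(enabled):
          # The balancer only schedules migrations while "stopped" is false/absent.
          settings.update_one({"_id": "balancer"},
                              {"$set": {"stopped": not enabled}},
                              upsert=True)

      set_balancer(False)   # the workaround used in this report: keep balancing off
      # set_balancer(True)  # re-enable after upgrading to a release with the fix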
