Core Server / SERVER-26987

Sharded find operations stall for several seconds during chunk migration

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 3.2.12
    • Affects Version/s: 3.2.11
    • Component/s: Sharding
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Sprint: Sharding 2017-01-02

      Note: this is different from SERVER-22611.

      WiredTiger (WT) nodes, running 3.2.10. During chunk migration we notice severe drops in mongos query throughput (from 500 req/sec to 0), rendering the cluster unusable for several seconds. This happens while the chunk is being read from the source node, exactly every 16 MB of reads (we see a surge of disk reads at that point). Using a tester that queries both mongos and the source mongod while the freeze is happening, we noticed that only mongos is frozen, while the source mongod remains responsive. To rule out a hardware bottleneck, we used very powerful machines (in terms of CPU and disk IOPS) for the mongos, config servers, and mongods.
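      For reference, a minimal sketch of the kind of per-second tester described above, assuming pymongo; the connection strings, namespace, and query below are hypothetical placeholders:

      # Sketch of a per-second load tester: count find_one() calls, failures,
      # and timeouts against a given endpoint for roughly one second at a time.
      import time
      import pymongo
      from pymongo import errors

      MONGOS_URI = "mongodb://mongos-host:27017"        # hypothetical mongos
      MONGOD_URI = "mongodb://source-shard-host:27018"  # hypothetical source mongod

      def run_for_one_second(coll):
          # Issue queries until the one-second window elapses and tally results.
          requests = failed = timeouts = 0
          deadline = time.time() + 1.0
          while time.time() < deadline:
              requests += 1
              try:
                  coll.find_one({"shard_key": 42})      # hypothetical query
              except errors.NetworkTimeout:
                  failed += 1
                  timeouts += 1
              except errors.PyMongoError:
                  failed += 1
          return requests, failed, timeouts

      if __name__ == "__main__":
          mongos = pymongo.MongoClient(MONGOS_URI, socketTimeoutMS=2000)
          mongod = pymongo.MongoClient(MONGOD_URI, socketTimeoutMS=2000)
          while True:
              for label, client in (("mongos", mongos), ("source mongod", mongod)):
                  r, f, t = run_for_one_second(client["mydb"]["mycoll"])
                  print("%s: %d Requests, %d Failed (%d Timeouts)" % (label, r, f, t))

      In the report above, only the mongos counters drop to zero during the freeze, while the counters for the direct connection to the source mongod remain steady.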

      This is what a drop looks like (each line represents one second):

      444 Requests, 0 Failed (0 Timeouts)
      574 Requests, 0 Failed (0 Timeouts)
      516 Requests, 0 Failed (0 Timeouts)
      512 Requests, 0 Failed (0 Timeouts)
      463 Requests, 0 Failed (0 Timeouts)
      212 Requests, 0 Failed (0 Timeouts)
      28 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      203 Requests, 0 Failed (0 Timeouts)
      66 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      221 Requests, 0 Failed (0 Timeouts)
      69 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      320 Requests, 0 Failed (0 Timeouts)
      7 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      16 Requests, 0 Failed (0 Timeouts)
      98 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      313 Requests, 0 Failed (0 Timeouts)
      783 Requests, 0 Failed (0 Timeouts)
      355 Requests, 0 Failed (0 Timeouts)
      682 Requests, 0 Failed (0 Timeouts)
      

      Only the balanced collection is affected during the freeze, and we see nothing special in the logs of either mongos or the source mongod when this happens. Secondary throttling does not help.
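      For context, secondary throttling refers to the balancer's _secondaryThrottle flag in config.settings; a minimal sketch of enabling it from pymongo (the connection string is a hypothetical placeholder):

      # Sketch: turning on the balancer's _secondaryThrottle flag in the
      # config database. Run against a mongos; the URI below is hypothetical.
      import pymongo

      client = pymongo.MongoClient("mongodb://mongos-host:27017")
      client.config.settings.update_one(
          {"_id": "balancer"},
          {"$set": {"_secondaryThrottle": True}},
          upsert=True,
      )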

      This renders our cluster unusable during balancing, so our balancer is currently stopped. Any advice on the source of this issue and how we can address it would be extremely helpful. Thanks.
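      For reference, on 3.2 the balancer is typically stopped from the mongo shell with sh.stopBalancer() or sh.setBalancerState(false), which sets stopped: true on the balancer document in config.settings; a minimal pymongo equivalent (the connection string is a hypothetical placeholder):

      # Sketch: disabling the balancer by setting stopped=true on its document
      # in config.settings, which is what sh.setBalancerState(false) does on 3.2.
      # Run against a mongos; the URI below is hypothetical.
      import pymongo

      client = pymongo.MongoClient("mongodb://mongos-host:27017")
      client.config.settings.update_one(
          {"_id": "balancer"},
          {"$set": {"stopped": True}},
          upsert=True,
      )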

            Assignee: Nathan Myers
            Reporter: Yoni Douek
            Votes: 0
            Watchers: 16
