  Core Server / SERVER-26987

Sharded find operations stall for several seconds during chunk migration


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 3.2.11
    • Fix Version/s: 3.2.12
    • Component/s: Sharding
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Sprint: Sharding 2017-01-02

    Description

      Note: this is different from SERVER-22611.

      WiredTiger (WT) nodes, running 3.2.10. During chunk migration we notice severe drops in mongos query throughput (from 500 req/sec to 0), rendering the cluster unusable for several seconds. This happens while the chunk is being read from the source node, at exactly every 16 MB of reads (we see a surge of disk reads at that point). Using a tester that queries both mongos and the source mongod while the freeze is happening, we noticed that only mongos is frozen, while the source mongod stays responsive. To rule out a hardware bottleneck, we used very powerful machines (in terms of CPU and disk IOPS) for the mongos, config servers, and mongods.

      This is what a drop looks like (every line represents one second):

      444 Requests, 0 Failed (0 Timeouts)
      574 Requests, 0 Failed (0 Timeouts)
      516 Requests, 0 Failed (0 Timeouts)
      512 Requests, 0 Failed (0 Timeouts)
      463 Requests, 0 Failed (0 Timeouts)
      212 Requests, 0 Failed (0 Timeouts)
      28 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      203 Requests, 0 Failed (0 Timeouts)
      66 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      221 Requests, 0 Failed (0 Timeouts)
      69 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      320 Requests, 0 Failed (0 Timeouts)
      7 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      16 Requests, 0 Failed (0 Timeouts)
      98 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      0 Requests, 0 Failed (0 Timeouts)
      313 Requests, 0 Failed (0 Timeouts)
      783 Requests, 0 Failed (0 Timeouts)
      355 Requests, 0 Failed (0 Timeouts)
      682 Requests, 0 Failed (0 Timeouts)
      
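      For reference, the following is a minimal sketch (not the reporter's actual tool) of the kind of per-second probe described above, written with pymongo; the hostnames, database name, and collection name are hypothetical stand-ins. It issues find_one() calls against both the mongos and the source mongod for one second at a time and prints counters in the same shape as the trace above:

      import time
      from pymongo import MongoClient
      from pymongo.errors import ExecutionTimeout, PyMongoError

      def probe(coll, seconds=1.0, max_time_ms=500):
          """Issue find_one() calls against `coll` for `seconds` and count outcomes."""
          requests = failed = timeouts = 0
          deadline = time.monotonic() + seconds
          while time.monotonic() < deadline:
              requests += 1
              try:
                  coll.find_one({}, max_time_ms=max_time_ms)
              except ExecutionTimeout:
                  failed += 1
                  timeouts += 1
              except PyMongoError:
                  failed += 1
          return requests, failed, timeouts

      # Hypothetical endpoints; the real cluster topology is not given in the report.
      mongos = MongoClient("mongodb://mongos.example.net:27017")
      source_shard = MongoClient("mongodb://shard0.example.net:27018")

      while True:
          for label, client in (("mongos", mongos), ("source mongod", source_shard)):
              coll = client["testdb"]["balanced_collection"]
              r, f, t = probe(coll)
              print("%s: %d Requests, %d Failed (%d Timeouts)" % (label, r, f, t))

      Comparing the two counters per second is what distinguishes a mongos-only freeze from a shard-wide one.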

      Only the collection being balanced is affected during the freeze, and we see nothing special in the logs of either mongos or the source mongod when this happens. Secondary throttling does not help.

      This renders our cluster unusable during balancing, so our balancer is currently stopped. Any advice on the source of this issue and how we can address it would be extremely helpful. Thanks.
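      For reference, this is roughly how the secondary throttle and balancer state mentioned above can be toggled from a driver. The sketch uses pymongo against a hypothetical mongos; on 3.2 both settings are stored in the config.settings collection (sh.setBalancerState(false) and the balancer's _secondaryThrottle option do the same from the mongo shell):

      from pymongo import MongoClient

      # Hypothetical mongos address.
      client = MongoClient("mongodb://mongos.example.net:27017")
      settings = client["config"]["settings"]

      # Enable the secondary throttle: migrations wait for secondaries per document.
      settings.update_one(
          {"_id": "balancer"},
          {"$set": {"_secondaryThrottle": True}},
          upsert=True,
      )

      # Disable the balancer entirely, the workaround described above.
      settings.update_one(
          {"_id": "balancer"},
          {"$set": {"stopped": True}},
          upsert=True,
      )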


          People

            Assignee: Nathan Myers (nathan.myers)
            Reporter: Yoni Douek (yonido)
            Votes: 0
            Watchers: 16
