  Core Server / SERVER-27393

Balancer taking 100% CPU due to large number of dropped sharded collections


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.2, 3.5.2
    • Component/s: Sharding
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v3.4
    • Sprint:
      Sharding 2017-01-02

      Description

      The balancer started to take 100% CPU after upgrading from 3.2.9 to 3.4.0 and enabling the balancer. This is a cluster with 4 shards (rs1, rs2, rs3 and rs4); before upgrading we removed one shard (rs5) and waited until the drain completed.

      In the log I can see warnings like these for several collections:

      2016-12-13T09:52:11.211+0000 W SHARDING [Balancer] Unable to enforce tag range policy for collection eplus.wifiCollection_20161008 :: caused by :: Location10181: not sharded:eplus.wifiCollection_20161008
      2016-12-13T09:52:13.087+0000 W SHARDING [Balancer] Unable to enforce tag range policy for collection eplus.wifiCollection_20161009 :: caused by :: Location10181: not sharded:eplus.wifiCollection_20161009
      2016-12-13T09:53:38.583+0000 W SHARDING [Balancer] Unable to balance collection eplus.wifiCollection_20161008 :: caused by :: Location10181: not sharded:eplus.wifiCollection_20161008
      2016-12-13T09:53:40.360+0000 W SHARDING [Balancer] Unable to balance collection eplus.wifiCollection_20161009 :: caused by :: Location10181: not sharded:eplus.wifiCollection_20161009
      

      These collections are created and then dropped after a few days, and indeed the collections above were dropped and no longer appear in "db.getCollectionNames()".

      I investigated a bit and found those collections still present in the config database:

      { "_id" : "eplus.wifiCollection_20161008", "lastmodEpoch" : ObjectId("000000000000000000000000"), "lastmod" : ISODate("2016-10-18T04:00:13.108Z"), "dropped" : true }
      { "_id" : "eplus.wifiCollection_20161009", "lastmodEpoch" : ObjectId("000000000000000000000000"), "lastmod" : ISODate("2016-10-19T04:00:48.158Z"), "dropped" : true }
      
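      For reference, the dropped-collection entries that the sharding catalog still holds can be listed with a query along these lines (a minimal mongo shell sketch run against a mongos; the eplus prefix is taken from the documents above):

      // Metadata entries for collections that were dropped but are still tracked in config.collections
      db.getSiblingDB("config").collections.find({ dropped: true, _id: /^eplus\./ })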

      And there are locks for many collections related to the removed shard (rs5):

      { "_id" : "eplus.wifiCollection_20160908", "state" : 0, "ts" : ObjectId("5837ee01c839440f1e70d384"), "who" : "wifi-db-05a:27018:1475838481:-1701389523:conn104", "process" : "wifi-db-05a:27018:1475838481:-1701389523", "when" : ISODate("2016-11-25T07:53:37.235Z"), "why" : "migrating chunk [{ lineId: 8915926302292949940 }, { lineId: MaxKey }) in eplus.wifiCollection_20160908" }
      { "_id" : "eplus.wifiCollection_20160909", "state" : 0, "ts" : ObjectId("5837ee01c839440f1e70d38b"), "who" : "wifi-db-05a:27018:1475838481:-1701389523:conn104", "process" : "wifi-db-05a:27018:1475838481:-1701389523", "when" : ISODate("2016-11-25T07:53:37.296Z"), "why" : "migrating chunk [{ lineId: 8915926302292949940 }, { lineId: MaxKey }) in eplus.wifiCollection_20160909" }
      
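      The lock documents left behind by the removed shard can be listed in a similar way (a sketch; the wifi-db-05a host in the "process" field above is assumed to belong to rs5):

      // Distributed lock entries whose owning process is the removed shard's host
      db.getSiblingDB("config").locks.find({ process: /^wifi-db-05a:/ })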

      There are locks not only for dropped collections but also for existing ones. Our guess is that this is causing the balancer to continuously loop over all these collections, driving the CPU to 100%, but we are not sure how to work around it.
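
      To get a rough idea of how many entries the balancer has to walk on every round, counting them might help (a sketch under the same assumptions as above, not a confirmed diagnosis):

      // Number of dropped collections still present in the sharding catalog
      db.getSiblingDB("config").collections.count({ dropped: true })
      // Number of lock documents still referencing the removed shard's process
      db.getSiblingDB("config").locks.count({ process: /^wifi-db-05a:/ })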


              People

              Assignee:
              nathan.myers Nathan Myers
              Reporter:
              icruz Isaac Cruz
              Votes:
              0
              Watchers:
              9
