Core Server / SERVER-57846

Balancer hanging

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Done
    • Affects Version/s: 4.4.5
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Operating System:
      ALL

      Description

      I have a sharded cluster and the balancer seems to hang. Several collections are unbalanced:

      db.getSiblingDB("config").chunks.aggregate([
         { $match: { ns: { $nin: ["config.system.sessions"] } } },
         { $group: { _id: { shard: "$shard", ns: "$ns" }, chunks: { $sum: 1 } } },
         { $group: { _id: "$_id.ns", data: { $push: { k: "$_id.shard", v: "$chunks" } } } },
         { $replaceRoot: { newRoot: { $mergeObjects: [{ $arrayToObject: "$data" }, { ns: "$_id" }] } } },
         { $sort: { ns: 1 } }
      ])
       
      { "shard_03" : 1794, "shard_02" : 1794, "shard_01" : 1794, "shard_04" : 1794, "ns" : "data.sessions.20210606" }
      { "shard_03" : 1509, "shard_04" : 1508, "shard_02" : 1508, "shard_01" : 1508, "ns" : "data.sessions.20210607" }
      { "shard_04" : 1912, "shard_03" : 1911, "shard_02" : 1912, "shard_01" : 1911, "ns" : "data.sessions.20210608" }
      { "shard_03" : 2019, "shard_04" : 2019, "shard_01" : 2019, "shard_02" : 2018, "ns" : "data.sessions.20210609" }
      { "shard_01" : 1977, "shard_03" : 1977, "shard_04" : 1977, "shard_02" : 1977, "ns" : "data.sessions.20210610" }
      { "shard_03" : 1300, "shard_01" : 1300, "shard_04" : 1300, "shard_02" : 1299, "ns" : "data.sessions.20210611" }
      { "shard_02" : 1841, "shard_03" : 1840, "shard_04" : 1841, "shard_01" : 1841, "ns" : "data.sessions.20210612" }
      { "shard_04" : 2030, "shard_01" : 2029, "shard_03" : 2029, "shard_02" : 2030, "ns" : "data.sessions.20210613" }
      { "shard_02" : 1496, "shard_04" : 2273, "shard_01" : 2484, "shard_03" : 1708, "ns" : "data.sessions.20210615" }
      { "shard_03" : 2841, "shard_04" : 1179, "shard_01" : 2366, "shard_02" : 1333, "ns" : "data.sessions.20210616" }
      { "shard_01" : 8156, "ns" : "data.sessions.20210617" }
      { "shard_01" : 2967, "ns" : "data.sessions.20210618" }
      { "shard_01" : 10, "ns" : "data.sessions.20210619" }
      { "shard_01" : 10, "ns" : "data.sessions.20210620" }
      { "shard_01" : 10, "ns" : "data.sessions.20210621" }
      { "shard_01" : 224, "shard_04" : 199, "shard_02" : 1170, "shard_03" : 332, "ns" : "ignored.sessions.20210615" }
      { "shard_02" : 1148, "shard_04" : 315, "shard_01" : 218, "shard_03" : 237, "ns" : "ignored.sessions.20210616" }
      { "shard_02" : 1950, "ns" : "ignored.sessions.20210617" }
      { "shard_04" : 1, "shard_02" : 845, "ns" : "ignored.sessions.20210618" }
      { "shard_02" : 10, "ns" : "ignored.sessions.20210619" }
      { "shard_02" : 10, "ns" : "ignored.sessions.20210620" }
      { "shard_02" : 10, "ns" : "ignored.sessions.20210621" }
      { "shard_02" : 139, "shard_01" : 134, "shard_04" : 127, "shard_03" : 128, "ns" : "mip.statistics" }
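To make "unbalanced" concrete, the per-namespace counts above can be compared against the balancer's migration thresholds. The following is a minimal sketch, not server code: `migrationThreshold`, `chunkSpread` and `unbalanced` are hypothetical helper names, and the thresholds (2, 4, 8 depending on the total chunk count) are assumed from the MongoDB 4.4 documentation. The sample documents are copied from the output above.

```javascript
// Hypothetical helpers, not part of MongoDB or of the original report.

// Assumed 4.4 migration thresholds: the balancer acts when the difference
// between the fullest and emptiest shard reaches this value.
function migrationThreshold(totalChunks) {
  if (totalChunks < 20) return 2;
  if (totalChunks < 80) return 4;
  return 8;
}

// Difference between the largest and smallest per-shard chunk count.
// A shard missing from the document owns zero chunks of that namespace.
function chunkSpread(doc, shardNames) {
  const counts = shardNames.map((s) => doc[s] || 0);
  return Math.max(...counts) - Math.min(...counts);
}

// Namespaces whose spread reaches the assumed threshold.
function unbalanced(docs, shardNames) {
  return docs
    .map((d) => {
      const total = shardNames.reduce((sum, s) => sum + (d[s] || 0), 0);
      return { ns: d.ns, spread: chunkSpread(d, shardNames), threshold: migrationThreshold(total) };
    })
    .filter((r) => r.spread >= r.threshold);
}

const shardNames = ["shard_01", "shard_02", "shard_03", "shard_04"];
const sample = [
  { shard_03: 1794, shard_02: 1794, shard_01: 1794, shard_04: 1794, ns: "data.sessions.20210606" },
  { shard_01: 8156, ns: "data.sessions.20210617" },
  { shard_02: 139, shard_01: 134, shard_04: 127, shard_03: 128, ns: "mip.statistics" },
];
const result = unbalanced(sample, shardNames);
```

Under these assumptions `result` lists data.sessions.20210617 (all chunks on one shard) and mip.statistics (spread of 12), while data.sessions.20210606 is considered balanced.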
      

      The sharding status is below. MongoDB appears to hang while balancing the collection "mip.statistics":

      sh.status()
      --- Sharding Status ---
        sharding version: {
        	"_id" : 1,
        	"minCompatibleVersion" : 5,
        	"currentVersion" : 6,
        	"clusterId" : ObjectId("608864f0e8dcb6218857ab2d")
        }
        shards:
              {  "_id" : "shard_01",  "host" : "shard_01/d-mipmdb-sh1-01.swi.srse.net:27018,d-mipmdb-sh2-01.swi.srse.net:27018",  "state" : 1,  "tags" : [ ] }
              {  "_id" : "shard_02",  "host" : "shard_02/d-mipmdb-sh1-02.swi.srse.net:27018,d-mipmdb-sh2-02.swi.srse.net:27018",  "state" : 1,  "tags" : [ ] }
              {  "_id" : "shard_03",  "host" : "shard_03/d-mipmdb-sh1-03.swi.srse.net:27018,d-mipmdb-sh2-03.swi.srse.net:27018",  "state" : 1,  "tags" : [ ] }
              {  "_id" : "shard_04",  "host" : "shard_04/d-mipmdb-sh1-04.swi.srse.net:27018,d-mipmdb-sh2-04.swi.srse.net:27018",  "state" : 1,  "tags" : [ ] }
        active mongoses:
              "4.4.3" : 16
              "4.4.5" : 2
        autosplit:
              Currently enabled: yes
        balancer:
              Currently enabled:  yes
              Currently running:  no
                      Balancer active window is set between 02:10 and 01:50 server local time
              Collections with active migrations: 
                      mip.statistics started at Fri Jun 18 2021 11:13:38 GMT+0200 (W. Europe Daylight Time)
              Failed balancer rounds in last 5 attempts:  5
              Last reported error:  Could not find host matching read preference { mode: "primary" } for set shard_01
              Time of Reported error:  Fri Jun 18 2021 10:24:32 GMT+0200 (W. Europe Daylight Time)
              Migration Results for the last 24 hours: 
                      3 : Success
                      1 : Failed with error 'aborted', from shard_02 to shard_04
        databases: 
       
             {  "_id" : "mip",  "primary" : "shard_01",  "partitioned" : true,  "version" : {  "uuid" : UUID("4c4d4777-1a9e-4fd8-9b73-f579e1b5a83a"),  "lastMod" : 1 } }
                      mip.statistics
                              shard key: { "ts" : "hashed" }
                              unique: false
                              balancing: true
                              chunks:
                                      shard_01	134
                                      shard_02	139
                                      shard_03	128
                                      shard_04	127
                              too many chunks to print, use verbose if you want to force print
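The active window reported above runs from 02:10 to 01:50, so it wraps past midnight and excludes only the 20 minutes between 01:50 and 02:10. The sketch below shows one way to test whether a local time falls inside such a window. `toMinutes` and `inWindow` are hypothetical helpers, the treatment of the boundaries (start inclusive, stop exclusive) is an assumption, and this is not how the server itself evaluates the window.

```javascript
// Hypothetical helpers, not part of MongoDB.

// Convert "HH:MM" to minutes since midnight.
function toMinutes(hhmm) {
  const [h, m] = hhmm.split(":").map(Number);
  return h * 60 + m;
}

// True if `now` lies in [start, stop). When start is later than stop,
// the window is assumed to wrap past midnight.
function inWindow(now, start, stop) {
  const t = toMinutes(now);
  const a = toMinutes(start);
  const b = toMinutes(stop);
  return a <= b ? t >= a && t < b : t >= a || t < b;
}

// The migration of mip.statistics started at 11:13 server local time.
const migrationStartInside = inWindow("11:13", "02:10", "01:50");
// 02:00 falls into the 20-minute gap.
const gapInside = inWindow("02:00", "02:10", "01:50");
```

With these inputs `migrationStartInside` is `true` and `gapInside` is `false`, so the window setting alone would not explain a migration stalled since 11:13.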
      
      

      As a quick workaround I tried to drop the offending collection, without success:

      db.statistics.drop()
       
      Error: drop failed: {
      	"ok" : 0,
      	"errmsg" : "timed out waiting for mip.statistics",
      	"code" : 46,
      	"codeName" : "LockBusy",
      	"operationTime" : Timestamp(1624017860, 1),
      	"$clusterTime" : {
      		"clusterTime" : Timestamp(1624017860, 1),
      		"signature" : {
      			"hash" : BinData(0,"lkY1zw1m1Zv/rqUaVsAWCkGrzjI="),
      			"keyId" : NumberLong("6955920606428659733")
      		}
      	}
      } :
      _getErrorWithCode@src/mongo/shell/utils.js:25:13
      DBCollection.prototype.drop@src/mongo/shell/collection.js:713:15
      @(shell):1:1
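The LockBusy error and the "active migration" on mip.statistics are both consistent with a distributed lock on the namespace that is still marked as held. The sketch below is one way to inspect that, under two assumptions that are not confirmed by this report: that 4.4 records these locks in config.locks, and that state 2 means "held". `heldLocks` is a hypothetical helper and the example documents are illustrative, not taken from the reporter's cluster.

```javascript
// Hypothetical helper, not part of MongoDB.
// Assumption: in config.locks, state 2 marks a held lock and state 0 a free one.
function heldLocks(lockDocs) {
  return lockDocs
    .filter((l) => l.state === 2)
    .map((l) => ({ ns: l._id, since: l.when }));
}

// Illustrative documents only.
const exampleLocks = [
  { _id: "mip.statistics", state: 2, when: new Date("2021-06-18T09:13:38Z") },
  { _id: "data.sessions.20210606", state: 0, when: new Date("2021-06-10T00:00:00Z") },
];
const held = heldLocks(exampleLocks);

// In a mongo shell connected to a mongos, the input could instead come from:
//   heldLocks(db.getSiblingDB("config").locks.find().toArray())
```

This only reads the lock documents; it does not release or modify anything.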
      

      I can insert and delete data in this collection and drop and create indexes, but I cannot drop the collection itself.

      I also stopped and restarted the balancer, without success.

      I even restarted the entire sharded cluster, also without success.

       

       

       

       

            People

            Assignee:
            eric.sedor Eric Sedor
            Reporter:
            wernfried.domscheit@sunrise.net Wernfried Domscheit