[SERVER-57846] Balancer hanging Created: 18/Jun/21  Updated: 29/Jun/21  Resolved: 29/Jun/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Wernfried Domscheit Assignee: Eric Sedor
Resolution: Done Votes: 0
Labels: sharding
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

I have a Sharded Cluster and the Balancer seems to hang; I have several unbalanced collections:

db.getSiblingDB("config").chunks.aggregate([
   { $match: { ns: { $nin: ["config.system.sessions"] } } },
   { $group: { _id: { shard: "$shard", ns: "$ns" }, chunks: { $sum: 1 } } },
   { $group: { _id: "$_id.ns", data: { $push: { k: "$_id.shard", v: "$chunks" } } } },
   { $replaceRoot: { newRoot: { $mergeObjects: [{ $arrayToObject: "$data" }, { ns: "$_id" }] } } },
   { $sort: { ns: 1 } }
])
 
{ "shard_03" : 1794, "shard_02" : 1794, "shard_01" : 1794, "shard_04" : 1794, "ns" : "data.sessions.20210606" }
{ "shard_03" : 1509, "shard_04" : 1508, "shard_02" : 1508, "shard_01" : 1508, "ns" : "data.sessions.20210607" }
{ "shard_04" : 1912, "shard_03" : 1911, "shard_02" : 1912, "shard_01" : 1911, "ns" : "data.sessions.20210608" }
{ "shard_03" : 2019, "shard_04" : 2019, "shard_01" : 2019, "shard_02" : 2018, "ns" : "data.sessions.20210609" }
{ "shard_01" : 1977, "shard_03" : 1977, "shard_04" : 1977, "shard_02" : 1977, "ns" : "data.sessions.20210610" }
{ "shard_03" : 1300, "shard_01" : 1300, "shard_04" : 1300, "shard_02" : 1299, "ns" : "data.sessions.20210611" }
{ "shard_02" : 1841, "shard_03" : 1840, "shard_04" : 1841, "shard_01" : 1841, "ns" : "data.sessions.20210612" }
{ "shard_04" : 2030, "shard_01" : 2029, "shard_03" : 2029, "shard_02" : 2030, "ns" : "data.sessions.20210613" }
{ "shard_02" : 1496, "shard_04" : 2273, "shard_01" : 2484, "shard_03" : 1708, "ns" : "data.sessions.20210615" }
{ "shard_03" : 2841, "shard_04" : 1179, "shard_01" : 2366, "shard_02" : 1333, "ns" : "data.sessions.20210616" }
{ "shard_01" : 8156, "ns" : "data.sessions.20210617" }
{ "shard_01" : 2967, "ns" : "data.sessions.20210618" }
{ "shard_01" : 10, "ns" : "data.sessions.20210619" }
{ "shard_01" : 10, "ns" : "data.sessions.20210620" }
{ "shard_01" : 10, "ns" : "data.sessions.20210621" }
{ "shard_01" : 224, "shard_04" : 199, "shard_02" : 1170, "shard_03" : 332, "ns" : "ignored.sessions.20210615" }
{ "shard_02" : 1148, "shard_04" : 315, "shard_01" : 218, "shard_03" : 237, "ns" : "ignored.sessions.20210616" }
{ "shard_02" : 1950, "ns" : "ignored.sessions.20210617" }
{ "shard_04" : 1, "shard_02" : 845, "ns" : "ignored.sessions.20210618" }
{ "shard_02" : 10, "ns" : "ignored.sessions.20210619" }
{ "shard_02" : 10, "ns" : "ignored.sessions.20210620" }
{ "shard_02" : 10, "ns" : "ignored.sessions.20210621" }
{ "shard_02" : 139, "shard_01" : 134, "shard_04" : 127, "shard_03" : 128, "ns" : "mip.statistics" }

Sharding status is as follows; apparently MongoDB hangs while balancing the collection "mip.statistics":

sh.status()

--- Sharding Status ---
  sharding version: {
  	"_id" : 1,
  	"minCompatibleVersion" : 5,
  	"currentVersion" : 6,
  	"clusterId" : ObjectId("608864f0e8dcb6218857ab2d")
  }
  shards:
        {  "_id" : "shard_01",  "host" : "shard_01/d-mipmdb-sh1-01.swi.srse.net:27018,d-mipmdb-sh2-01.swi.srse.net:27018",  "state" : 1,  "tags" : [ ] }
        {  "_id" : "shard_02",  "host" : "shard_02/d-mipmdb-sh1-02.swi.srse.net:27018,d-mipmdb-sh2-02.swi.srse.net:27018",  "state" : 1,  "tags" : [ ] }
        {  "_id" : "shard_03",  "host" : "shard_03/d-mipmdb-sh1-03.swi.srse.net:27018,d-mipmdb-sh2-03.swi.srse.net:27018",  "state" : 1,  "tags" : [ ] }
        {  "_id" : "shard_04",  "host" : "shard_04/d-mipmdb-sh1-04.swi.srse.net:27018,d-mipmdb-sh2-04.swi.srse.net:27018",  "state" : 1,  "tags" : [ ] }
  active mongoses:
        "4.4.3" : 16
        "4.4.5" : 2
  autosplit:
        Currently enabled: yes
  balancer:
        Currently enabled:  yes
        Currently running:  no
                Balancer active window is set between 02:10 and 01:50 server local time
        Collections with active migrations: 
                mip.statistics started at Fri Jun 18 2021 11:13:38 GMT+0200 (W. Europe Daylight Time)
        Failed balancer rounds in last 5 attempts:  5
        Last reported error:  Could not find host matching read preference { mode: "primary" } for set shard_01
        Time of Reported error:  Fri Jun 18 2021 10:24:32 GMT+0200 (W. Europe Daylight Time)
        Migration Results for the last 24 hours: 
                3 : Success
                1 : Failed with error 'aborted', from shard_02 to shard_04
  databases: 
 
       {  "_id" : "mip",  "primary" : "shard_01",  "partitioned" : true,  "version" : {  "uuid" : UUID("4c4d4777-1a9e-4fd8-9b73-f579e1b5a83a"),  "lastMod" : 1 } }
                mip.statistics
                        shard key: { "ts" : "hashed" }
                        unique: false
                        balancing: true
                        chunks:
                                shard_01	134
                                shard_02	139
                                shard_03	128
                                shard_04	127
                        too many chunks to print, use verbose if you want to force print

As a quick workaround I tried to drop the offending collection, but without success:

db.statistics.drop()
 
Error: drop failed: {
	"ok" : 0,
	"errmsg" : "timed out waiting for mip.statistics",
	"code" : 46,
	"codeName" : "LockBusy",
	"operationTime" : Timestamp(1624017860, 1),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1624017860, 1),
		"signature" : {
			"hash" : BinData(0,"lkY1zw1m1Zv/rqUaVsAWCkGrzjI="),
			"keyId" : NumberLong("6955920606428659733")
		}
	}
} :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
DBCollection.prototype.drop@src/mongo/shell/collection.js:713:15
@(shell):1:1

I can insert and delete data in this collection and create or drop indexes, but dropping the collection itself is not possible.
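
For reference, the distributed lock and any recorded migration for this namespace can be inspected in the sharding metadata. A minimal sketch, run against a mongos (config.locks and config.migrations are the standard metadata collections; the namespace is taken from above):

// Who currently holds the distributed lock on the collection?
db.getSiblingDB("config").locks.find({ _id: "mip.statistics" }).pretty()
// Is there a migration recorded for this namespace?
db.getSiblingDB("config").migrations.find({ ns: "mip.statistics" }).pretty()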

I also stopped/started the Balancer - no success
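
(Roughly like this, using the standard shell helpers; a sketch rather than the exact commands:)

sh.stopBalancer()        // disable the balancer
sh.getBalancerState()    // should now report false
sh.startBalancer()       // re-enable it
sh.isBalancerRunning()   // is a balancing round currently in progress?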

I even restarted the entire Sharded Cluster - no success either

 Comments   
Comment by Eric Sedor [ 29/Jun/21 ]

Understood wernfried.domscheit@sunrise.net; I'll close this ticket but if this or a similar issue occurs again we'll be happy to take another look.

Sincerely,
Eric

Comment by Wernfried Domscheit [ 29/Jun/21 ]

The new deployment is running fine, and the old one was running fine for months.

One reason could be our backup: the hosts are all virtual, and once a night they go offline for 1-2 seconds for a filesystem snapshot. I have set the balancer activeWindow accordingly; if this issue was caused by our backup, it should not occur again.
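
For reference, the activeWindow was set with the usual config.settings update (the start/stop values match the window shown in sh.status() above):

// Run against a mongos: restrict balancing to the window outside the snapshot time.
db.getSiblingDB("config").settings.update(
   { _id: "balancer" },
   { $set: { activeWindow: { start: "02:10", stop: "01:50" } } },
   { upsert: true }
)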

 

Comment by Eric Sedor [ 28/Jun/21 ]

Hi wernfried.domscheit@sunrise.net,

From the logs for this config server it looks like one balancer failure was due to an issue with shard_01:

{"t":{"$date":"2021-06-18T08:22:32.093Z"},"s":"I","c":"SHARDING","id":21865,"ctx":"Balancer","msg":"Error while doing balance","attr":{"error":{"code":133,"codeName":"FailedToSatisfyReadPreference","errmsg":"Could not find host matching read preference { mode: \"primary\" } for set shard_01"}}}

Two were caused by a failover on the config server replica set (note the InterruptedDueToReplStateChange error):

{"t":{"$date":"2021-06-18T08:24:51.659Z"},"s":"W","c":"SHARDING","id":22081,"ctx":"Balancer","msg":"Error encountered while logging config change","attr":{"changeId":"d-mipmdb-cfg-01.swi.srse.net:27019-2021-06-18T10:24:51.656+02:00-60cc5853dc973a5246302c17","namespace":"actionlog","error":"InterruptedDueToReplStateChange: operation was interrupted"}}
...
{"t":{"$date":"2021-06-18T09:10:56.920Z"},"s":"I","c":"SHARDING","id":21872,"ctx":"Balancer","msg":"Migration failed","attr":{"migrateInfo":"mip.statistics: [{ ts: -9104079724831202657 }, { ts: -9099717584793900739 }), from shard_01, to shard_03","error":"BalancerInterrupted: Migration interrupted because the balancer is stopping. Command status: CallbackCanceled: Command canceled; original request was: RemoteCommand 129 -- target:[d-mipmdb-sh2-01.swi.srse.net:27018] db:admin cmd:{ moveChunk: \"mip.statistics\", shardVersion: [ Timestamp(176, 0), ObjectId('60886530cf1f88b1784cb7c0') ], epoch: ObjectId('60886530cf1f88b1784cb7c0'), configdb: \"configRepSet/d-mipmdb-cfg-01.swi.srse.net:27019,d-mipmdb-cfg-02.swi.srse.net:27019,d-mipmdb-cfg-03.swi.srse.net:27019\", fromShard: \"shard_01\", toShard: \"shard_03\", min: { ts: -9104079724831202657 }, max: { ts: -9099717584793900739 }, maxChunkSizeBytes: 67108864, waitForDelete: false, forceJumbo: 0, takeDistLock: false, writeConcern: {} }"}}
{"t":{"$date":"2021-06-18T09:10:56.920Z"},"s":"I","c":"SHARDING","id":22078,"ctx":"Balancer","msg":"Couldn't create config.actionlog collection","attr":{"error":{"code":11602,"codeName":"InterruptedDueToReplStateChange","errmsg":"operation was interrupted"}}}

I do see a window of time in these logs (from 2021-06-18T13:19:01.345Z to 2021-06-18T21:59:01.410Z) where I'm not sure why we don't see balancer activity, but it sounds like we won't be able to get more information to investigate this further.

Has this issue recurred on the new deployment?

Sincerely,
Eric

Comment by Wernfried Domscheit [ 24/Jun/21 ]

Hi Eric

I uploaded the logfile.

{
	"total_count": 1,
	"entries": [
		{
			"type": "file",
			"id": "825700467244",
			"name": "mongod.log.6.gz",
			"description": "",
			"size": 2207372,

Sorry, the diagnostic file is not available anymore. This mongo deployment is still a "proof of concept", so I wiped it and deployed a new one.

Next time I will make a copy of the diagnostic path.

Best Regards
Wernfried

 

Comment by Eric Sedor [ 23/Jun/21 ]

Hi wernfried.domscheit@sunrise.net,

We'd like to understand what the balancer is doing. Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) for the primary node of the config server replica set?

Then, please upload that archive to this portal. Files uploaded here are only visible to MongoDB employees and are routinely deleted after some time.

Thank you,
Eric

Comment by Wernfried Domscheit [ 21/Jun/21 ]

By using

db.currentOp()
db.killOp()

I managed to drop the collection. However, the Balancer then turns to the next collection, attempts to balance it, and hangs again.
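
Roughly along these lines (the currentOp filter is only illustrative; the actual filter depends on what the stuck operation looks like):

// Find the operation(s) blocking the namespace ...
db.currentOp({ active: true, "command.moveChunk": { $exists: true } }).inprog.forEach(
    function (op) { printjson({ opid: op.opid, command: op.command }); }
)
// ... then kill the offending one by its opid (substitute the value printed above)
// db.killOp(<opid>)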

Best Regards
Wernfried

Comment by Wernfried Domscheit [ 18/Jun/21 ]

Some more information:

 

db.statistics.getShardDistribution()
 
Shard shard_01 at shard_01/d-mipmdb-sh1-01.swi.srse.net:27018,d-mipmdb-sh2-01.swi.srse.net:27018
 data : 2.83MiB docs : 64 chunks : 134
 estimated data per chunk : 21KiB
 estimated docs per chunk : 0

Shard shard_02 at shard_02/d-mipmdb-sh1-02.swi.srse.net:27018,d-mipmdb-sh2-02.swi.srse.net:27018
 data : 14.15MiB docs : 320 chunks : 139
 estimated data per chunk : 104KiB
 estimated docs per chunk : 2

Shard shard_04 at shard_04/d-mipmdb-sh1-04.swi.srse.net:27018,d-mipmdb-sh2-04.swi.srse.net:27018
 data : 5.66MiB docs : 128 chunks : 127
 estimated data per chunk : 45KiB
 estimated docs per chunk : 1

Shard shard_03 at shard_03/d-mipmdb-sh1-03.swi.srse.net:27018,d-mipmdb-sh2-03.swi.srse.net:27018
 data : 11.32MiB docs : 256 chunks : 128
 estimated data per chunk : 90KiB
 estimated docs per chunk : 2

Totals
 data : 33.96MiB docs : 768 chunks : 528
 
 Shard shard_01 contains 8.33% data, 8.33% docs in cluster, avg obj size on shard : 45KiB
 Shard shard_02 contains 41.66% data, 41.66% docs in cluster, avg obj size on shard : 45KiB
 Shard shard_04 contains 16.66% data, 16.66% docs in cluster, avg obj size on shard : 45KiB
 Shard shard_03 contains 33.33% data, 33.33% docs in cluster, avg obj size on shard : 45KiB
 
sh.balancerCollectionStatus("mip.statistics")
 
{ 
    "balancerCompliant" : false, 
    "firstComplianceViolation" : "chunksImbalance", 
    "ok" : 1.0, 
    "operationTime" : Timestamp(1624018857, 1001), 
    "$clusterTime" : {
        "clusterTime" : Timestamp(1624018857, 46506), 
        "signature" : {
            "hash" : BinData(0, "2e7pmQLXCrAXEn2uuEMnIzKICWs="), 
            "keyId" : 6955920606428659733
        }
    }
}
