[SERVER-34498] mongo balancer often won't start back up Created: 16/Apr/18  Updated: 27/Oct/23  Resolved: 16/Apr/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Matthew Kruse Assignee: Kaloian Manassiev
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-34504 sh.getBalancerState does not consider... Closed
Participants:

 Description   

sh.status() output

  active mongoses:
    "3.6.2" : 30
  balancer:
    Currently enabled:  yes
    Currently running:  no
    Failed balancer rounds in last 5 attempts:  0
    Migration Results for the last 24 hours:
        No recent migrations

No locks

mongos> db.locks.find({'_id': 'balancer'}).pretty()
mongos> 

The collection below is sharded, so the balancer ought to be doing something. It has done nothing in over 4 days, and I have multiple collections out of balance that it should be working on.

mongos> db.chunks.aggregate([ { $match: { ns: "some.collection" } }, { $group: { _id: "$shard", cnt: { $sum: 1 } } }, { $sort: { "_id": 1 } } ]);

{ "_id" : "rs0", "cnt" : 647 }
{ "_id" : "rs1", "cnt" : 648 }
{ "_id" : "rs2", "cnt" : 648 }
{ "_id" : "rs3", "cnt" : 648 }
{ "_id" : "rs4", "cnt" : 648 }
{ "_id" : "rs5", "cnt" : 648 }
{ "_id" : "rs6", "cnt" : 649 }
{ "_id" : "rs7", "cnt" : 648 }
{ "_id" : "rs8", "cnt" : 642 }
{ "_id" : "rs9", "cnt" : 591 }


mongos> printjson(sh.isBalancerRunning()); printjson(sh.isBalancerRunning()); new ISODate(); printjson(sh.startBalancer()); new ISODate(); db.version()
false
false
assert.soon failed, msg:Waited too long for lock balancer to change to state undefined
doassert@src/mongo/shell/assert.js:15:14
assert.soon@src/mongo/shell/assert.js:199:13
sh.waitForDLock@src/mongo/shell/utils_sh.js:194:1
sh.waitForBalancer@src/mongo/shell/utils_sh.js:287:9
sh.startBalancer@src/mongo/shell/utils_sh.js:164:5
@(shell):1:96

2018-04-16T10:19:51.289-0700 E QUERY [thread1] Error: assert.soon failed, msg:Waited too long for lock balancer to change to state undefined :
doassert@src/mongo/shell/assert.js:15:14
assert.soon@src/mongo/shell/assert.js:199:13
sh.waitForDLock@src/mongo/shell/utils_sh.js:194:1
sh.waitForBalancer@src/mongo/shell/utils_sh.js:287:9
sh.startBalancer@src/mongo/shell/utils_sh.js:164:5
@(shell):1:96

mongos> db.version()
3.6.2

 



 Comments   
Comment by Matthew Kruse [ 16/Apr/18 ]

Seeing chunks move now as well.

Comment by Matthew Kruse [ 16/Apr/18 ]

Ah that worked with a 3.6 shell, thanks.

Comment by Kaloian Manassiev [ 16/Apr/18 ]

From the call stack it looks like you are using an older shell. The 3.6 shell doesn't have any calls to waitForDLock and directly invokes the command: https://github.com/mongodb/mongo/blob/r3.6.2/src/mongo/shell/utils_sh.js#L169

Can you please try with the 3.6 shell?
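For reference, if a matching shell is not immediately at hand, the same operation can be issued directly against a mongos with the underlying commands; a sketch (the maxTimeMS value here is just an illustrative timeout):

db.adminCommand({ balancerStart: 1, maxTimeMS: 60000 });
db.adminCommand({ balancerStatus: 1 });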

Comment by Matthew Kruse [ 16/Apr/18 ]

I think it's more complicated than that. I've been running

sh.startBalancer()

It always returns this error

mongos> sh.startBalancer()
assert.soon failed, msg:Waited too long for lock balancer to change to state undefined
doassert@src/mongo/shell/assert.js:15:14
assert.soon@src/mongo/shell/assert.js:199:13
sh.waitForDLock@src/mongo/shell/utils_sh.js:194:1
sh.waitForBalancer@src/mongo/shell/utils_sh.js:287:9
sh.startBalancer@src/mongo/shell/utils_sh.js:164:5
@(shell):1:1
 
2018-04-16T12:11:31.760-0700 E QUERY    [thread1] Error: assert.soon failed, msg:Waited too long for lock balancer to change to state undefined :
doassert@src/mongo/shell/assert.js:15:14
assert.soon@src/mongo/shell/assert.js:199:13
sh.waitForDLock@src/mongo/shell/utils_sh.js:194:1
sh.waitForBalancer@src/mongo/shell/utils_sh.js:287:9
sh.startBalancer@src/mongo/shell/utils_sh.js:164:5

Comment by Kaloian Manassiev [ 16/Apr/18 ]

The mode:off setting means that the balancer is disabled; this value is written when sh.stopBalancer() is run. It is a new setting we introduced in 3.4, but we masked it behind the balancerStart/balancerStop/balancerStatus commands so it doesn't need to be inspected directly. It looks like a bug that sh.getBalancerState() doesn't consult the balancerStatus command and instead looks directly at the settings.

You can re-enable the balancer by running sh.startBalancer(). In the meantime I will file a separate ticket to fix getBalancerState.
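For reference, the balancerStatus command reports the mode the config server primary is actually using, so it is a more direct check than reading config.settings; a sketch, run against a mongos (the reply shape below is illustrative):

db.adminCommand({ balancerStatus: 1 })
{ "mode" : "off", "inBalancerRound" : false, "numBalancerRounds" : NumberLong(0), "ok" : 1 }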

Comment by Matthew Kruse [ 16/Apr/18 ]

mongos> use config;
switched to db config
mongos> db.settings.find({}).pretty();
{ "_id" : "balancer", "stopped" : false, "mode" : "off" }

Comment by Kaloian Manassiev [ 16/Apr/18 ]

Can you please let me know what the contents of the config.settings collection are?

use config;
db.settings.find({}).pretty();

If the balancer is not explicitly stopped, then it is possible that there is a balancer window configured.
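If a balancer window is configured, it also lives in the balancer document of config.settings and would show up in the same query; such a document looks roughly like this (the times are placeholders):

{ "_id" : "balancer", "activeWindow" : { "start" : "23:00", "stop" : "06:00" } }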

Comment by Matthew Kruse [ 16/Apr/18 ]

The mongoc primary is just repeating the messages above over and over now.

Comment by Matthew Kruse [ 16/Apr/18 ]

Did that

2018-04-16T10:49:59.468-0700 D SHARDING [Balancer] found 10 shards listed on config server(s) with lastVisibleOpTime: { ts: Timestamp(1523900999, 4), t: 18 }
2018-04-16T10:49:59.470-0700 D SHARDING [Balancer] Skipping balancing round because balancing is disabled
2018-04-16T10:49:59.503-0700 I NETWORK  [conn340] received client metadata from 10.54.144.44:33738 conn: { driver: { name: "NetworkInterfaceASIO-ShardRegistry", version: "3.6.2" }, os: { type: "Linux", name: "CentOS Linux release 7.4.1708 (Core) ", architecture: "x86_64", version: "Kernel 3.10.0-693.21.1.el7.x86_64" } }
2018-04-16T10:50:01.524-0700 D SHARDING [shard registry reload] Reloading shardRegistry
2018-04-16T10:50:01.525-0700 D SHARDING [shard registry reload] found 10 shards listed on config server(s) with lastVisibleOpTime: { ts: Timestamp(1523900999, 43), t: 18 }
2018-04-16T10:50:08.985-0700 I NETWORK  [conn337] end connection 10.55.201.2:45193 (179 connections now open)
2018-04-16T10:50:09.471-0700 D SHARDING [Balancer] found 10 shards listed on config server(s) with lastVisibleOpTime: { ts: Timestamp(1523901009, 3), t: 18 }
2018-04-16T10:50:09.473-0700 D SHARDING [Balancer] Skipping balancing round because balancing is disabled
2018-04-16T10:50:19.473-0700 D SHARDING [Balancer] found 10 shards listed on config server(s) with lastVisibleOpTime: { ts: Timestamp(1523901019, 33), t: 18 }
2018-04-16T10:50:19.475-0700 D SHARDING [Balancer] Skipping balancing round because balancing is disabled
2018-04-16T10:50:29.476-0700 D SHARDING [Balancer] found 10 shards listed on config server(s) with lastVisibleOpTime: { ts: Timestamp(1523901029, 4), t: 18 }
2018-04-16T10:50:29.477-0700 D SHARDING [Balancer] Skipping balancing round because balancing is disabled
2018-04-16T10:50:31.525-0700 D SHARDING [shard registry reload] Reloading shardRegistry
2018-04-16T10:50:31.525-0700 D SHARDING [shard registry reload] found 10 shards listed on config server(s) with lastVisibleOpTime: { ts: Timestamp(1523901029, 43), t: 18 }
2018-04-16T10:50:33.230-0700 I NETWORK  [conn285] end connection 10.54.144.35:37354 (178 connections now open)
2018-04-16T10:50:39.478-0700 D SHARDING [Balancer] found 10 shards listed on config server(s) with lastVisibleOpTime: { ts: Timestamp(1523901039, 2), t: 18 }
2018-04-16T10:50:39.480-0700 D SHARDING [Balancer] Skipping balancing round because balancing is disabled

The primary mongo c thinks balancing is disabled. Right after I pulled the config entries above, I ran this on a mongos:

mongos> sh.isBalancerRunning()
false
mongos> sh.getBalancerState()
true
mongos> 

Comment by Matthew Kruse [ 16/Apr/18 ]

The only abnormal thing I've come across is that the logging partition filled up on the server where one of the mongo c processes was running. Cloud Manager health checks didn't trip, and I could still log into the server and run queries.

I restarted the mongo c process and cleared up the disk space, and I still see this on a mongos:

mongos> sh.isBalancerRunning()
false
mongos> sh.getBalancerState()
true
mongos> sh.startBalancer()
assert.soon failed, msg:Waited too long for lock balancer to change to state undefined
doassert@src/mongo/shell/assert.js:15:14
assert.soon@src/mongo/shell/assert.js:199:13
sh.waitForDLock@src/mongo/shell/utils_sh.js:194:1
sh.waitForBalancer@src/mongo/shell/utils_sh.js:287:9
sh.startBalancer@src/mongo/shell/utils_sh.js:164:5
@(shell):1:1
 
2018-04-16T10:48:20.663-0700 E QUERY    [thread1] Error: assert.soon failed, msg:Waited too long for lock balancer to change to state undefined :
doassert@src/mongo/shell/assert.js:15:14
assert.soon@src/mongo/shell/assert.js:199:13
sh.waitForDLock@src/mongo/shell/utils_sh.js:194:1
sh.waitForBalancer@src/mongo/shell/utils_sh.js:287:9
sh.startBalancer@src/mongo/shell/utils_sh.js:164:5
@(shell):1:1

Comment by Kaloian Manassiev [ 16/Apr/18 ]

Hi mkruse@adobe.com,

Would it be possible to increase the sharding component log level on the primary node of the config server, let it run for a few minutes and then attach the logs? This is the command:

db.adminCommand({ setParameter: 1, logComponentVerbosity: { sharding: 1 } });
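
Once the logs are captured, the verbosity can be turned back down with the same mechanism (a sketch; -1 makes the sharding component inherit the global log level again):

db.adminCommand({ setParameter: 1, logComponentVerbosity: { sharding: -1 } });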

Given the distribution above, you are right that the balancer should be moving chunks to shard rs9.

-Kal.

Comment by Matthew Kruse [ 16/Apr/18 ]

I've seen the balancer hang exactly like this on 3.6, 3.4, 3.2, and 3.0. I'd expect the balancer process to recognize a hung state and resolve it quickly, but it doesn't seem capable of doing that in all failure cases.

Prior to the balancer getting hung in this state, I disabled it, ran a bunch of moveChunk commands, and attempted to turn it back on.
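Roughly this sequence (the namespace, chunk bound, and target shard below are placeholders, not the real values):

sh.stopBalancer()
sh.moveChunk("some.collection", { someKey: MinKey }, "rs9")   // repeated for a number of chunks
sh.startBalancer()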

Comment by Matthew Kruse [ 16/Apr/18 ]

I've also manually deleted the balancer lock in the config.locks collection. That didn't help either.
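For the record, the delete was of the general form below (a sketch, not the exact command run):

use config
db.locks.remove({ _id: "balancer" })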

Comment by Matthew Kruse [ 16/Apr/18 ]

I've bounced all mongo s and mongo c processes in the cluster; no luck getting whatever is broken to resolve itself.
