[SERVER-31075] Balancer is running but it's disabled Created: 13/Sep/17  Updated: 27/Oct/23  Resolved: 14/Sep/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.4.6
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Kay Agahd
Assignee: Mark Agarunov
Resolution: Works as Designed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

Description

The balancer is running even though it's disabled:

mongos> sh.getBalancerState()
false
mongos> sh.isBalancerRunning()
true


In config.locks there is a balancer entry:

mongos> db.locks.find({_id:"balancer"})
{ "_id" : "balancer", "state" : 2, "ts" : ObjectId("59b93d04d035a79b193cea5b"), "who" : "ConfigServer:Balancer", "process" : "ConfigServer", "when" : ISODate("2017-09-13T14:13:37.034Z"), "why" : "CSRS Balancer" }


We restarted all routers (mongos) and configservers but the balancer is still running.
By the way, the cluster is not yet in production, so it is not yet in use and is nearly empty.

mongos> show dbs
admin   0.000GB
config  0.007GB
mongos> sh.status()
--- Sharding Status ---
  sharding version: {
	"_id" : 1,
	"minCompatibleVersion" : 5,
	"currentVersion" : 6,
	"clusterId" : ObjectId("598ab847e97e92ba40fa56ee")
}
  shards:
	{  "_id" : "offerHistory01",  "host" : "offerHistory01/mongo-016.db00.pro06.eu.idealo.com:27017,mongo-105.db00.pro05.eu.idealo.com:27017",  "state" : 1 }
	{  "_id" : "offerHistory02",  "host" : "offerHistory02/mongo-119.db00.pro06.eu.idealo.com:27017,mongo-120.db00.pro06.eu.idealo.com:27017",  "state" : 1 }
	{  "_id" : "offerHistory03",  "host" : "offerHistory03/mongo-121.db00.pro06.eu.idealo.com:27017,mongo-122.db00.pro06.eu.idealo.com:27017",  "state" : 1 }
	{  "_id" : "offerHistory04",  "host" : "offerHistory04/mongo-047.db00.pro06.eu.idealo.com:27017,mongo-123.db00.pro06.eu.idealo.com:27017",  "state" : 1 }
  active mongoses:
	"3.4.6" : 5
  balancer:
	Currently enabled:  no
	Currently running:  yes
		Balancer lock taken at Wed Sep 13 2017 16:13:37 GMT+0200 (CEST) by ConfigServer:Balancer
	Failed balancer rounds in last 5 attempts:  5
	Last reported error:  Cannot accept sharding commands if not started with --shardsvr
	Time of Reported error:  Fri Aug 25 2017 08:46:00 GMT+0200 (CEST)
	Migration Results for the last 24 hours:
		3 : Success
  databases:
mongos>



Comments
Comment by Kaloian Manassiev [ 14/Sep/17 ]

Thanks for confirming, kay.agahd@idealo.de!

Just for posterity, this is the documentation link, which explains the 3.4 behaviour change: https://docs.mongodb.com/manual/tutorial/manage-sharded-cluster-balancer/#check-if-balancer-is-running

Comment by Kay Agahd [ 14/Sep/17 ]

Hi kaloian.manassiev,

That's it, you're right! We were using the v3.2 mongo shell. The v3.4 mongo shell correctly shows that the balancer is stopped:

MongoDB shell version v3.4.6
connecting to: mongodb://localhost:27017/admin
MongoDB server version: 3.4.6
mongos> sh.getBalancerState()
false
mongos> sh.isBalancerRunning()
false
mongos>


You may close this ticket as resolved.
Thanks!

Comment by Kaloian Manassiev [ 14/Sep/17 ]

Hi kay.agahd@idealo.de,

Can you please confirm whether you are using version 3.4 of the mongo shell to run these commands?

In MongoDB 3.4 we moved the balancer to run on the primary of the CSRS config servers. As a result, it now holds the balancer lock permanently, in order to prevent any 3.2 or earlier mongos instances that were accidentally left running from doing balancing.

Because of this, the functions in pre-3.4 shells that rely on the status of the balancer lock no longer work. Instead, we introduced a new command, balancerStatus.

Best regards,
-Kal.
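
The behaviour change described in the comment above can be illustrated with a minimal sketch. It assumes that a pre-3.4 shell treats any held balancer lock (state greater than 0) as "running", and that a 3.4 shell reads the inBalancerRound field of a balancerStatus-style reply. The helper names isRunningFromLock and isRunningFromStatus are hypothetical, and the reply values are illustrative rather than taken from the ticket.

```javascript
// Hypothetical helper: models how a pre-3.4 shell might infer "running"
// from the balancer document in config.locks (assumed logic).
function isRunningFromLock(lockDoc) {
  if (lockDoc === null || lockDoc === undefined) {
    return false; // no lock document: treat as not running
  }
  return lockDoc.state > 0; // any held lock is read as "running"
}

// Hypothetical helper: models a 3.4-style check that relies on a
// balancerStatus-shaped reply instead of the lock document.
function isRunningFromStatus(reply) {
  return reply.ok === 1 && reply.inBalancerRound === true;
}

// Lock document as shown in the ticket (fields trimmed for brevity).
const lockDoc = { _id: "balancer", state: 2, who: "ConfigServer:Balancer", why: "CSRS Balancer" };

// Illustrative reply for a disabled, idle balancer (values are assumed).
const statusReply = { mode: "off", inBalancerRound: false, ok: 1 };

console.log(isRunningFromLock(lockDoc));       // true: the permanent lock misleads the old check
console.log(isRunningFromStatus(statusReply)); // false: the status-based check is unaffected
```

Against a live 3.4 cluster, the reply would come from running db.adminCommand({ balancerStatus: 1 }) on a mongos rather than from a hard-coded object.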

Comment by Kay Agahd [ 14/Sep/17 ]

Hello mark.agarunov,

I tried to upload the log files, without success:

scp -P 722 file.log.tgz SERVER-31075@www.mongodb.com:


Could you please provide us with a non-public upload location?

We generally keep our log files for only 7 days, so the log file containing the cause has probably been rotated already.
However, I kept the log file of the router which was exclusively used during our pre-service testing. It may contain what you are looking for. Its size is 991 MB uncompressed, 66 MB compressed.

Comment by Mark Agarunov [ 13/Sep/17 ]

Hello kay.agahd@idealo.de,

Thank you for the report. It seems from the output that there may be a lock that has not been relinquished from a previous run of the balancer. To get a better idea of what may be causing this, could you please provide the logs from all affected mongos and mongod nodes?

Thanks,
Mark

Comment by Kay Agahd [ 13/Sep/17 ]

Also quite weird:

mongos> db.settings.find()
{ "_id" : "balancer", "stopped" : true, "mode" : "full" }
{ "_id" : "chunksize", "value" : 1024 }
mongos> sh.isBalancerRunning()
true
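
The output above mixes two separate questions: whether the balancer is enabled and whether it is running. As a rough sketch, the "enabled" answer can be modelled as a read of the stopped flag in the config.settings balancer document. The helper name isBalancerEnabled is hypothetical. Treating a missing document as "enabled" is an assumption about the default, not something stated in the ticket.

```javascript
// Hypothetical helper: derives the "enabled" flag from the balancer
// document in config.settings (assumed logic, not copied from the shell).
function isBalancerEnabled(settingsDoc) {
  if (settingsDoc === null || settingsDoc === undefined) {
    return true; // assumption: no settings document means the default, enabled
  }
  return !settingsDoc.stopped;
}

// Settings document as shown in the ticket output.
const settingsDoc = { _id: "balancer", stopped: true, mode: "full" };

console.log(isBalancerEnabled(settingsDoc)); // false: the balancer is disabled
```

Because this flag lives in config.settings while the old "running" check looks at config.locks, the two helpers can disagree when the lock is held for a reason other than an active balancing round.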
