[SERVER-29599] Balancer never relinquishes lock Created: 13/Jun/17  Updated: 27/Oct/23  Resolved: 13/Jun/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.4.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Scott Glajch Assignee: Kaloian Manassiev
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

3.4.4 sharded cluster with 18 shards, each consisting of 1 replica, 1 primary, and 1 hidden replica. 3 config servers (CSRS) and 5 mongoS


Operating System: ALL
Steps To Reproduce:
  1. Stop the balancer
  2. Wait for the balancer to finish its migration and stop
  3. Check locks collection for the balancer lock

Let me know if you need more information to help reproduce. I'm not sure what you need now but I'm sure you'll need something.

Participants:

 Description   

After upgrading our main mongo cluster from 3.2.12 to 3.4.4, we've noticed a weird behavior where the balancer never relinquishes its lock. I can run sh.isBalancerRunning() and sh.getBalancerState(), both of which return false, but the balancer lock still shows a state of "2".
Found using:

db.getSiblingDB("config").locks.findOne({_id: "balancer"}).state

I've checked the changelog collection and haven't found any evidence there that the balancer is still actually running.

We have also had a problem for a while with moving chunks in this cluster because the index definitions do not match across the shards. We are blocked from repairing them by another bug with dropping indexes, which I'll log elsewhere and link to this.

We turn off the balancer every night to do some system maintenance, and for now we've had to free the balancer lock manually; otherwise the maintenance gets stuck waiting for the balancer to finish its migration.

On a possibly related note, I've had to fix this balancer lock a few times in the past few days, so either some process on our end keeps re-enabling the balancer, or the lock keeps getting re-established on its own.



 Comments   
Comment by Scott Glajch [ 13/Jun/17 ]

You're right. After looking into it, we found that we had written code on our end that checks the lock state directly. I've updated our code and everything is fine now. Thank you for the quick response!

Comment by Kaloian Manassiev [ 13/Jun/17 ]

I don't think the MongoDB Java driver has any means for controlling the balancer, only the shell helpers do.

Comment by Scott Glajch [ 13/Jun/17 ]

Ok thanks! I think perhaps the Java MongoDB driver we're using to determine if the balancer is still running might just need an upgrade. Hopefully that fixes our issue. I'll get back to you shortly on that.

Comment by Kaloian Manassiev [ 13/Jun/17 ]

Hi glajchs,

Starting in MongoDB version 3.4, we moved the balancer to run on the primary of the config server. As of this change, the balancer lock is intentionally never released, in order to prevent any 3.2 or earlier mongos nodes that were accidentally left running from taking it. This is documented here.

This indeed means that some of the older mongo shell utilities are not compatible with 3.4, so we recommend using the 3.4 shell. The implementation of sh.isBalancerRunning now uses a new command called balancerStatus.

Hope this helps.

Best regards,
-Kal.
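
For anyone with a maintenance script that polls config.locks as described above, the following is a minimal sketch of a check built on the balancerStatus command instead. The helper name balancerIsIdle is hypothetical and not part of the mongo shell. It assumes the result document carries the mode, inBalancerRound, and ok fields that balancerStatus reports on 3.4.

```javascript
// Hypothetical helper (not part of the mongo shell): interprets the document
// returned by db.adminCommand({ balancerStatus: 1 }) on MongoDB 3.4+.
function balancerIsIdle(status) {
  // mode is "off" once the balancer has been disabled; inBalancerRound is
  // false when no balancing round is currently executing.
  return status.ok === 1 &&
         status.mode === "off" &&
         status.inBalancerRound === false;
}

// In a 3.4 mongo shell connected to a mongos, this could be called as:
//   balancerIsIdle(db.adminCommand({ balancerStatus: 1 }))
```

A nightly job could poll this until it returns true, rather than waiting for the state field of the balancer lock document to change, which on 3.4 stays at 2 by design.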
