[SERVER-8348] balancer locked in state 0 Created: 25/Jan/13 Updated: 01/Apr/13 Resolved: 01/Apr/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.2.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | joe piscitella | Assignee: | Andre de Frere |
| Resolution: | Done | Votes: | 3 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Ubuntu 12.04.01 LTS |
| Attachments: | |
| Operating System: | Linux |
| Steps To Reproduce: | Above is the information from the mongos log when the lock was acquired, and the output of the locks query run from the mongos. |
| Participants: |
| Description |
|
Balancer seems to have gotten stuck after acquiring a lock, which then stopped all balancing until we logged into the mongos and stopped/started the balancer. [mongos log excerpt and the output of a locks query against the config database] We didn't notice until today that we had no balancing happening. |
| Comments |
| Comment by joe piscitella [ 25/Feb/13 ] | |||
|
Confirmed that it was a bug in our code and not a MongoDB bug. Please close the issue. | |||
| Comment by Andre de Frere [ 25/Feb/13 ] | |||
|
Hi Joe, I'll set this to Waiting on User Input until we hear back from you on how your testing went. Regards, | |||
| Comment by joe piscitella [ 22/Feb/13 ] | |||
|
Andre, thank you for the feedback. | |||
| Comment by Andre de Frere [ 19/Feb/13 ] | |||
|
Hi Joe,
Are you consistently seeing these messages at the time you are stopping the balancer before your backup?
Do you check both that the balancer is stopped (or, more accurately, has had a stop requested) and that the balancer is no longer balancing? This is the difference between sh.getBalancerState() and sh.isBalancerRunning(). The balancer can be requested to stop by updating the settings collection directly (as you are doing), but it may still be in the middle of a balancing round and will therefore still report as running (or the state field in the locks collection will be > 0 for { _id : 'balancer' }). For this reason, the sh.stopBalancer() helper waits until the balancer has actually finished before returning. Regards, | |||
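As an illustrative sketch of the checks described above (run from a mongos shell, using the standard sh.* helpers; the raw locks query is included for reference):

```
// Returns false once a stop has been requested (i.e. the "stopped" flag is
// set on { _id: "balancer" } in config.settings).
sh.getBalancerState()

// Returns true while a balancing round is still in progress (the balancer
// lock document in config.locks has state > 0).
sh.isBalancerRunning()

// Equivalent raw check of the balancer lock document:
db.getSiblingDB("config").locks.findOne({ _id: "balancer" })

// Requests a stop and then waits for any in-flight round to finish.
sh.stopBalancer()
```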
| Comment by joe piscitella [ 18/Feb/13 ] | |||
|
Seeing a lot of those "did not propagate properly. :: caused by :: 8017 update not consistent" errors in the mongos logs. After running dbHash on all three config servers, the only difference across the three is the changelog collection. The balancer is getting stopped/started properly during the backups: after running the stop I do a find to get the state and echo it to the log, which shows stopped = true. | |||
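A minimal sketch of that kind of post-update check, assuming the stopped flag and the balancer lock both live in the config database:

```
// Verify the flag the backup script sets, and the balancer lock state.
db.getSiblingDB("config").settings.findOne({ _id: "balancer" })
db.getSiblingDB("config").locks.findOne({ _id: "balancer" })
```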
| Comment by Andre de Frere [ 18/Feb/13 ] | |||
|
Hi Joe, The log file you have attached indicates that the balancer lock is being prevented from being set to 1 because the update is not consistent:
Does this message appear every night during your backup routine? Do you check the locks collection after updating to ensure the balancer is off? Can you run a dbHash on the config database from each of the three config servers and ensure that the dbs have the same hash? The command to do this looks like the following:
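A hedged sketch of that command (the hostname and port are placeholders, not taken from this ticket; run it against each config server in turn):

```
// Connect to one config server, e.g.: mongo cfg1.example.net:27019
db.getSiblingDB("config").runCommand({ dbHash: 1 })
// Compare the "md5" field (and the per-collection hashes) across all three.
```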
If this happens reproducibly, would you be able to run dbHash on all three config servers after reproducing the issue? Regards, | |||
| Comment by joe piscitella [ 11/Feb/13 ] | |||
|
This log does show some promise as to what is happening. It shows some errors leading up to when the balancer is turned off for backups: Mon Feb 11 00:45:01 2013 Checking for a balancer.... | |||
| Comment by joe piscitella [ 10/Feb/13 ] | |||
|
This is happening every night after the balancer is turned off for backups via $coll->update( {"_id" => "balancer" }, { '$set' => { "stopped" => "true" }}); After turning the balancer back on, it never comes to life again. Each morning I restart it via sh.stopBalancer()/sh.startBalancer() in any mongos and it starts working again. The problem is there is no error, so I am not sure what to attach. | |||
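For comparison, a minimal sketch of the same stop/backup/start cycle using the shell helpers, where sh.stopBalancer() also waits for any in-progress balancing round to finish before returning:

```
// Before the backup: request a stop and wait until balancing has ceased.
sh.stopBalancer()

// ... run the backup here ...

// After the backup: re-enable the balancer.
sh.startBalancer()
```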
| Comment by Ian Daniel [ 04/Feb/13 ] | |||
|
Could you please attach the log file from the mongos covering the period in which you had this error? |