[SERVER-8348] balancer locked in state 0 Created: 25/Jan/13  Updated: 01/Apr/13  Resolved: 01/Apr/13

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.2.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: joe piscitella Assignee: Andre de Frere
Resolution: Done Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 12.04.01 LTS


Attachments: File mlog     File o    
Operating System: Linux
Steps To Reproduce:

Above is the information from the mongos log from when the lock was acquired, along with the output of the locks query from the mongos.

Participants:

 Description   

The balancer seems to have gotten stuck after acquiring a lock, which then stopped all balancing until we logged into the mongos and stopped/started the balancer.

mongos log-
Fri Jan 18 00:45:02 [Balancer] distributed lock 'balancer/ms-db-4:27017:1354112926:1804289383' acquired, ts : 50f8fd7e8d5ca2b8deff6383

use config
mongos> db.locks.find({ _id : "balancer" }).pretty()
{
    "_id" : "balancer",
    "process" : "ms-db-4:27017:1354112926:1804289383",
    "state" : 0,
    "ts" : ObjectId("50f8fd7e8d5ca2b8deff6383"),
    "when" : ISODate("2013-01-18T07:45:02.249Z"),
    "who" : "ms-db-4:27017:1354112926:1804289383:Balancer:846930886",
    "why" : "doing balance round"
}

We didn't notice until today that no balancing was happening.
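
As mentioned in the comments below, the restart was done with the standard shell helpers from a mongos; a rough sketch of that sequence:

// Restarting the balancer from any mongos (sketch based on the helpers named in the comments)
sh.stopBalancer()       // waits until any in-progress balance round has finished
sh.startBalancer()
sh.getBalancerState()   // should now report true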



 Comments   
Comment by joe piscitella [ 25/Feb/13 ]

Confirmed that it was a bug in our code and not a MongoDB bug. Please close the issue.

Comment by Andre de Frere [ 25/Feb/13 ]

Hi Joe,

I'll set this to Waiting on User Input until we hear back from you on how your testing went.

Regards,
André

Comment by joe piscitella [ 22/Feb/13 ]

Andre, thank you for the feedback. We changed the backup code and are testing.

Comment by Andre de Frere [ 19/Feb/13 ]

Hi Joe,

We are seeing a lot of those "did not propagate properly. :: caused by :: 8017 update not consistent" errors in the mongos logs.

Are you consistently seeing these messages at the time you are stopping the balancer before your backup?

The balancer is getting stopped/started properly during the backups; after running the stop I do a find to get the state and echo it to the log, which shows stopped state = true.

Do you check both that the balancer is stopped (or more accurately, has had a stop requested) and that the balancer is no longer balancing? This is the difference between sh.getBalancerState() and sh.isBalancerRunning(). The balancer can be requested to stop by changing the collection (as you are doing), but may still be in the middle of a balance round and therefore will still report as running (or the state key in the locks collection will be >0 for { _id : 'balancer' }). For this reason, the sh.stopBalancer() helper will wait until the balancer has actually finished before returning.
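
As a rough sketch of that pre-backup check, using the helpers named above (the loop below is illustrative and approximates what sh.stopBalancer() is described as doing):

// Ask the balancer to stop, then wait until no balance round is in progress (sketch only).
sh.setBalancerState(false)           // request the stop
while (sh.isBalancerRunning()) {     // still in the middle of a balance round?
    print("waiting for the balancer to finish its current round...")
    sleep(1000)                      // shell sleep, in milliseconds
}
print("balancer state: " + sh.getBalancerState())   // should now be false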

Regards,
André

Comment by joe piscitella [ 18/Feb/13 ]

We are seeing a lot of those "did not propagate properly. :: caused by :: 8017 update not consistent" errors in the mongos logs.

After running dbHash on all three config servers, the only difference between them is in the changelog collection.

The balancer is getting stopped/started properly during the backups; after running the stop I do a find to get the state and echo it to the log, which shows stopped state = true.

Comment by Andre de Frere [ 18/Feb/13 ]

Hi Joe,

The "o" log file you have attached indicates that the balancer lock state is being prevented from being set to 1 because the update is not consistent:

Mon Feb 11 00:43:43 [Balancer] warning: distributed lock 'balancer/db-mdb-12.:27017:1355258231:1804289383 did not propagate properly. :: caused by :: 8017 update not consistent  ns: config.locks query: { _id: "balancer", state: 0, ts: ObjectId('5118a12b57a3e60537d10d90') } update: { $set: { state: 1, who: "db-mdb-12.:27017:1355258231:1804289383:Balancer:846930886", process: "db-mdb-12.:27017:1355258231:1804289383", when: new Date(1360568623601), why: "doing balance round", ts: ObjectId('5118a12ff72247b5d1e98121') } } gle1: { updatedExisting: true, n: 1, connectionId: 211471, waited: 84, err: null, ok: 1.0 } gle2: { updatedExisting: false, n: 0, connectionId: 210743, waited: 26, err: null, ok: 1.0 }

Does this message appear every night during your backup routine? Do you check the locks collection after updating to ensure the balancer is off? Can you run a dbHash on the config database from each of the three config servers and ensure that the dbs have the same hash? The command to do this looks like the following:

use config
db.runCommand({dbHash:1})

If this happens reproducibly, would you be able to run the dbHash on all three config servers after reproducing the issue?
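
For example (the hostnames below are placeholders for your own config servers), the three hashes could be collected from one shell session along these lines:

// Sketch: print the overall config dbHash from each config server for comparison.
var cfgHosts = ["cfg1.example.net:27019", "cfg2.example.net:27019", "cfg3.example.net:27019"]
cfgHosts.forEach(function (host) {
    var res = new Mongo(host).getDB("config").runCommand({ dbHash: 1 })   // connect directly, not via mongos
    print(host + " -> " + res.md5)                                        // overall hash of the config db
})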

Regards,
André

Comment by joe piscitella [ 11/Feb/13 ]

This log does show some promise as to what is happening. It shows some errors leading up to when the balancer is turned off for backups.

Mon Feb 11 00:45:01 2013 Checking for a balancer....
Mon Feb 11 00:45:02 2013 Balancer Found!
Mon Feb 11 00:45:02 2013 Writing balancer stopped state to file
Mon Feb 11 00:45:02 2013 Pre-backup balancer stopped state is : 0
Mon Feb 11 00:45:02 2013 Stopping balancer....
Mon Feb 11 00:45:02 2013 Checking current balancer state...
Mon Feb 11 00:45:02 2013 Balancer stopped status is : true
Mon Feb 11 00:45:02 2013 Checking balancer active state...
Mon Feb 11 00:45:02 2013 Balancer active state is : 0

Comment by joe piscitella [ 10/Feb/13 ]

This is happening every night after the balancer is turned off for backups with:

$coll->update({ "_id" => "balancer" }, { '$set' => { "stopped" => "true" } });

After turning the balancer back on, it never comes to life again. Each morning I restart it via sh.stopBalancer()/sh.startBalancer() in any mongos and it starts working again.
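
For reference, a rough shell equivalent of that Perl update (assuming $coll refers to the config.settings collection) would be:

// Rough shell equivalent of the Perl update above; the string value "true" mirrors the Perl.
// Note: this only records a stop request, it does not wait for an in-flight balance round.
use config
db.settings.update({ _id : "balancer" }, { $set : { stopped : "true" } })
db.settings.find({ _id : "balancer" }).pretty()   // confirm the stop request was recorded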

The problem is there is no error, so I am not sure what to attach.

Comment by Ian Daniel [ 04/Feb/13 ]

Please can you attach the log file from the mongos covering the period in which you had this error.
