[SERVER-7260] Balancer lock is not relinquished Created: 04/Oct/12 Updated: 22/Dec/14 Resolved: 03/Jul/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.0.7, 2.2.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Ben Becker | Assignee: | Randolph Tan |
| Resolution: | Done | Votes: | 4 |
| Labels: | balancer, migration |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Operating System: | ALL |
| Participants: | |
| Description |
|
Under certain conditions, the balancer lock may never be relinquished. One case appears to have occurred when the balancer state was disabled during a chunk migration:
Note the above output was taken 15 hours after the last moveChunk was logged to the config server. It's unclear if the mongos process holding the lock was killed before it had a chance to release the lock. The net effect is that sh.isBalancerRunning() never returns false, even if the balancer is no longer running. |
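The lock that sh.isBalancerRunning() reports on is the "balancer" document in the config database's locks collection, so the stuck state can be inspected directly from a mongos. A minimal sketch, assuming the 2.0/2.2-era config.locks schema:

    // connected to a mongos; look at the balancer lock document directly
    var configDB = db.getSiblingDB("config");
    // state > 0 means the lock is considered taken; "who" and "process"
    // identify the mongos instance that acquired it
    configDB.locks.find({ _id: "balancer" }).pretty();

If the mongos named in that document is no longer running, the lock has effectively been orphaned.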
| Comments |
| Comment by Andrey Miroshnichenko [ 22/Dec/14 ] |
|
I have the same issue now on v. 2.2.3.
|
| Comment by Thomas Rueckstiess [ 03/Jul/14 ] |
|
Hi Guoming, We haven't heard back from you and we can't continue the diagnosis without additional information. I'm going to resolve this ticket now. If this is still an issue, feel free to re-open the ticket and provide the verbose logs Randolph was asking for. Thanks, |
| Comment by Ramon Fernandez Marina [ 10/Jun/14 ] |
|
xuguoming@nexlabs.com, we haven't heard back from you for a while. If this is still an issue for you, can you please send us the information that renctan requested? Thanks, Ramón. |
| Comment by Randolph Tan [ 21/Apr/14 ] |
|
It appears that there is a problem in one of the config servers. I need to gather more information before I can determine the best course of action. Can you set the verbosity on all config servers to 1 and upload the logs? You can do this by connecting to each of the config servers and sending the command:
You can also use the same command to set the log verbosity back to the original value. Thanks! |
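Presumably the command in question is the setParameter form below, a sketch rather than the exact text Randolph posted:

    // run against the admin database of each config server
    db.adminCommand({ setParameter: 1, logLevel: 1 });
    // the same call with logLevel: 0 restores the default verbosity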
| Comment by Yang [ 21/Apr/14 ] |
|
Randolph, I see a number of errors as below. It appears something is wrong with the config DBs. How do I fix that?

    Thu Apr 17 20:00:06.384 [LockPinger] creating distributed lock ping thread for configserver1:port1,configserver2:port2,configserver3:port3 and process mongosserver:port4:1397702879:41 (sleeping for 30000ms)
    update: { $set: { state: 1, who: "mongosserver:port4:1397702879:41:Balancer:18467", process: "mongosserver:port4:1397702879:41", when: new Date(1397736006387), why: "doing balance round", ts: ObjectId('534fc2462db7a3a1f9f24e8e') }} gle1: { err: "!loc.isNull()", n: 0, connectionId: 35, waited: 0, ok: 1.0 } gle2: { updatedExisting: true, n: 1, lastOp: Timestamp 1397736006000|2, connectionId: 28, waited: 44, err: null, ok: 1.0 }
    Thu Apr 17 20:00:07.361 [LockPinger] scoped connection to configserver1:port1,configserver2:port2,configserver3:port3 not being returned to the pool
    update: { $set: { ping: new Date(1397736006384) }} gle1: { err: "!loc.isNull()", n: 0, connectionId: 5, waited: 387, ok: 1.0 } gle2: { updatedExisting: false, n: 1, lastOp: Timestamp 1397736006000|1, connectionId: 5, waited: 44, err: null, ok: 1.0 }
    Thu Apr 17 20:00:07.369 [Balancer] lock update won, completing lock propagation for 'balancer/mongosserver:port4:1397702879:41'
    update: { $set: { state: 2, who: "mongosserver:port4:1397702879:41:Balancer:18467", process: "mongosserver:port4:1397702879:41", when: new Date(1397736006387), why: "doing balance round", ts: ObjectId('534fc2462db7a3a1f9f24e8e') }} gle1: { err: "!loc.isNull()", n: 0, connectionId: 35, waited: 140, ok: 1.0 } gle2: { updatedExisting: true, n: 1, lastOp: Timestamp 1397736007000|1, connectionId: 28, waited: 11, err: null, ok: 1.0 }
    Thu Apr 17 20:00:13.943 [Balancer] SyncClusterConnection connecting to [configserver1:port1]
    update: { $set: { ping: new Date(1397736037361) }} gle1: { err: "!loc.isNull()", n: 0, connectionId: 36, waited: 150, ok: 1.0 } gle2: { updatedExisting: true, n: 1, lastOp: Timestamp 1397736037000|1, connectionId: 29, waited: 30, err: null, ok: 1.0 }
    Thu Apr 17 20:00:39.369 [Balancer] SyncClusterConnection connecting to [configserver1:port1] |
| Comment by Randolph Tan [ 17/Apr/14 ] |
|
Is the balancer lock still taken? If yes, can you provide the following:
1. mongodump of the config database.
Thanks! |
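For item 1, a dump of just the config database can be taken with mongodump; a sketch, with the host placeholder matching the log excerpts above:

    mongodump --host configserver1:port1 --db config --out ./config-dump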
| Comment by Yang [ 17/Apr/14 ] |
|
Hi Randolph Tan, No. It's doing nothing. The last entry in the changelog is from two days ago. After trying sh.stopBalancer() a few times, I now see "stopped: true" in the settings entry for the balancer. |
| Comment by Randolph Tan [ 16/Apr/14 ] |
|
Was there any balancer activity during that time? Is the same lock still being held? The balancer lock is held for the entire duration of the balancing round, and this can sometimes take a long time, depending on how many chunks it needs to move. |
| Comment by Yang [ 16/Apr/14 ] |
|
I am seeing a similar issue in MongoDB 2.4.5 running on Windows Server 2008 R2/SP1. In addition to changelog and locks records similar to those posted by Stephen Hooper above, I also noticed that it's impossible to stop the balancer by issuing sh.stopBalancer(), as I receive the errors below.

    ) if( state == false ) return ! lock || lock.state == 0
    ) if( state == false ) return ! lock || lock.state == 0
    update: {$set: { stopped: true }} gle1: { err: "!loc.isNull()", n: 0, connectionId: 4655, waited: 130, ok: 1.0 } gle2: { updatedExisting: true, n: 1, lastOp: Timestamp 1397637878000|1, connectionId: 4639, waited: 14, err: null, ok: 1.0 } |
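For reference, sh.stopBalancer() sets the stopped flag on the balancer document in config.settings and then waits for the balancer lock to be released, which is why the flag can read "stopped: true" while a stale lock keeps sh.isBalancerRunning() returning true. A sketch for checking the flag directly, assuming the 2.2/2.4-era config.settings schema:

    // connected to a mongos; inspect the balancer on/off flag directly
    var configDB = db.getSiblingDB("config");
    configDB.settings.find({ _id: "balancer" });
    // sh.setBalancerState(false) does roughly the equivalent of:
    // configDB.settings.update({ _id: "balancer" }, { $set: { stopped: true } }, true);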
| Comment by Randolph Tan [ 05/Dec/13 ] |
|
A state of 1 is an intermediate state that means the lock is currently being contended by multiple participants requesting the lock. The time it takes to resolve (move to a state of 2) depends on the number of participants. |
| Comment by Egan Neuhengen [ 04/Dec/13 ] |
|
This looks old and left uncompleted, but I would like to report seeing this bug in version 2.4.8. The output of these commands distinctly implies this error:

    mongos> sh.isBalancerRunning()

Note that state is 1, even though we're on version 2.4 and the documentation states an active lock should have a state of 2 in versions 2.0+. I do not know what this signifies. |
| Comment by Randolph Tan [ 22/Feb/13 ] |
|
Found out that distributed locks can be left locked when the mongos process exits while holding the lock. |
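A commonly cited manual workaround for a lock orphaned this way, offered here as an assumption rather than an official resolution of this ticket, is to force the lock document back to the unlocked state once the balancer is stopped and no migration is in progress:

    // ONLY after stopping the balancer and confirming no moveChunk is running;
    // resets the orphaned balancer lock to the unlocked state
    var configDB = db.getSiblingDB("config");
    configDB.locks.update({ _id: "balancer" }, { $set: { state: 0 } });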
| Comment by Randolph Tan [ 03/Jan/13 ] |
|
The balancer holds the balancer lock for a single balancing round and should release it afterwards. However, more than one chunk may need to be moved within a single round (which involves several collections), so the lock can be held for an extended period of time. |
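One way to tell whether a long-held lock still corresponds to an active round is to check for recent migration entries in config.changelog, which is effectively what other comments in this thread do; a sketch:

    // connected to a mongos; most recent migration activity recorded by the cluster
    var configDB = db.getSiblingDB("config");
    configDB.changelog.find({ what: /moveChunk/ }).sort({ time: -1 }).limit(5);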
| Comment by Stephen Hooper [ 28/Dec/12 ] |
|
I am also seeing something similar in 2.2.2. A single chunk is moved over, but the locks remain open.
The changelog shows the chunk committed.
However, the locks remain open and a single thread is stuck in step 5 of 6 for a significant amount of time.
|
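A migration stuck mid-step like this should also be visible on the donor shard's primary, where in-flight moveChunk operations report their progress in currentOp; a sketch, assuming the 2.2-era "step N of 6" progress messages:

    // run against the donor shard's primary, not through mongos
    db.currentOp(true).inprog.filter(function (op) {
        return op.msg && /step \d of 6/.test(op.msg);
    });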
| Comment by Ben Becker [ 08/Oct/12 ] |
|
Yes, the skipped entries were strictly splits, to the best of my recollection. Perhaps this is also related to
Running short on time to determine exact repro steps; I can dig further, though. |
| Comment by Eliot Horowitz (Inactive) [ 06/Oct/12 ] |
|
Are you sure there wasn't a different thing moving? |