Loading...

Type: Bug
Resolution: Cannot Reproduce
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 2.4.8, 2.4.10, 2.6.4
Component/s: Sharding
Labels:

Operating System:
ALL
Steps To Reproduce:
Hide

Launch a simple sharded cluster

mlaunch --single --sharded 3

Start the repro.sh script (attached)
This prints a '.' continuously until it finds any lock in state 2 (triggered by the code in the next step).
It then stops and restarts the config server after which it enters another loop printing the # of locks it finds in state 2.
The loop terminates if and when the find operation returns no acquired locks. Often this loop never terminates implying there are stale locks

Connect to the mongos and run the following code to trigger the balancer:

o = new Array(1024*1024*4).join('.') use foo db.bar.insert({data:o}) sh.enableSharding("foo.bar") sh.shardCollection("foo.bar",{_id:1}) for (var i = 0; i < 1000; i++) db.bar.insert({data:o})

After (hopefully) just a few iterations of the loop the repro.sh script should stop & restart the config server.
Once the config server restarts there should be some locks stuck in state 2. This is shown by a continuous stream of numbers being output, ie the script doesn't terminate as these locks are in a constant acquired state.

Sometimes it may be necessary to run the repro.sh script more than once for it to demonstrate the problem.

Typically I'm seeing 1 lock left in state 2, sometimes 2, when the config server restarts - sometimes (rarely) it's the balancer lock itself, sometimes it's a collection lock.

When the balancer lock is stuck it becomes impossible to start/stop the balancer, ie sh.stopBalancer() will timeout.

When a collection lock is stuck subsequent attempts to move chunks fail, e.g. the following code fails:

sh.stopBalancer() db.getSisterDB("config").chunks.find().forEach(function(doc) { db.runCommand({moveChunk: "foo.bar", bounds: [{_id:doc.min._id},{_id:doc.max._id}], to: "shard0000"})}); sh.startBalancer()

It results in aborted messages in the mongod.log files and aborted entries in the changelog collection. And while the balancer can start & stop in this state it too is unable to move chunks, effectively disabling it ability to balance the cluster.
Show
Launch a simple sharded cluster mlaunch --single --sharded 3 Start the repro.sh script (attached) This prints a '.' continuously until it finds any lock in state 2 (triggered by the code in the next step). It then stops and restarts the config server after which it enters another loop printing the # of locks it finds in state 2. The loop terminates if and when the find operation returns no acquired locks. Often this loop never terminates implying there are stale locks Connect to the mongos and run the following code to trigger the balancer : o = new Array(1024*1024*4).join( '.' ) use foo db.bar.insert({data:o}) sh.enableSharding( "foo.bar" ) sh.shardCollection( "foo.bar" ,{_id:1}) for ( var i = 0; i < 1000; i++) db.bar.insert({data:o}) After (hopefully) just a few iterations of the loop the repro.sh script should stop & restart the config server. Once the config server restarts there should be some locks stuck in state 2. This is shown by a continuous stream of numbers being output, ie the script doesn't terminate as these locks are in a constant acquired state. Sometimes it may be necessary to run the repro.sh script more than once for it to demonstrate the problem. Typically I'm seeing 1 lock left in state 2, sometimes 2, when the config server restarts - sometimes (rarely) it's the balancer lock itself, sometimes it's a collection lock. When the balancer lock is stuck it becomes impossible to start/stop the balancer , ie sh.stopBalancer() will timeout. When a collection lock is stuck subsequent attempts to move chunks fail, e.g. the following code fails: sh.stopBalancer() db.getSisterDB( "config" ).chunks.find().forEach(function(doc) { db.runCommand({moveChunk: "foo.bar" , bounds: [{_id:doc.min._id},{_id:doc.max._id}], to: "shard0000" })}); sh.startBalancer() It results in aborted messages in the mongod.log files and aborted entries in the changelog collection. And while the balancer can start & stop in this state it too is unable to move chunks, effectively disabling it ability to balance the cluster.
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

If the (primary) config server goes down while the balancer is running it can result in stale locks which don't get released even after the config server is brought back up. This can result in a failure to start/stop the balancer and an inability to perform moveChunk commands. Manual intervention is then required to verify the locks are stale and to release them.

There appear to be a few scenarios:

In rare cases the balancer lock may get stuck, resulting in an inability to start/stop the balancer, ie sh.stopBalancer() will timeout and display an error:

> sh.stopBalancer()
Updated 1 new record(s) in 7ms
Waiting for active hosts...
Waiting for the balancer lock...
assert.soon failed: function (){
        var lock = db.getSisterDB( "config" ).locks.findOne({ _id : lockId })

        if( state == false ) return ! lock || lock.state == 0
        if( state == true ) return lock && lock.state == 2
        if( state == undefined ) return (beginTS == undefined && lock) ||
                                        (beginTS != undefined && ( !lock || lock.ts + "" != beginTS + "" ) )
        else return lock && lock.state == state
    }, msg:Waited too long for lock balancer to unlock
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at doassert (src/mongo/shell/assert.js:6:5)
    at Function.assert.soon (src/mongo/shell/assert.js:110:60)
    at Function.sh.waitForDLock (src/mongo/shell/utils_sh.js:156:12)
    at Function.sh.waitForBalancerOff (src/mongo/shell/utils_sh.js:224:12)
    at Function.sh.waitForBalancer (src/mongo/shell/utils_sh.js:254:12)
    at Function.sh.stopBalancer (src/mongo/shell/utils_sh.js:126:8)
    at (shell):1:4
Balancer still may be active, you must manually verify this is not the case using the config.changelog collection.
Thu Aug 21 23:01:13.142 assert.soon failed: function (){
        var lock = db.getSisterDB( "config" ).locks.findOne({ _id : lockId })

        if( state == false ) return ! lock || lock.state == 0
        if( state == true ) return lock && lock.state == 2
        if( state == undefined ) return (beginTS == undefined && lock) ||
                                        (beginTS != undefined && ( !lock || lock.ts + "" != beginTS + "" ) )
        else return lock && lock.state == state
    }, msg:Waited too long for lock balancer to unlock at src/mongo/shell/utils_sh.js:228

I also see the following:

> use config
switched to db config
> db.locks.find({state:2}, {_id:1, why:1})
{
  "_id": "balancer",
  "why": "doing balance round"
}

a collection lock may get stuck, resulting in the balancer failing to move chunks. This leads to errors in the mongod.log files:

Thu Aug 21 23:26:35.362 [conn5] about to log metadata event: { _id: "hal.local-2014-08-21T22:26:35-53f6721b510fb7b6210e5146", server: "hal.local", clientAddr: "192.168.1.129:53350", time: new Date(1408659995362), what: "moveChunk.from", ns: "foo.bar", details: { min: { _id: 2.0 }, max: { _id: 3.0 }, step1 of 6: 0, note: "aborted" } }

Similarly the changelog collection shows aborted messages:

> db.changelog.find({}, {"details.note":1}).sort({time:-1}).limit(1)
{
  "_id": "hal.local-2014-08-21T22:37:18-53f6749e510fb7b6210e51b1",
  "details": {
    "note": "aborted"
  },
  "time": ISODate("2014-08-21T22:37:18.137Z")
}

The locks collection also shows entries similar to:

> db.locks.find({state:2}, {_id:1, why:1})
{
  "_id": "foo.bar",
  "why": "migrate-{ _id: 2.0 }"
}

The behavior in 2.4.x vs 2.6.x is a little different:

2.4.x seems quite susceptible to the problem and it can be triggered quite easily using the repro steps provided.
2.6.x seems to recover from stale locks shortly after the restart of the config server in most instances but there have been rare occasions where it also maintains a lock after the restart. When there are stale locks in 2.6.x it typically relates to just the balancer lock.

While my reproduction steps involve taking down the config server it is also likely that the same problem could occur if there are network issues between cluster members and the config server

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

repro.sh
0.9 kB
Aug 21 2014 11:05:54 PM UTC

is related to

SERVER-7260 Balancer lock is not relinquished

Closed

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates