[SERVER-14996] Locks not released if config server goes down / balancer operation and moveChunk command stop working Created: 21/Aug/14  Updated: 21/May/15  Resolved: 21/May/15

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.4.8, 2.4.10, 2.6.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ronan Bohan Assignee: Ramon Fernandez Marina
Resolution: Cannot Reproduce Votes: 0
Labels: balancer, moveChunk, sharding
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File repro.sh    
Issue Links:
Related
is related to SERVER-7260 Balancer lock is not relinquished Closed
Operating System: ALL
Steps To Reproduce:
  1. Launch a simple sharded cluster

     mlaunch --single --sharded 3
     

  2. Start the repro.sh script (attached)
    This prints a '.' continuously until it finds any lock in state 2 (triggered by the code in the next step).
    It then stops and restarts the config server, after which it enters another loop printing the number of locks it finds in state 2.
    The loop terminates if and when the find operation returns no acquired locks. Often this loop never terminates, implying there are stale locks (a sketch of the script's polling logic appears after these steps).
  3. Connect to the mongos and run the following code to trigger the balancer:

     o = new Array(1024*1024*4).join('.')
     use foo
     db.bar.insert({data:o})
     sh.enableSharding("foo.bar")
     sh.shardCollection("foo.bar",{_id:1})
     for (var i = 0; i < 1000; i++) db.bar.insert({data:o})

  4. After (hopefully) just a few iterations of the loop, the repro.sh script should stop and restart the config server.
    Once the config server restarts there should be some locks stuck in state 2. This is shown by a continuous stream of numbers being output, i.e. the script doesn't terminate because these locks remain in the acquired state.
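
The attached repro.sh is authoritative; as a rough illustration only, the lock polling it performs can be approximated from a mongo shell connected to the mongos with something like the sketch below. The exact queries in repro.sh may differ, and the stop/restart of the config server itself happens outside the shell.

    // Sketch only - approximates the polling described in step 2, not the attached script itself.
    var cfg = db.getSisterDB("config");

    // Phase 1: wait until any distributed lock reaches state 2 (acquired).
    while (cfg.locks.find({ state: 2 }).count() == 0) {
        print(".");
        sleep(100);
    }

    // (repro.sh stops and restarts the config server at this point)

    // Phase 2: report how many locks are still in state 2. With the bug present,
    // this count often never drops to 0 after the config server comes back.
    while (true) {
        var n = cfg.locks.find({ state: 2 }).count();
        print(n);
        if (n == 0) break;
        sleep(1000);
    }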

Sometimes it may be necessary to run the repro.sh script more than once for it to demonstrate the problem.

Typically I'm seeing one lock left in state 2 when the config server restarts, sometimes two; only rarely is it the balancer lock itself, more often it's a collection lock.

When the balancer lock is stuck it becomes impossible to start/stop the balancer, i.e. sh.stopBalancer() will time out.

When a collection lock is stuck, subsequent attempts to move chunks fail, e.g. the following code fails:

sh.stopBalancer()
db.getSisterDB("config").chunks.find().forEach(function(doc) { db.runCommand({moveChunk: "foo.bar", bounds: [{_id:doc.min._id},{_id:doc.max._id}], to: "shard0000"})});
sh.startBalancer()

This results in aborted messages in the mongod.log files and aborted entries in the changelog collection. And while the balancer can be started and stopped in this state, it too is unable to move chunks, effectively disabling its ability to balance the cluster.

Participants:

 Description   

If the (primary) config server goes down while the balancer is running it can result in stale locks which don't get released even after the config server is brought back up. This can result in a failure to start/stop the balancer and an inability to perform moveChunk commands. Manual intervention is then required to verify the locks are stale and to release them.
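
For reference, that manual intervention amounts to something like the sketch below, run against the config database via a mongos. This is only an illustration: the lock must first be confirmed stale (for example by checking the mongos/mongod logs and config.changelog for the migration or balancing round in question) before its state is forced back to 0.

    // Sketch of the manual cleanup (assumes the lock has already been verified as stale).
    use config
    // Identify the stuck lock(s) and who holds them.
    db.locks.find({ state: 2 }, { _id: 1, process: 1, why: 1 })
    // Force the stale lock (here the balancer lock) back to the released state.
    // Do NOT do this if the operation that took the lock is still running.
    db.locks.update({ _id: "balancer", state: 2 }, { $set: { state: 0 } })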

There appear to be a few scenarios:

  • In rare cases the balancer lock may get stuck, resulting in an inability to start/stop the balancer, i.e. sh.stopBalancer() will time out and display an error:

    > sh.stopBalancer()
    Updated 1 new record(s) in 7ms
    Waiting for active hosts...
    Waiting for the balancer lock...
    assert.soon failed: function (){
            var lock = db.getSisterDB( "config" ).locks.findOne({ _id : lockId })
     
            if( state == false ) return ! lock || lock.state == 0
            if( state == true ) return lock && lock.state == 2
            if( state == undefined ) return (beginTS == undefined && lock) ||
                                            (beginTS != undefined && ( !lock || lock.ts + "" != beginTS + "" ) )
            else return lock && lock.state == state
        }, msg:Waited too long for lock balancer to unlock
    Error: Printing Stack Trace
        at printStackTrace (src/mongo/shell/utils.js:37:15)
        at doassert (src/mongo/shell/assert.js:6:5)
        at Function.assert.soon (src/mongo/shell/assert.js:110:60)
        at Function.sh.waitForDLock (src/mongo/shell/utils_sh.js:156:12)
        at Function.sh.waitForBalancerOff (src/mongo/shell/utils_sh.js:224:12)
        at Function.sh.waitForBalancer (src/mongo/shell/utils_sh.js:254:12)
        at Function.sh.stopBalancer (src/mongo/shell/utils_sh.js:126:8)
        at (shell):1:4
    Balancer still may be active, you must manually verify this is not the case using the config.changelog collection.
    Thu Aug 21 23:01:13.142 assert.soon failed: function (){
            var lock = db.getSisterDB( "config" ).locks.findOne({ _id : lockId })
     
            if( state == false ) return ! lock || lock.state == 0
            if( state == true ) return lock && lock.state == 2
            if( state == undefined ) return (beginTS == undefined && lock) ||
                                            (beginTS != undefined && ( !lock || lock.ts + "" != beginTS + "" ) )
            else return lock && lock.state == state
        }, msg:Waited too long for lock balancer to unlock at src/mongo/shell/utils_sh.js:228

    I also see the following:

    > use config
    switched to db config
    > db.locks.find({state:2}, {_id:1, why:1})
    {
      "_id": "balancer",
      "why": "doing balance round"
    }

  • A collection lock may get stuck, resulting in the balancer failing to move chunks. This leads to errors in the mongod.log files:

    Thu Aug 21 23:26:35.362 [conn5] about to log metadata event: { _id: "hal.local-2014-08-21T22:26:35-53f6721b510fb7b6210e5146", server: "hal.local", clientAddr: "192.168.1.129:53350", time: new Date(1408659995362), what: "moveChunk.from", ns: "foo.bar", details: { min: { _id: 2.0 }, max: { _id: 3.0 }, step1 of 6: 0, note: "aborted" } }

    Similarly the changelog collection shows aborted messages:

    > db.changelog.find({}, {"details.note":1}).sort({time:-1}).limit(1)
    {
      "_id": "hal.local-2014-08-21T22:37:18-53f6749e510fb7b6210e51b1",
      "details": {
        "note": "aborted"
      },
      "time": ISODate("2014-08-21T22:37:18.137Z")
    }

    The locks collection also shows entries similar to:

    > db.locks.find({state:2}, {_id:1, why:1})
    {
      "_id": "foo.bar",
      "why": "migrate-{ _id: 2.0 }"
    }

The behavior in 2.4.x vs 2.6.x is a little different:

  • 2.4.x seems quite susceptible to the problem and it can be triggered quite easily using the repro steps provided.
  • 2.6.x seems to recover from stale locks shortly after the config server restarts in most instances, but there have been rare occasions where it also retains a lock after the restart. When there are stale locks in 2.6.x, it is typically just the balancer lock.

While my reproduction steps involve taking down the config server, it is also likely that the same problem could occur if there are network issues between the cluster members and the config server.

