Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-14996

Locks not released if config server goes down / balancer operation and moveChunk command stop working

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.4.8, 2.4.10, 2.6.4
    • Fix Version/s: None
    • Component/s: Sharding
    • Operating System:
      ALL
    • Steps To Reproduce:
      Hide
      1. Launch a simple sharded cluster

         mlaunch --single --sharded 3
         

      2. Start the repro.sh script (attached)
        This prints a '.' continuously until it finds any lock in state 2 (triggered by the code in the next step).
        It then stops and restarts the config server after which it enters another loop printing the # of locks it finds in state 2.
        The loop terminates if and when the find operation returns no acquired locks. Often this loop never terminates implying there are stale locks
      3. Connect to the mongos and run the following code to trigger the balancer:

         o = new Array(1024*1024*4).join('.')
         use foo
         db.bar.insert({data:o})
         sh.enableSharding("foo.bar")
         sh.shardCollection("foo.bar",{_id:1})
         for (var i = 0; i < 1000; i++) db.bar.insert({data:o})

      4. After (hopefully) just a few iterations of the loop the repro.sh script should stop & restart the config server.
        Once the config server restarts there should be some locks stuck in state 2. This is shown by a continuous stream of numbers being output, ie the script doesn't terminate as these locks are in a constant acquired state.

      Sometimes it may be necessary to run the repro.sh script more than once for it to demonstrate the problem.

      Typically I'm seeing 1 lock left in state 2, sometimes 2, when the config server restarts - sometimes (rarely) it's the balancer lock itself, sometimes it's a collection lock.

      When the balancer lock is stuck it becomes impossible to start/stop the balancer, ie sh.stopBalancer() will timeout.

      When a collection lock is stuck subsequent attempts to move chunks fail, e.g. the following code fails:

      sh.stopBalancer()
      db.getSisterDB("config").chunks.find().forEach(function(doc) { db.runCommand({moveChunk: "foo.bar", bounds: [{_id:doc.min._id},{_id:doc.max._id}], to: "shard0000"})});
      sh.startBalancer()

      It results in aborted messages in the mongod.log files and aborted entries in the changelog collection. And while the balancer can start & stop in this state it too is unable to move chunks, effectively disabling it ability to balance the cluster.

      Show
      Launch a simple sharded cluster mlaunch --single --sharded 3 Start the repro.sh script (attached) This prints a '.' continuously until it finds any lock in state 2 (triggered by the code in the next step). It then stops and restarts the config server after which it enters another loop printing the # of locks it finds in state 2. The loop terminates if and when the find operation returns no acquired locks. Often this loop never terminates implying there are stale locks Connect to the mongos and run the following code to trigger the balancer : o = new Array(1024*1024*4).join('.') use foo db.bar.insert({data:o}) sh.enableSharding("foo.bar") sh.shardCollection("foo.bar",{_id:1}) for (var i = 0; i < 1000; i++) db.bar.insert({data:o}) After (hopefully) just a few iterations of the loop the repro.sh script should stop & restart the config server. Once the config server restarts there should be some locks stuck in state 2. This is shown by a continuous stream of numbers being output, ie the script doesn't terminate as these locks are in a constant acquired state. Sometimes it may be necessary to run the repro.sh script more than once for it to demonstrate the problem. Typically I'm seeing 1 lock left in state 2, sometimes 2, when the config server restarts - sometimes (rarely) it's the balancer lock itself, sometimes it's a collection lock. When the balancer lock is stuck it becomes impossible to start/stop the balancer , ie sh.stopBalancer() will timeout. When a collection lock is stuck subsequent attempts to move chunks fail, e.g. the following code fails: sh.stopBalancer() db.getSisterDB("config").chunks.find().forEach(function(doc) { db.runCommand({moveChunk: "foo.bar", bounds: [{_id:doc.min._id},{_id:doc.max._id}], to: "shard0000"})}); sh.startBalancer() It results in aborted messages in the mongod.log files and aborted entries in the changelog collection. And while the balancer can start & stop in this state it too is unable to move chunks, effectively disabling it ability to balance the cluster.

      Description

      If the (primary) config server goes down while the balancer is running it can result in stale locks which don't get released even after the config server is brought back up. This can result in a failure to start/stop the balancer and an inability to perform moveChunk commands. Manual intervention is then required to verify the locks are stale and to release them.

      There appear to be a few scenarios:

      • In rare cases the balancer lock may get stuck, resulting in an inability to start/stop the balancer, ie sh.stopBalancer() will timeout and display an error:

        > sh.stopBalancer()
        Updated 1 new record(s) in 7ms
        Waiting for active hosts...
        Waiting for the balancer lock...
        assert.soon failed: function (){
                var lock = db.getSisterDB( "config" ).locks.findOne({ _id : lockId })
         
                if( state == false ) return ! lock || lock.state == 0
                if( state == true ) return lock && lock.state == 2
                if( state == undefined ) return (beginTS == undefined && lock) ||
                                                (beginTS != undefined && ( !lock || lock.ts + "" != beginTS + "" ) )
                else return lock && lock.state == state
            }, msg:Waited too long for lock balancer to unlock
        Error: Printing Stack Trace
            at printStackTrace (src/mongo/shell/utils.js:37:15)
            at doassert (src/mongo/shell/assert.js:6:5)
            at Function.assert.soon (src/mongo/shell/assert.js:110:60)
            at Function.sh.waitForDLock (src/mongo/shell/utils_sh.js:156:12)
            at Function.sh.waitForBalancerOff (src/mongo/shell/utils_sh.js:224:12)
            at Function.sh.waitForBalancer (src/mongo/shell/utils_sh.js:254:12)
            at Function.sh.stopBalancer (src/mongo/shell/utils_sh.js:126:8)
            at (shell):1:4
        Balancer still may be active, you must manually verify this is not the case using the config.changelog collection.
        Thu Aug 21 23:01:13.142 assert.soon failed: function (){
                var lock = db.getSisterDB( "config" ).locks.findOne({ _id : lockId })
         
                if( state == false ) return ! lock || lock.state == 0
                if( state == true ) return lock && lock.state == 2
                if( state == undefined ) return (beginTS == undefined && lock) ||
                                                (beginTS != undefined && ( !lock || lock.ts + "" != beginTS + "" ) )
                else return lock && lock.state == state
            }, msg:Waited too long for lock balancer to unlock at src/mongo/shell/utils_sh.js:228

        I also see the following:

        > use config
        switched to db config
        > db.locks.find({state:2}, {_id:1, why:1})
        {
          "_id": "balancer",
          "why": "doing balance round"
        }

      • a collection lock may get stuck, resulting in the balancer failing to move chunks. This leads to errors in the mongod.log files:

        Thu Aug 21 23:26:35.362 [conn5] about to log metadata event: { _id: "hal.local-2014-08-21T22:26:35-53f6721b510fb7b6210e5146", server: "hal.local", clientAddr: "192.168.1.129:53350", time: new Date(1408659995362), what: "moveChunk.from", ns: "foo.bar", details: { min: { _id: 2.0 }, max: { _id: 3.0 }, step1 of 6: 0, note: "aborted" } }

        Similarly the changelog collection shows aborted messages:

        > db.changelog.find({}, {"details.note":1}).sort({time:-1}).limit(1)
        {
          "_id": "hal.local-2014-08-21T22:37:18-53f6749e510fb7b6210e51b1",
          "details": {
            "note": "aborted"
          },
          "time": ISODate("2014-08-21T22:37:18.137Z")
        }

        The locks collection also shows entries similar to:

        > db.locks.find({state:2}, {_id:1, why:1})
        {
          "_id": "foo.bar",
          "why": "migrate-{ _id: 2.0 }"
        }

      The behavior in 2.4.x vs 2.6.x is a little different:

      • 2.4.x seems quite susceptible to the problem and it can be triggered quite easily using the repro steps provided.
      • 2.6.x seems to recover from stale locks shortly after the restart of the config server in most instances but there have been rare occasions where it also maintains a lock after the restart. When there are stale locks in 2.6.x it typically relates to just the balancer lock.

      While my reproduction steps involve taking down the config server it is also likely that the same problem could occur if there are network issues between cluster members and the config server

        Attachments

          Issue Links

            Activity

              People

              • Votes:
                0 Vote for this issue
                Watchers:
                6 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: