Core Server / SERVER-2985

Rebalancing too slow and moveChunk is blocked by balancer lock

    • Type: Question
    • Resolution: Incomplete
    • Priority: Critical - P2
    • None
    • Affects Version/s: 1.8.1
    • Component/s: Admin, Sharding
    • Environment:
      OS: CentOS 5.4. Host hardware: dual-socket Nehalem, 4 cores, 36 GB memory, 24 x 1 TB disks in a RAID 10 configuration. The new shard has 64 GB of memory and 12 x 300 GB disks in a RAID 10 configuration.

      We have added a new shard to a 4-shard cluster, making it 5 shards. The cluster is under a very light workload. Watching the balancer, it appears that it will take 2-3 days to finish rebalancing the shards.

      > db.printShardingStatus();
      --- Sharding Status ---
      sharding version: { "_id" : 1, "version" : 3 }

      shards:
      {
      "_id" : "repset_a",
      "host" : "repset_a/lmdb-m03.mail.aol.com:7312,lmdb-d02.mail.aol.com:7312,lmdb-d01.mail.aol.com:7312"
      }
      {
      "_id" : "repset_b",
      "host" : "repset_b/lmdb-d05.mail.aol.com:7312,lmdb-m06.mail.aol.com:7312,lmdb-d04.mail.aol.com:7312"
      }
      {
      "_id" : "repset_c",
      "host" : "repset_c/lmdb-d03.mail.aol.com:7312,lmdb-m02.mail.aol.com:7312,lmdb-m01.mail.aol.com:7312"
      }
      {
      "_id" : "repset_d",
      "host" : "repset_d/lmdb-d06.mail.aol.com:7312,lmdb-m05.mail.aol.com:7312,lmdb-m04.mail.aol.com:7312"
      }
      {
      "_id" : "repset_e",
      "host" : "repset_e/lmdb-d08.mail.aol.com:7312,lmdb-m09.mail.aol.com:7312,lmdb-d07.mail.aol.com:7312"
      }
      databases:
      { "_id" : "admin", "partitioned" : false, "primary" : "config" }
      { "_id" : "MigOidDB", "partitioned" : true, "primary" : "repset_a" }

      MigOidDB.MigOidCol chunks:
      repset_e 205
      repset_c 1283
      repset_a 1283
      repset_d 1283
      repset_b 1283
      too many chunksn to print, use verbose if you want to force print

      { "_id" : "test", "partitioned" : false, "primary" : "repset_a" }
      { "_id" : "local", "partitioned" : false, "primary" : "repset_a" }
      { "_id" : "MigOidCol", "partitioned" : false, "primary" : "repset_a" }
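
For a rough sense of scale, the chunk counts reported above can be turned into an estimate of how many migrations the balancer still has to perform. This is only a sketch: it assumes the end state is an even split across all five shards (the real balancer may stop earlier, once the imbalance falls under its threshold), and the helper names `estimateRebalance` and `secondsPerMigration` are made up for illustration.

```javascript
// Chunk counts copied from the printShardingStatus() output above.
const chunkCounts = {
  repset_a: 1283,
  repset_b: 1283,
  repset_c: 1283,
  repset_d: 1283,
  repset_e: 205,
};

// Hypothetical helper: assumes the target is an even split of all chunks.
function estimateRebalance(counts, newShard) {
  const shards = Object.keys(counts);
  const total = shards.reduce((sum, name) => sum + counts[name], 0);
  const targetPerShard = Math.floor(total / shards.length);
  // Chunks the new shard still has to receive to reach the even-split target.
  const chunksToMove = Math.max(0, targetPerShard - counts[newShard]);
  return { total, targetPerShard, chunksToMove };
}

// Hypothetical helper: average time budget per migration if the whole
// rebalance takes the given number of days.
function secondsPerMigration(chunksToMove, days) {
  return (days * 24 * 60 * 60) / chunksToMove;
}

const estimate = estimateRebalance(chunkCounts, "repset_e");
console.log(estimate);
console.log(secondsPerMigration(estimate.chunksToMove, 2).toFixed(0), "s per chunk at 2 days");
console.log(secondsPerMigration(estimate.chunksToMove, 3).toFixed(0), "s per chunk at 3 days");
```

Under these assumptions, the 2-3 day estimate corresponds to several minutes per chunk migration on average, which is the figure to compare against what the balancer log actually shows.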

      We have tried using moveChunk to speed up the process, but the balancer holds a metadata lock on the collection, so a manual moveChunk is rejected.

      > db.adminCommand({moveChunk : "MigOidDB.MigOidCol", find : {_id : "buggzeeann_30324171"}, to : "repset_e"});
      {
      "cause" : {
      "who" : {
      "_id" : "MigOidDB.MigOidCol",
      "process" : "lmdb-d02.mail.aol.com:1303133674:1828878975",
      "state" : 1,
      "ts" : ObjectId("4db174cf7a18929e98aa4f5b"),
      "when" : ISODate("2011-04-22T12:30:07.322Z"),
      "who" : "lmdb-d02.mail.aol.com:1303133674:1828878975:conn34711:1543737603",
      "why" : "migrate-{ _id: \"ballc21_28406862\" }"
      },
      "errmsg" : "the collection's metadata lock is taken",
      "ok" : 0
      },
      "ok" : 0,
      "errmsg" : "move failed"
      }
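
The failed response above already says who holds the lock: the nested `cause.who` document names the namespace, the holding connection, and a `why` of `migrate-...`, which suggests another chunk migration on the same collection was in progress at the time. A minimal sketch of pulling those fields out of such a response follows. The helper name `lockHolder` is hypothetical, and the shell-only `ObjectId` and `ISODate` values are written as plain strings so the snippet runs outside the mongo shell.

```javascript
// Response shape copied from the failed moveChunk above.
// ObjectId(...) and ISODate(...) are replaced by plain strings (assumption,
// made only so this runs in a plain JavaScript runtime).
const moveResult = {
  cause: {
    who: {
      _id: "MigOidDB.MigOidCol",
      process: "lmdb-d02.mail.aol.com:1303133674:1828878975",
      state: 1,
      ts: "4db174cf7a18929e98aa4f5b",
      when: "2011-04-22T12:30:07.322Z",
      who: "lmdb-d02.mail.aol.com:1303133674:1828878975:conn34711:1543737603",
      why: 'migrate-{ _id: "ballc21_28406862" }',
    },
    errmsg: "the collection's metadata lock is taken",
    ok: 0,
  },
  ok: 0,
  errmsg: "move failed",
};

// Hypothetical helper: returns null when the command succeeded or failed for
// some other reason, otherwise a summary of the current lock holder.
function lockHolder(result) {
  if (result.ok === 1) return null;
  const cause = result.cause || {};
  if (!/metadata lock is taken/.test(cause.errmsg || "")) return null;
  const who = cause.who || {};
  return {
    namespace: who._id,
    heldBy: who.who,
    since: who.when,
    reason: who.why,
  };
}

const holder = lockHolder(moveResult);
console.log(holder);
```

A caller could use the returned `reason` to tell an in-flight migration apart from some other lock holder before deciding whether to retry the manual move.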

            Assignee:
            Unassigned
            Reporter:
            john.schulz@teamaol.com John Schulz
            Votes:
            0
            Watchers:
            0
