chunk migration stuck

XMLWordPrintableJSON

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • None
    • Affects Version/s: 2.4.4
    • Component/s: Sharding
    • None
    • Environment:
      Ubuntu
    • Linux
    • None
    • 3
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      Setup: 3 config servers, 3 mongod and 3 mongos.

      Last night I sharded 3 collections (hashed), one after the other. Documents seemed to move from the main shard to the others as time passed by. This morning, I checked if the chunk migrations have finished and it seemed like for the first two they have, but the third got stuck. I looked at 'sh.status()', and it showed this:

      abc.ankara_extractor
              shard key: { "_id" : "hashed" }
              chunks:
                      shard0000       35
                      shard0001       31
                      shard0002       32
      
      abc.ankara_parser
              shard key: { "_id" : "hashed" }
              chunks:
                      shard0000       49
                      shard0002       39
                      shard0001       39
      

      Not perfectly balanced, but whatever.

      Moving on to the third collection, nothing changed:

      abc.jakarta_companies
              shard key: { "_id" : "hashed" }
              chunks:
                      shard0000       143
      

      At this point I stopped all mongos instances, and started just one. It showed this in the log:

      Fri Sep 13 08:38:59.567 [Balancer]  ns: abc.ankara_parser going to move { _id: "abc.ankara_parser-_id_MinKey", lastmod: Timestamp 78000|1, lastmodEpoch: ObjectId('523220e66f82e5ddc385722d'), ns: "abc.ankara_parser", min: { _id: MinKey }, max: { _id: -9077305884563950534 }, shard: "shard0000" } from: shard0000 to: shard0001 tag []
      Fri Sep 13 08:38:59.570 [Balancer]  ns: abc.jakarta_companies going to move { _id: "abc.jakarta_companies-_id_MinKey", lastmod: Timestamp 1000|0, lastmodEpoch: ObjectId('523222126f82e5ddc3857253'), ns: "abc.jakarta_companies", min: { _id: MinKey }, max: { _id: -9094728535907602479 }, shard: "shard0000" } from: shard0000 to: shard0001 tag []
      Fri Sep 13 08:38:59.571 [Balancer] moving chunk ns: abc.ankara_parser moving ( ns:abc.ankara_parsershard: shard0000:mongod1.abc.com:27018lastmod: 78|1||000000000000000000000000min: { _id: MinKey }max: { _id: -9077305884563950534 }) shard0000:mongod1.abc.com:27018 -> shard0001:mongod2.abc.com:27018
      Fri Sep 13 08:38:59.656 [Balancer] moveChunk result: { who: { _id: "abc.ankara_parser", process: "ip-10-151-124-58:27018:1378994641:1905410576", state: 2, ts: ObjectId('5232610423a6193a37ce7a0a'), when: new Date(1379033348967), who: "ip-10-151-124-58:27018:1378994641:1905410576:conn751:1336024233", why: "migrate-{ _id: 7965667869890547469 }" }, ok: 0.0, errmsg: "the collection metadata could not be locked with lock migrate-{ _id: MinKey }" }
      Fri Sep 13 08:38:59.657 [Balancer] balancer move failed: { who: { _id: "abc.ankara_parser", process: "ip-10-151-124-58:27018:1378994641:1905410576", state: 2, ts: ObjectId('5232610423a6193a37ce7a0a'), when: new Date(1379033348967), who: "ip-10-151-124-58:27018:1378994641:1905410576:conn751:1336024233", why: "migrate-{ _id: 7965667869890547469 }" }, ok: 0.0, errmsg: "the collection metadata could not be locked with lock migrate-{ _id: MinKey }" } from: shard0000 to: shard0001 chunk:  min: { _id: MinKey } max: { _id: -9077305884563950534 }
      Fri Sep 13 08:38:59.657 [Balancer] moving chunk ns: abc.jakarta_companies moving ( ns:abc.jakarta_companiesshard: shard0000:mongod1.abc.com:27018lastmod: 1|0||000000000000000000000000min: { _id: MinKey }max: { _id: -9094728535907602479 }) shard0000:mongod1.abc.com:27018 -> shard0001:mongod2.abc.com:27018
      

      Looking at jakarta_companies.count() in the 3 mongods, shows:

      shard0000: 3895500
      shard0001: 27257
      shard0002: collection doesn't exist
      

      and it's been stuck like this since then.

      Other info that might be useful:

      mongos> db.currentOp()                                                                                                                                                                                      
      {
              "inprog" : [
                      {
                              "opid" : "shard0000:88205450",
                              "active" : true,
                              "secs_running" : 4510,
                              "op" : "query",
                              "ns" : "abc.jakarta_companies",
                              "query" : {
                                      "moveChunk" : "abc.jakarta_companies",
                                      "from" : "mongod1.abc.com:27018",
                                      "to" : "mongod2.abc.com:27018",
                                      "fromShard" : "shard0000",
                                      "toShard" : "shard0001",
                                      "min" : {
                                              "_id" : { "$minKey" : 1 }
                                      },
                                      "max" : {
                                              "_id" : NumberLong("-9094728535907602479")
                                      },
                                      "maxChunkSizeBytes" : NumberLong(67108864),
                                      "shardId" : "abc.jakarta_companies-_id_MinKey",
                                      "configdb" : "mongoc1.abc.com:27019,mongoc2.abc.com:27019,mongoc3.abc.com:27019",
                                      "secondaryThrottle" : true,
                                      "waitForDelete" : false
                              },
                              "client_s" : "10.151.90.78:33638",
                              "desc" : "conn755",
                              "threadId" : "0x7e3fbb5ca700",
                              "connectionId" : 755,
                              "waitingForLock" : false,
                              "msg" : "step3 of 6",
                              "numYields" : 0,
                              "lockStats" : {
                                      "timeLockedMicros" : {
                                              "r" : NumberLong(241859),
                                              "w" : NumberLong(0)
                                      },
                                      "timeAcquiringMicros" : {
                                              "r" : NumberLong(6),
                                              "w" : NumberLong(0)
                                      }
                              }
                      }
              ]
      }
      
      configsvr> db.locks.find()
      { "_id" : "configUpgrade", "process" : "ip-10-36-21-61:27017:1372080743:1804289383", "state" : 0, "ts" : ObjectId("51c84a67f095c43a2590802d"), "when" : ISODate("2013-06-24T13:32:23.851Z"), "who" : "ip-10-36-21-61:27017:1372080743:1804289383:mongosMain:846930886", "why" : "upgrading config database to new format v4" }
      { "_id" : "abc.scraped_website_resource", "process" : "ip-10-151-25-52:27018:1372842489:2092651466", "state" : 0, "ts" : ObjectId("51fe0eb2bd9a06e052152f3c"), "when" : ISODate("2013-08-04T08:20:02.526Z"), "who" : "ip-10-151-25-52:27018:1372842489:2092651466:conn15229:2046301240", "why" : "migrate-{ _id: MinKey }" }
      { "_id" : "abc.parsed_website_resource", "process" : "ip-10-190-202-133:27018:1378999289:1601254270", "state" : 0, "ts" : ObjectId("52321fdf0fe09cb3ce334466"), "when" : ISODate("2013-09-12T20:11:11.607Z"), "who" : "ip-10-190-202-133:27018:1378999289:1601254270:conn672:1819358977", "why" : "migrate-{ _id: -9221392181608325863 }" }
      { "_id" : "abc.website_catalog", "process" : "ip-10-190-202-133:27018:1372842483:2063823080", "state" : 0, "ts" : ObjectId("51f64ff6615675a179f8fbb1"), "when" : ISODate("2013-07-29T11:20:22.627Z"), "who" : "ip-10-190-202-133:27018:1372842483:2063823080:conn22035:1126107449", "why" : "migrate-{ _id: MinKey }" }
      { "_id" : "balancer", "process" : "ip-10-151-90-78:27017:1379061538:1804289383", "state" : 2, "ts" : ObjectId("5232cf22e84b73ca42025328"), "when" : ISODate("2013-09-13T08:38:58.442Z"), "who" : "ip-10-151-90-78:27017:1379061538:1804289383:Balancer:846930886", "why" : "doing balance round" }
      { "_id" : "abc.santiago_extractor", "process" : "ip-10-151-124-58:27018:1372842478:1247094076", "state" : 0, "ts" : ObjectId("51d97b45de466c4e9210a0d3"), "when" : ISODate("2013-07-07T14:29:25.296Z"), "who" : "ip-10-151-124-58:27018:1372842478:1247094076:conn16061:1645155834", "why" : "migrate-{ _id: MinKey }" }
      { "_id" : "abc.santiago_parser", "process" : "ip-10-190-202-133:27018:1372842483:2063823080", "state" : 0, "ts" : ObjectId("51d3ec4d615675a179f8ed66"), "when" : ISODate("2013-07-03T09:18:05.062Z"), "who" : "ip-10-190-202-133:27018:1372842483:2063823080:conn15192:1397128088", "why" : "split-{ _id: 6148914691236517204 }" }
      { "_id" : "abc.jakarta_contacts", "process" : "ip-10-151-124-58:27018:1372842478:1247094076", "state" : 0, "ts" : ObjectId("51fe51f1de466c4e9210c20a"), "when" : ISODate("2013-08-04T13:06:57.282Z"), "who" : "ip-10-151-124-58:27018:1372842478:1247094076:conn290924:508144576", "why" : "migrate-{ data.companyId: 2958941370527843639 }" }
      { "_id" : "abc.feature_results_test", "process" : "ip-10-151-2-120:27017:1377505487:1804289383", "state" : 0, "ts" : ObjectId("523217c68ff15d76d1012d9c"), "when" : ISODate("2013-09-12T19:36:38.624Z"), "who" : "ip-10-151-2-120:27017:1377505487:1804289383:conn412:1714636915", "why" : "drop" }
      { "_id" : "abc.ankara_extractor", "process" : "ip-10-151-124-58:27018:1378994641:1905410576", "state" : 0, "ts" : ObjectId("52325fbd23a6193a37ce7a02"), "when" : ISODate("2013-09-13T00:43:41.655Z"), "who" : "ip-10-151-124-58:27018:1378994641:1905410576:conn744:621698639", "why" : "migrate-{ _id: 9200307451568432088 }" }
      { "_id" : "abc.ankara_parser", "process" : "ip-10-151-124-58:27018:1378994641:1905410576", "state" : 0, "ts" : ObjectId("5232610423a6193a37ce7a0a"), "when" : ISODate("2013-09-13T00:49:08.967Z"), "who" : "ip-10-151-124-58:27018:1378994641:1905410576:conn751:1336024233", "why" : "migrate-{ _id: 7965667869890547469 }" }
      { "_id" : "abc.jakarta_companies", "process" : "ip-10-151-124-58:27018:1378994641:1905410576", "state" : 2, "ts" : ObjectId("5232cf1923a6193a37ce7a0f"), "when" : ISODate("2013-09-13T08:38:49.804Z"), "who" : "ip-10-151-124-58:27018:1378994641:1905410576:conn755:78850494", "why" : "migrate-{ _id: MinKey }" }
      

      Since then I've tried restarting mongos, the config servers, restarting the balancer (sh.stop/startBalancer). Nothing causes the migration to resume.

      How do I get out of this jam?

        1. mongoc.tar.gz
          1.86 MB
        2. mongoc.tar.gz
          1.86 MB
        3. mongos.log
          7 kB
        4. shard0000-log.gz
          582 kB
        5. shard0001-log
          274 kB

              Assignee:
              Unassigned
              Reporter:
              Idan Kamara
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Created:
                Updated:
                Resolved: