Core Server / SERVER-24358

mongodb removeShard wedged if computed average record size is zero

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 3.0.7, 3.0.12
    • Component/s: Sharding
    • Labels: None
    • Operating System: ALL

      We currently have a setup with 3 shards that we are trying to consolidate down to 1 shard. After reviewing the documentation we kicked off the removeShard process; it seemed to go fine and spent the next several days draining chunks.

      It has been stuck for a long time at this point:

      db.getSiblingDB('admin').runCommand( { removeShard: "shard0001" } );
      {
        "msg": "draining ongoing",
        "state": "ongoing",
        "remaining": {
          "chunks": NumberLong("8"),
          "dbs": NumberLong("0")
        },
        "ok": 1
      }
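
      For context, the same remaining-chunk count can be read directly out of the config database (a quick check, assuming the shell is connected to a mongos):

      // Count the chunks still owned by the draining shard; this should match
      // the "remaining.chunks" figure reported by removeShard above.
      db.getSiblingDB("config").chunks.count({shard: "shard0001"});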
      

      After a lot of reading, it would seem that this is because some of the chunks are indivisible or too big. Looking at the output of

      db.chunks.find({shard: "shard0001"})
      

      we can see that the remaining chunks are all marked as jumbo, which makes sense from the docs. However, the following script shows that they aren't actually big, just possibly indivisible:

      var key = {identifier: 1, start: 1, end: 1};  // shard key for interface.data_3600, per the chunk bounds above
      db.getSiblingDB("config").chunks.find({ns: "interface.data_3600", shard: "shard0001"}).forEach(function(chunk) {
          var ds = db.getSiblingDB("interface").runCommand({datasize: chunk.ns, keyPattern: key, min: chunk.min, max: chunk.max});
          print("Chunk: " + chunk._id + " has a size of " + ds.size + ", and includes " + ds.numObjects + " objects (took " + ds.millis + "ms)");
      });
      

      Output:

      Chunk: interface.data_3600-identifier_"eb680bf0edff481d8029cda660b46438157ec1af94850abf62bda93a6989f0c7"start_1443600000end_1447200000 has a size of 0, and includes 0 objects (took 364ms)
      Chunk: interface.data_3600-identifier_"f3bd3220e046b1093b1c976edf4bbdd0a80e11d3a1c08af483654e1929a4617b"start_1447200000end_1450800000 has a size of 0, and includes 0 objects (took 67ms)
      Chunk: interface.data_3600-identifier_"f58989626ca054a08a89291eb888ad55997cc44ede5a55bf2723f2a9a6d7dadf"start_1447200000end_1450800000 has a size of 1319889, and includes 4 objects (took 290ms)
      Chunk: interface.data_3600-identifier_"f608484cf555af6a1dcd6ef85e24c0b4ca9880b2c8fd459beedd9a6b158bac31"start_1447200000end_1450800000 has a size of 1715344, and includes 5 objects (took 193ms)
      Chunk: interface.data_3600-identifier_"f85d1cb52fbda215a828f994aa7da465892c9aa9d6a423d3555797b9a394258e"start_1447200000end_1450800000 has a size of 1888620, and includes 4 objects (took 102ms)
      Chunk: interface.data_3600-identifier_"fc89195d311b61c3d38f38ee3d29dcb111a4a4752ca460eb07c305220f45a720"start_1450800000end_1454400000 has a size of 0, and includes 0 objects (took 0ms)
      Chunk: interface.data_3600-identifier_"ff26d379eb71a09c56438310289ad4a6556e7e4e4f3b6a14f1a04411c599ae66"start_1447200000end_1450800000 has a size of 1869036, and includes 4 objects (took 70ms)
      Chunk: interface.data_3600-identifier_"ffff18518b122a12d14aa6c229c96fb824ae0a04328a80efeb87585deda6920b"start_1450800000end_1454400000 has a size of 1525992, and includes 4 objects (took 49ms)
      

      These all seem to be empty or nearly empty chunks. There was a period where we deleted a lot of data, so they may be left over from that. However, I cannot seem to get these chunks to move and finish the draining.
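
      As a sanity check that it really is the jumbo flag (and not actual size) blocking the drain, the flag can be queried directly in config.chunks:

      // Every chunk still on shard0001 carries jumbo: true, even though the
      // datasize output above shows most of them as tiny or empty.
      db.getSiblingDB("config").chunks.find(
          {ns: "interface.data_3600", shard: "shard0001", jumbo: true}
      ).count();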

      Trying to perform a manual move resulted in the following:

      {
        "cause": {
          "chunkTooBig": true,
          "estimatedChunkSize": NumberLong("169236184162472618"),
          "ok": 0,
          "errmsg": "chunk too big to move"
        },
        "ok": 0,
        "errmsg": "move failed"
      }
      

      After some more reading, someone suggested deleting the documents in the chunk, clearing the jumbo flag manually (a sketch of that step follows the output below), and then moving it. Doing that and running a manual move now shows:

      sh.moveChunk("interface.data_3600", {"identifier": "f3bd3220e046b1093b1c976edf4bbdd0a80e11d3a1c08af483654e1929a4617b", "start": NumberLong("1447200000"), "end": NumberLong("1450800000")}, "shard0000");
      {
        "cause": {
          "ok": 0,
          "errmsg": "could not acquire collection lock for interface.data_3600 to migrate chunk [{ : MinKey },{ : MaxKey }) :: caused by :: Lock for migrating chunk [{ : MinKey }, { : MaxKey }) in interface.data_3600 is taken."
        },
        "ok": 0,
        "errmsg": "move failed"
      }
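
      For completeness, the manual flag clearing was along the lines of the sketch below. This is only a sketch: 3.0 has no supported command for clearing the jumbo flag, so it amounts to editing config.chunks directly, and it should only be attempted with the balancer stopped.

      // Stop the balancer so nothing races the manual edit.
      sh.stopBalancer();

      // Drop the jumbo marker for the (now empty) chunk we are trying to move;
      // the min.identifier value is the one from the moveChunk call above.
      db.getSiblingDB("config").chunks.update(
          {ns: "interface.data_3600",
           "min.identifier": "f3bd3220e046b1093b1c976edf4bbdd0a80e11d3a1c08af483654e1929a4617b",
           "min.start": 1447200000},
          {$unset: {jumbo: ""}}
      );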
      

      The lock in question seems to be perpetually stuck, and the regular balancer keeps logging the same message. In the interest of debugging I tried manually clearing the lock and re-running the command, at which point the migrate FROM shard segfaulted with this:

      2016-06-02T13:41:01.534+0000 I SHARDING [conn3] received moveChunk request: { moveChunk: "interface.data_3600", from: "tsds.bldc.grnoc.iu.edu:27026", to: "tsds.bldc.grnoc.iu.edu:27025", fromShard: "shard0001", toShard: "shard0000", min: { identifier: "eb680bf0edff481d8029cda660b46438157ec1af94850abf62bda93a6989f0c7", start: 14
      43600000, end: 1447200000 }, max: { identifier: "eb680bf0edff481d8029cda660b46438157ec1af94850abf62bda93a6989f0c7", start: 1447200000, end: 1450800000 }, maxChunkSizeBytes: 268435456, shardId: "interface.data_3600-identifier_"eb680bf0edff481d8029cda660b46438157ec1af94850abf62bda93a6989f0c7"start_1443600000end_1447200000", conf
      igdb: "tsds.bldc.grnoc.iu.edu:27019,tsds.bldc.grnoc.iu.edu:27020,tsds.bldc.grnoc.iu.edu:27021", secondaryThrottle: true, waitForDelete: false, maxTimeMS: 0, epoch: ObjectId('5592cb956e74ba75920e0bda') }
      2016-06-02T13:41:01.797+0000 I SHARDING [conn3] distributed lock 'interface.data_3600/tsds.bldc.grnoc.iu.edu:27026:1464874713:1439290629' acquired, ts : 5750376d7a625e027e304639
      2016-06-02T13:41:01.797+0000 I SHARDING [conn3] about to log metadata event: { _id: "tsds.bldc.grnoc.iu.edu-2016-06-02T13:41:01-5750376d7a625e027e30463a", server: "tsds.bldc.grnoc.iu.edu", clientAddr: "140.182.45.176:52095", time: new Date(1464874861797), what: "moveChunk.start", ns: "interface.data_3600", details: { min: { id
      entifier: "eb680bf0edff481d8029cda660b46438157ec1af94850abf62bda93a6989f0c7", start: 1443600000, end: 1447200000 }, max: { identifier: "eb680bf0edff481d8029cda660b46438157ec1af94850abf62bda93a6989f0c7", start: 1447200000, end: 1450800000 }, from: "shard0001", to: "shard0000" } }
      2016-06-02T13:41:01.799+0000 I SHARDING [conn3] remotely refreshing metadata for interface.data_3600, current shard version is 0|0||000000000000000000000000, current metadata version is 0|0||000000000000000000000000
      2016-06-02T13:41:01.822+0000 I SHARDING [conn3] collection interface.data_3600 was previously unsharded, new metadata loaded with shard version 4642|1||5592cb956e74ba75920e0bda
      2016-06-02T13:41:01.822+0000 I SHARDING [conn3] collection version was loaded at version 4642|1||5592cb956e74ba75920e0bda, took 23ms
      2016-06-02T13:41:01.822+0000 I SHARDING [conn3] moveChunk request accepted at version 4642|1||5592cb956e74ba75920e0bda
      2016-06-02T13:41:01.822+0000 F -        [conn3] Invalid operation at address: 0xdfa98e
      2016-06-02T13:41:01.828+0000 F -        [conn3] Got signal: 8 (Floating point exception).
      
       0xf758c9 0xf74f42 0xf7529e 0x3fda60f7e0 0xdfa98e 0xdfcef4 0x9cf824 0x9d07ad 0x9d14bb 0xba2f4a 0xab4830 0x7fbcfd 0xf26b9b 0x3fda607aa1 0x3fda2e893d
      ----- BEGIN BACKTRACE -----
      {"backtrace":[{"b":"400000","o":"B758C9"},{"b":"400000","o":"B74F42"},{"b":"400000","o":"B7529E"},{"b":"3FDA600000","o":"F7E0"},{"b":"400000","o":"9FA98E"},{"b":"400000","o":"9FCEF4"},{"b":"400000","o":"5CF824"},{"b":"400000","o":"5D07AD"},{"b":"400000","o":"5D14BB"},{"b":"400000","o":"7A2F4A"},{"b":"400000","o":"6B4830"},{"b"
      :"400000","o":"3FBCFD"},{"b":"400000","o":"B26B9B"},{"b":"3FDA600000","o":"7AA1"},{"b":"3FDA200000","o":"E893D"}],"processInfo":{ "mongodbVersion" : "3.0.7", "gitVersion" : "6ce7cbe8c6b899552dadd907604559806aa2e9bd", "uname" : { "sysname" : "Linux", "release" : "2.6.32-504.23.4.el6.x86_64", "version" : "#1 SMP Tue Jun 9 20:57:
      37 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "E359E52F4A5E2A16804BCA34965CC29081397CDF" }, { "b" : "7FFF3C0CD000", "elfType" : 3, "buildId" : "FF4CBAAE51A93124ED31C2B1386CE92FF24AEBC3" }, { "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "C56DD1B811FC0D9263248EBB3
      08C73FCBCD80FC1" }, { "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "40BEA6554E64FC0C3D5C7D0CD91362730515102F" }, { "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "FC4EFD7502ACB3B9D213D28272D15A165857AD5A" }, { "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "95159178F1A4A3DBDC7819F
      BEA2C80E5FCDD6BAC" }, { "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "29B61382141595ECBA6576232E44F2310C3AAB72" }, { "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "F07F2E7CF4BFB393CC9BBE8CDC6463652E14DB07" }, { "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "989FE3A42CA8CEBDCC185A743896F
      23A0CF537ED" }, { "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "246C3BAB0AB093AFD59D34C8CBF29E786DE4BE97" }, { "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "8E6FA4C4B0594C355C1B90C1D49990368C81A040" }, { "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "959C5E10A47EE8A633E7681B64B4B9
      F74E242ED5" }, { "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "54BA6B78A9220344E77463947215E42F0EABCC62" }, { "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "6797403AA5F8FAD8ADFF683478B45F528CE4FB0E" }, { "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "8CE28F280150E62296240E70ECA
      C64E4A57AB826" }, { "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "05733977F4E41652B86070B27A0CFC2C1EA7719D" }, { "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "5FA8E5038EC04A774AF72A9BB62DC86E1049C4D6" }, { "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "E3FA235F3BA3F776A01A18ECA7
      37C9890F445923" }, { "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "AF374BAFB7F5B139A0B431D3F06D82014AFF3251" }, { "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "C39D7FFB49DFB1B55AD09D1D711AD802123F6623" }, { "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "E6798A06BEE17CF102BBA44F
      D512FF8B805CEAF1" } ] }}
       mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf758c9]
       mongod(+0xB74F42) [0xf74f42]
       mongod(+0xB7529E) [0xf7529e]
       libpthread.so.0(+0xF7E0) [0x3fda60f7e0]
       mongod(_ZN5mongo17MigrateFromStatus16storeCurrentLocsEPNS_16OperationContextExRSsRNS_14BSONObjBuilderE+0x36E) [0xdfa98e]
       mongod(_ZN5mongo16MoveChunkCommand3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x1CD4) [0xdfcef4]
       mongod(_ZN5mongo12_execCommandEPNS_16OperationContextEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x34) [0x9cf824]
       mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_iPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xC1D) [0x9d07ad]
       mongod(_ZN5mongo12_runCommandsEPNS_16OperationContextEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x28B) [0x9d14bb]
       mongod(_ZN5mongo8runQueryEPNS_16OperationContextERNS_7MessageERNS_12QueryMessageERKNS_15NamespaceStringERNS_5CurOpES3_+0x77A) [0xba2f4a]
       mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xB10) [0xab4830]
       mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xDD) [0x7fbcfd]
       mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x34B) [0xf26b9b]
       libpthread.so.0(+0x7AA1) [0x3fda607aa1]
       libc.so.6(clone+0x6D) [0x3fda2e893d]
      -----  END BACKTRACE  -----
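
      For what it is worth, the crash is consistent with the chunk in question being completely empty: datasize reports 0 bytes and 0 objects for it (see the first line of the output earlier), so any computed average record size is zero and there is nothing sane to divide the chunk size by, which would explain the SIGFPE inside MigrateFromStatus::storeCurrentLocs. A shell-side illustration of the numbers (just the arithmetic, not the server code):

      // datasize for the chunk named in the crashing moveChunk request above.
      var ds = db.getSiblingDB("interface").runCommand({
          datasize: "interface.data_3600",
          keyPattern: {identifier: 1, start: 1, end: 1},
          min: {identifier: "eb680bf0edff481d8029cda660b46438157ec1af94850abf62bda93a6989f0c7",
                start: 1443600000, end: 1447200000},
          max: {identifier: "eb680bf0edff481d8029cda660b46438157ec1af94850abf62bda93a6989f0c7",
                start: 1447200000, end: 1450800000}
      });

      // Both size and numObjects come back as 0, so the derived average is 0.
      var avgRecSize = (ds.numObjects > 0) ? (ds.size / ds.numObjects) : 0;
      print("average record size for this chunk: " + avgRecSize);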
      

      Upon restarting the process, the lock goes straight back into a perpetually locked state:

      2016-06-02T13:59:58.908+0000 W SHARDING [conn3] could not acquire collection lock for interface.data_3600 to migrate chunk [{ : MinKey },{ : MaxKey }) :: caused by :: Lock for migrating chunk [{ : MinKey }, { : MaxKey }) in interface.data_3600 is taken
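
      For reference, the manual lock clearing mentioned earlier amounted to editing config.locks directly, roughly as sketched below. This is obviously not a supported operation, and the exact shape of the lock document is an assumption on my part; it is included only to show what was attempted.

      // Inspect the distributed lock entry for the collection.
      db.getSiblingDB("config").locks.find({_id: "interface.data_3600"}).pretty();

      // Force-release it by setting the state back to 0 (unlocked). Only done
      // with the balancer stopped and no migrations believed to be running.
      db.getSiblingDB("config").locks.update(
          {_id: "interface.data_3600"},
          {$set: {state: 0}}
      );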
      

      All processes involved have been restarted in case something was in a weird state that a restart would resolve. Our environment is MongoDB 3.0.7 on CentOS 6 (64-bit). At this point upgrading is not possible due to driver concerns, and we are trying to avoid significant downtime since this is a production service. The data set is fairly large (> 1 TB).

      Any help would be appreciated. If I can provide anything else I am more than happy to.

            Assignee:
            backlog-server-sharding (Backlog - Sharding Team)
            Reporter:
            daldoyle (Dan Doyle)
            Votes:
            0
            Watchers:
            6
