  1. Core Server
  2. SERVER-22290

endless "moveChunk failed, because there are still n deletes from previous migration"


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Incomplete
    • Affects Version/s: 2.6.10
    • Fix Version/s: None
    • Component/s: Sharding
    • Labels:
      None
    • Operating System:
      ALL
    • Steps To Reproduce:
      Create 1,000 new empty chunks and try to move them among all shards.

      Description

      We have several sharded clusters running MongoDB v2.6.10. We regularly pre-split so that new documents are inserted evenly across all shards. The balancer is always off because we prefer to distribute our documents manually depending on the hardware, since not all shards have the same amount of RAM.

      To pre-split, we create new chunks with sh.splitAt. Once hundreds or thousands of new empty chunks have been created, we move them from the origin shard to the other shards, distributing them evenly, using sh.moveChunk. This should be a very quick operation because the chunks being moved are empty.
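
The pre-split workflow described above could be sketched roughly as follows. This is a minimal illustration, not the script used for this report: the function name `presplitAndDistribute` and the inputs `ns`, `splitPoints`, and `shards` are hypothetical, and plain round-robin stands in for the RAM-dependent distribution, which is not specified here. The `sh` helper is passed in as an argument so the same function can be used in the mongo shell (with the global `sh`) or against a stub.

```javascript
// Sketch: pre-split a collection and spread the new empty chunks round-robin.
// `ns`, `splitPoints`, and `shards` are hypothetical inputs, not values from
// this report. `sh` is the mongo shell sharding helper (or a stub).
function presplitAndDistribute(sh, ns, splitPoints, shards) {
  var results = [];
  splitPoints.forEach(function (point, i) {
    // Split the chunk containing `point` so that a new chunk begins at `point`.
    sh.splitAt(ns, point);
    // Simple round-robin choice of the destination shard.
    var target = shards[i % shards.length];
    // Move the chunk that contains `point` to the chosen shard.
    var res = sh.moveChunk(ns, point, target);
    results.push({ point: point, to: target, result: res });
  });
  return results;
}
```

In the mongo shell this might be called as `presplitAndDistribute(sh, "mydb.mycoll", [{ _id: 1000 }, { _id: 2000 }], ["shard0000", "shard0001"])`, where the namespace, split points, and shard names are placeholders.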

      From time to time we encounter the following error. The bigger the cluster, the more often the error seems to happen.

      {
              "cause" : {
                      "cause" : {
                              "ok" : 0,
                              "errmsg" : "can't accept new chunks because  there are still 8 deletes from previous migration"
                      },
                      "ok" : 0,
                      "errmsg" : "moveChunk failed to engage TO-shard in the data transfer: can't accept new chunks because  there are still 8 deletes from previous migration"
              },
              "ok" : 0,
              "errmsg" : "move failed"
      }
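
Because the relevant message is nested two levels deep in `cause`, a small helper can make it easier to tell this failure apart from other moveChunk errors when scripting the distribution. The following is only a sketch: the function name `pendingDeletesFromError` is hypothetical, and the pattern it matches is taken from the error text shown above, which may differ in other server versions.

```javascript
// Sketch: walk the nested `cause` documents of a failed moveChunk result and
// return the number of deletes the TO-shard reports as still pending.
// Returns null when the innermost error message does not match.
function pendingDeletesFromError(result) {
  var doc = result;
  // Descend to the innermost `cause` document.
  while (doc && doc.cause && typeof doc.cause === "object") {
    doc = doc.cause;
  }
  var msg = (doc && doc.errmsg) || "";
  // Pattern copied from the error text in this report.
  var m = /there are still (\d+) deletes from previous migration/.exec(msg);
  return m ? parseInt(m[1], 10) : null;
}
```

A script could use the returned count to decide whether to skip the affected TO-shard for now or to stop and investigate.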

      The error may also occur after all shards have already received empty chunks, that is, after they have already accepted new chunks. However, a few seconds later they refuse new chunks, reporting that "there are still n deletes from previous migration", even though all previously received chunks were empty. This seems illogical to us. Can you explain or fix it?

      The only workaround we have found so far is to step down the primary of the TO-shard. However, if all 3 replica set members of the TO-shard throw the same error, we need to restart a secondary and get it elected primary. Then we are able to continue distributing new empty chunks, until the next "can't accept new chunks" error arrives.

      Please also see SERVER-14047, which I created for the same problem. This time, however, it does not seem to be related to noTimeoutCursors, because those were killed by restarting the server(s). Also, the shards had already accepted new chunks. They stop accepting new chunks out of the blue with an illogical error message.
