Type: Bug
Resolution: Incomplete
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 2.6.10
Component/s: Sharding
Labels:
None

Operating System:
ALL
Steps To Reproduce:
Create 1,000 new empty chunks and try to move them among all shards.
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

We have several sharded clusters running MongoDB v2.6.10. We regularly pre-split so that new documents are inserted evenly across all shards. The balancer is always off because we prefer to distribute our documents manually, depending on the hardware, since not all shards have the same amount of RAM.

To pre-split, we create new chunks with sh.splitAt. Once hundreds or thousands of new empty chunks have been created, we move them from the origin shard to the other shards with sh.moveChunk, distributing them evenly. This should be a very quick operation because the chunks being moved are empty.
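The pre-split workflow described above can be sketched roughly as follows. This is only an illustration under assumptions: the namespace `mydb.mycoll`, the `_id` split points, and the shard names are hypothetical placeholders, and the plan is a plain round-robin rather than the RAM-dependent distribution mentioned earlier. `buildPlan` and `applyPlan` are helper names invented for this sketch; only `sh.splitAt(namespace, middle)` and `sh.moveChunk(namespace, find, to)` are mongo shell helpers.

```javascript
// Build a round-robin plan: one entry per split point, cycling through the shards.
// Split points and shard names are hypothetical placeholders.
function buildPlan(splitPoints, shards) {
  return splitPoints.map(function (point, i) {
    return { splitPoint: point, to: shards[i % shards.length] };
  });
}

// Apply the plan. `shell` must expose splitAt(ns, middle) and
// moveChunk(ns, find, to), which is the argument order of the mongo shell's `sh`.
function applyPlan(shell, ns, plan) {
  var results = [];
  plan.forEach(function (step) {
    // Create a new chunk whose lower bound is the split point.
    shell.splitAt(ns, step.splitPoint);
    // Move the chunk containing the split point to its target shard.
    results.push(shell.moveChunk(ns, step.splitPoint, step.to));
  });
  return results;
}

var plan = buildPlan(
  [{ _id: 1000 }, { _id: 2000 }, { _id: 3000 }],
  ["shard0000", "shard0001"]
);
// In the mongo shell, connected to a mongos (hypothetical namespace):
//   applyPlan(sh, "mydb.mycoll", plan);
```

Passing the shell helper in as an argument keeps the planning logic testable outside a cluster.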

From time to time we encounter the following error. The bigger the cluster, the more often the error seems to happen.

{
        "cause" : {
                "cause" : {
                        "ok" : 0,
                        "errmsg" : "can't accept new chunks because  there are still 8 deletes from previous migration"
                },
                "ok" : 0,
                "errmsg" : "moveChunk failed to engage TO-shard in the data transfer: can't accept new chunks because  there are still 8 deletes from previous migration"
        },
        "ok" : 0,
        "errmsg" : "move failed"
}
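A script that drives the moves could recognize this particular failure by walking the nested "cause" documents. The following is a minimal sketch based only on the error shape shown above; `pendingDeletes` is a hypothetical helper name, and the regular expression assumes the message wording stays exactly as in this output.

```javascript
// Return the number of pending deletes reported anywhere in a moveChunk
// result, or null if the result does not contain that error.
function pendingDeletes(result) {
  var node = result;
  while (node && typeof node === "object") {
    var match = /still (\d+) deletes from previous migration/.exec(node.errmsg || "");
    if (match) {
      return parseInt(match[1], 10);
    }
    // Descend into the nested cause, if any.
    node = node.cause;
  }
  return null;
}

// Example using the error document from this report (abbreviated).
var failed = {
  cause: {
    cause: {
      ok: 0,
      errmsg: "can't accept new chunks because  there are still 8 deletes from previous migration"
    },
    ok: 0,
    errmsg: "moveChunk failed to engage TO-shard in the data transfer: can't accept new chunks because  there are still 8 deletes from previous migration"
  },
  ok: 0,
  errmsg: "move failed"
};
var count = pendingDeletes(failed); // 8
```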

The error may also occur after all shards have already received empty chunks, that is, after they have already accepted new chunks. A few seconds later, however, they refuse new chunks, reporting that "there are still n deletes from previous migration" even though every previously received chunk was empty. This seems very illogical to us. Can you explain or fix it?

The only workaround we have found so far is to step down the primary of the TO-shard. However, if all 3 replica set members of the TO-shard throw the same error, we need to restart a secondary and get it elected primary; we can then continue distributing new empty chunks until the next "can't accept new chunks" error arrives.

Please also see SERVER-14047, which I created for the same problem. This time, however, it does not seem to be related to noTimeoutCursors, because those were killed by restarting the server(s). Also, the shards had already accepted new chunks. They stop accepting new chunks out of the blue, with an illogical error message.

related to

SERVER-14047 endless "moveChunk failed, because there are still n deletes from previous migration"

Closed

Assignee: Kelsey Schubert
Reporter: Kay Agahd
Participants: Kay Agahd, Kelsey Schubert, Ramon Fernandez Marina
Votes: 0
Watchers: 5

Created: Jan 25 2016 01:55:07 PM UTC
Updated: Apr 13 2016 03:40:28 PM UTC
Resolved: Mar 28 2016 06:06:02 PM UTC
