[SERVER-22290] endless "moveChunk failed, because there are still n deletes from previous migration" Created: 25/Jan/16 Updated: 13/Apr/16 Resolved: 28/Mar/16
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.6.10 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kay Agahd | Assignee: | Kelsey Schubert |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Operating System: | ALL |
| Steps To Reproduce: | Create 1,000 new empty chunks and try to move them among all shards. |
| Participants: | |
| Description |
We have several sharded clusters running MongoDB v2.6.10. We regularly pre-split so that all new documents are inserted evenly among all shards. The balancer is always off because we prefer to distribute our documents manually depending on the hardware (RAM), since not all shards have the same amount of RAM. In order to pre-split, we create new chunks with sh.splitAt. Once hundreds or thousands of new empty chunks have been created, we move them from the origin shard, evenly distributed, to the other shards using sh.moveChunk. This should be a very quick operation because the chunks to move are empty. From time to time we encounter the error quoted in the title of this ticket ("moveChunk failed, because there are still n deletes from previous migration"). The bigger the cluster, the more often the error seems to happen.
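For illustration, here is a minimal mongo shell sketch of the workflow described above, run against a mongos. The namespace, shard-key scheme, split points, and shard names are placeholders, not values from this cluster.

```js
// Minimal sketch of the pre-split / manual distribution workflow.
// Namespace, split points and shard names are illustrative placeholders.
sh.setBalancerState(false);          // balancer stays off; distribution is done manually

var ns = "mydb.mycoll";
var shards = ["shard0000", "shard0001", "shard0002"];
var startKey = 1000000, step = 1000; // hypothetical shard-key scheme

// Pre-split: create new, empty chunks at the upcoming key boundaries.
for (var i = 1; i <= 1000; i++) {
    sh.splitAt(ns, { _id: startKey + i * step });
}

// Move the empty chunks round-robin from the origin shard to all shards.
for (var i = 1; i <= 1000; i++) {
    sh.moveChunk(ns, { _id: startKey + i * step }, shards[i % shards.length]);
}
```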
The error may also occur after all shards have already received empty chunks, i.e. they have already accepted new chunks. However, a few seconds later they refuse new chunks, reporting that "there are still n deletes from previous migration", even though all previously received chunks were empty! This seems very illogical to us. Can you explain or fix it? The only workaround we have found so far is to step down the primary of the TO-shard. However, if all 3 replica set members of the TO-shard throw the same error, we need to restart a secondary, elect it primary, and then we are able to continue distributing new empty chunks, until the next "can't accept new chunks" error arrives. Please see also the linked issue.
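A sketch of the workaround described above, assuming a direct connection to the primary of the refusing TO-shard; the 60-second step-down window is an arbitrary example value, and the currentOp filter is only a rough guess at how the pending delete work might show up.

```js
// Optionally look for left-over migration clean-up work first; the exact
// shape of these operations varies by version, so this is only a rough filter.
db.currentOp(true).inprog.filter(function (op) {
    return /delete/i.test(op.msg || "") || /delete/i.test(op.desc || "");
});

// Force an election on the TO-shard primary so a fresh primary (without the
// stuck "pending deletes" state) can accept chunk migrations again.
rs.stepDown(60);
```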
| Comments |
| Comment by Kay Agahd [ 13/Apr/16 ] |
Hi ramon.fernandez, as I already said, we need to pre-split only every few months, so I could not give you feedback earlier. Today we pre-split again without any problem.
| Comment by Ramon Fernandez Marina [ 28/Mar/16 ] |
kay.agahd@idealo.de, if no further information can be provided for another 2.5 months, we're going to close this ticket for the time being. When you have additional information, please post it here and we'll reopen the ticket for further investigation. Note that we believe the root cause of this behavior is the same as for the linked issue.

Regards,
| Comment by Kay Agahd [ 11/Mar/16 ] |
Yes, we will do so as soon as we pre-split again. We only need to pre-split roughly once every 3 months.
| Comment by Kelsey Schubert [ 10/Mar/16 ] |
Sorry for the delay in getting back to you. If this is still an issue for you, can you please upload the logs from the primary of the donor shard and the primary of the target shard from the time you experience this issue during chunk migration? This information will allow us to verify our explanation of the root cause of this behavior.

Thank you,
| Comment by Kay Agahd [ 25/Jan/16 ] |
By the way, the waitForDelete option did not help either.
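For reference, the waitForDelete behavior mentioned here is exposed on the raw moveChunk command (as _waitForDelete) rather than on the sh.moveChunk() helper. A minimal sketch, with the namespace, find document, and target shard as placeholders:

```js
// moveChunk with _waitForDelete: the migration does not return until the
// donor shard has finished deleting the migrated range.
// Namespace, find document and target shard are illustrative placeholders.
db.adminCommand({
    moveChunk: "mydb.mycoll",
    find: { _id: 1000000 },
    to: "shard0001",
    _waitForDelete: true
});
```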