[SERVER-14047] endless "moveChunk failed, because there are still n deletes from previous migration" Created: 26/May/14 Updated: 10/Feb/16 Resolved: 08/Jul/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.6.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kay Agahd | Assignee: | Thomas Rueckstiess |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | 64 Bit, wheezy |
| Issue Links: | |
| Operating System: | Linux |
| Participants: | |
| Description |
|
Our mongodb cluster was unable to move chunks for 2 days until we restarted the whole cluster. The logs always stated the same error: ["moveChunk failed to engage TO-shard in the data transfer: can't accept new chunks because there are still 81 deletes from previous migration"]
|
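(For illustration - a minimal mongo shell sketch of how this error surfaces when a chunk is moved by hand; the namespace "offerStore.offers" and the find query are placeholders, not taken from the ticket:)

```
// Manually request a chunk migration to the affected shard. When the
// TO-shard still has pending range deletions, the command fails with
// the same error the balancer logs.
db.getSiblingDB("admin").runCommand({
    moveChunk: "offerStore.offers",   // placeholder sharded namespace
    find: { _id: 12345 },             // any document inside the chunk
    to: "offerStoreIT2"               // destination shard (from the ticket)
});
// Expected failure shape:
// { ok: 0, errmsg: "moveChunk failed to engage TO-shard in the data
//   transfer: can't accept new chunks because there are still 81 deletes
//   from previous migration" }
```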
| Comments |
| Comment by luzi [ 16/Oct/14 ] |
|
hi, thanks. |
| Comment by Thomas Rueckstiess [ 08/Jul/14 ] |
|
Hi Kay, It looks like Siyuan and Randolph provided the answers you were looking for, and the desired features are already covered by existing tickets, so I'm closing this ticket now. If you have further support questions, please post on the mongodb-user group (http://groups.google.com/group/mongodb-user) or Stack Overflow with the mongodb tag. Regards, Thomas |
| Comment by Kay Agahd [ 19/Jun/14 ] |
|
Thanks Randolph for the helpful info. I just voted for the ticket you linked. |
| Comment by Randolph Tan [ 19/Jun/14 ] |
Unfortunately, the only option for the moment is to restart the mongod to make it 'forget' about the cursors.
The server can't tell whether a no-timeout cursor is actually inactive or whether the client who opened it is still interested in it, so it has to be manually killed by the client who created it (otherwise, the client should have used a cursor with a timeout, which is 10 minutes by default). On the other hand, you might find this ticket helpful -
Currently none. The server also uses no-timeout cursors internally, so this can be tricky. In the meantime, the initial commit from |
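(Illustrative mongo shell sketch of the timeout behaviour described above; "test.coll" is a placeholder namespace:)

```
// A plain cursor is reaped by the server after ~10 minutes of inactivity.
var normal = db.getSiblingDB("test").coll.find();

// Opting out of the timeout makes the client responsible for cleanup:
// if it never exhausts or closes the cursor, the cursor survives until
// the mongod restarts - exactly what blocks the rangeDeleter here.
var noTimeout = db.getSiblingDB("test").coll.find()
                  .addOption(DBQuery.Option.noTimeout);
noTimeout.close();   // explicitly release the server-side cursor
```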
| Comment by Kay Agahd [ 18/Jun/14 ] |
|
Randolph, ok, I understand. I thought the no-timeout cursors were created by the process of chunk migration, where I wouldn't have any chance to know which cursors were created. Since no-timeout cursors are not bound to a socket/connection, they will live forever if the application dies before reaching the end of the cursor's result set. Should I open a feature request to be able to delete inactive no-timeout cursors, or is it planned already? Does mongodb have an option to forbid no-timeout cursors? |
| Comment by Randolph Tan [ 18/Jun/14 ] |
|
Hi, The drivers should know the cursor ids since they use them to call getMore - so you can indirectly close a cursor by calling close on the cursor object (not all drivers provide this API though). I am not familiar with your setup, but this option is most likely harder in practice than just restarting the server. The only server you need to restart is the one waiting for open cursors, which in your case is offerStoreIT2. |
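(A sketch of the client-side discipline this implies, in mongo shell syntax - the pattern, not a specific driver API; "test.coll" is a placeholder:)

```
// The shell/driver tracks the cursor id internally for getMore, so
// close() can kill the server-side cursor even though the application
// never sees the id. Always pair noTimeout cursors with a finally-close.
var cur = db.getSiblingDB("test").coll.find()
            .addOption(DBQuery.Option.noTimeout);
try {
    while (cur.hasNext()) {
        printjson(cur.next());
    }
} finally {
    cur.close();   // releases the cursor even if iteration throws
}
```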
| Comment by Kay Agahd [ 18/Jun/14 ] |
|
Randolph, what do you mean by "restart the server"? Which server(s) do I have to restart? The router(s), just one primary, the whole replSet (which one?), all primaries, all nodes of all replSets, or all three config servers? |
| Comment by Kay Agahd [ 18/Jun/14 ] |
|
Randolph, how can I close a cursor when I don't know the cursor ID? |
| Comment by Randolph Tan [ 16/Jun/14 ] |
|
Hi, There is currently no direct way to know the cursor IDs. The work on |
| Comment by Kay Agahd [ 14/Jun/14 ] |
|
siyuan.zhou@10gen.com, indeed, we see "rangeDeleter waiting for n cursors..." in our logs, where n is never below 10, so it's waiting forever. Also, we see the stats of open cursors (many of them are no-timeout cursors) but we don't see how to kill these cursors since we don't have their IDs. So my question is still the same: how do we unblock the deletion from the previous migration? |
| Comment by Siyuan Zhou [ 11/Jun/14 ] |
|
If you are using v2.6.1, the logs from the primary should be enough to observe the open cursors blocking migration cleanup, like "rangeDeleter waiting for XXX cursors in <XXX namespace> to finish". currentOp() cannot provide the information about cursors, but serverStatus() gives back the stats of open cursors, especially those with no timeout. We are working on improving the logs and stats of migration cleanup. You can follow |
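(Sketch of reading those stats in the mongo shell; the field names are assumed from the 2.6 serverStatus "cursors" section:)

```
// Run on the shard primary: counts of open cursors, including how many
// were opened with the noTimeout option.
var c = db.serverStatus().cursors;
print("totalOpen:      " + c.totalOpen);
print("totalNoTimeout: " + c.totalNoTimeout);
print("timedOut:       " + c.timedOut);
```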
| Comment by Kay Agahd [ 05/Jun/14 ] |
|
Thank you very much Siyuan for coming back. Do you need the logs from both the primary and secondaries, or only from the primary? Shouldn't we be able to find all open cursors with high secs_running values by executing db.currentOp(), so we could kill them? However, I couldn't find any such operations. How should we proceed? Just restarting all servers is not a viable solution. Thanks for your help! |
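(Sketch of the currentOp() search described here - note that, per Siyuan's reply above, idle cursors do not appear in currentOp(), which is why this search comes up empty; the one-hour threshold is arbitrary:)

```
// List active operations running longer than an hour.
// db.currentOp(true) also includes idle connections and system ops.
db.currentOp(true).inprog.forEach(function (op) {
    if (op.secs_running && op.secs_running > 3600) {
        printjson({ opid: op.opid, op: op.op,
                    ns: op.ns, secs_running: op.secs_running });
        // a stuck op could then be killed with db.killOp(op.opid)
    }
});
```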
| Comment by Siyuan Zhou [ 04/Jun/14 ] |
|
Hey kay.agahd@idealo.de, The chunk deletion on the FROM-shard is asynchronous. In your case, offerStoreIT2 has 81 chunks pending deletion. It is not allowed to move chunks to this shard while the shard is waiting for deletion, since we may lose the newly migrated data during the deletion from the previous migration. Usually, the deletion should be very fast, but an open cursor at the time of deletion can block it. If the cursor has no timeout, the deletion will be blocked until the cursor gets closed or the mongod restarts. To help us diagnose this issue, it would be very helpful if we could have the logs from offerStoreIT2 covering these 2 days. Thanks, |
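(Sketch of grabbing recent log lines without filesystem access, using the getLog admin command on the shard primary and filtering for the rangeDeleter message Siyuan quotes above:)

```
// getLog returns the most recent in-memory log lines from this mongod.
var lines = db.adminCommand({ getLog: "global" }).log;
lines.filter(function (l) { return /rangeDeleter/.test(l); })
     .forEach(print);
```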
| Comment by Kay Agahd [ 04/Jun/14 ] |
|
Is there any progress on this issue? |