[SERVER-14047] endless "moveChunk failed, because there are still n deletes from previous migration" Created: 26/May/14  Updated: 10/Feb/16  Resolved: 08/Jul/14

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.6.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kay Agahd Assignee: Thomas Rueckstiess
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

64 Bit, wheezy


Issue Links:
Depends
depends on SERVER-3090 Add the ability to list open cursors Closed
Related
is related to SERVER-22290 endless "moveChunk failed, because th... Closed
is related to DOCS-4575 Warn about leaving immortal cursors o... Closed
Operating System: Linux
Participants:

 Description   

Our mongodb cluster was unable to move chunks for 2 days, until we restarted the whole cluster. The logs always showed the same error: ["moveChunk failed to engage TO-shard in the data transfer: can't accept new chunks because there are still 81 deletes from previous migration"]

mongos> sh.moveChunk("offerStore.offer", { _id: 540867465 }, "offerStoreIT2");
{
        "cause" : {
                "cause" : {
                        "ok" : 0,
                        "errmsg" : "can't accept new chunks because  there are still 81 deletes from previous migration"
                },
                "ok" : 0,
                "errmsg" : "moveChunk failed to engage TO-shard in the data transfer: can't accept new chunks because  there are still 81 deletes from previous migration"
        },
        "ok" : 0,
        "errmsg" : "move failed"
}

mongos> db.changelog.find().sort({time:-1})
{ "_id" : "s163-2014-05-25T21:59:45-538267d121bb6c9634f5a4c0", "server" : "s163", "clientAddr" : "172.16.66.17:56136", "time" : ISODate("2014-05-25T21:59:45.234Z"), "what" : "moveChunk.from", "ns" : "offerStore.offer", "details" : { "min" : { "_id" : NumberLong(540867465) }, "max" : { "_id" : NumberLong(541166222) }, "step 1 of 6" : 0, "step 2 of 6" : 286, "note" : "aborted" } }
{ "_id" : "s163-2014-05-25T21:59:45-538267d121bb6c9634f5a4bf", "server" : "s163", "clientAddr" : "172.16.66.17:56136", "time" : ISODate("2014-05-25T21:59:45.061Z"), "what" : "moveChunk.start", "ns" : "offerStore.offer", "details" : { "min" : { "_id" : NumberLong(540867465) }, "max" : { "_id" : NumberLong(541166222) }, "from" : "offerStoreIT", "to" : "offerStoreIT2" } }



 Comments   
Comment by luzi [ 16/Oct/14 ]

hi,
What about this problem? I have had to restart some mongods in my mongo cluster because of it many times!

thanks.
lu zi

Comment by Thomas Rueckstiess [ 08/Jul/14 ]

Hi Kay,

It looks like Siyuan and Randolph provided the answers you were looking for, and the desired features are already covered by SERVER-3090 and SERVER-13648, which are in the pipeline.

I'm closing this ticket now. If you have further support questions please post on the mongodb-user group (http://groups.google.com/group/mongodb-user) or Stack Overflow with the mongodb tag.

Regards,
Thomas

Comment by Kay Agahd [ 19/Jun/14 ]

Thanks Randolph for the helpful info. I just voted for SERVER-3090.

Comment by Randolph Tan [ 19/Jun/14 ]

> Since no-timeout cursors are not bound to a socket/connection, they will live forever if the application dies before reaching the end of the cursor's result set.
> How should we deal with this?

Unfortunately, the only option for the moment is to restart the mongod to make it 'forget' about the cursors.

> Should I open a feature request to be able to delete inactive no-timeout cursors or is it planned already?

The server can't tell whether a no-timeout cursor is actually inactive or whether the client that opened it is still interested in it, so it has to be killed manually by the client that created it (otherwise, the client should have used a cursor with a timeout, which is 10 minutes by default). On the other hand, you might find this ticket helpful - SERVER-3090.

> Does mongodb have an option to disallow no-timeout cursors?

Currently none. MongoDB also uses no-timeout cursors internally, so this can be tricky.

In the meantime, the initial commit from SERVER-13648 will display the cursor ids the cleanup process is waiting for.
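
To illustrate the point about timeouts above, here is a minimal sketch in the 2.6-era mongo shell, using the ticket's offerStore.offer collection; it is an illustrative assumption, not something taken from the ticket, and driver APIs differ:

var coll = db.getSiblingDB("offerStore").offer;  // the ticket's collection
// A default cursor is reaped by the server after ~10 minutes of inactivity.
var withTimeout = coll.find();
// A no-timeout cursor is never reaped by the server on its own ...
var noTimeout = coll.find().addOption(DBQuery.Option.noTimeout);
// ... so the client that opened it must exhaust it or close it explicitly,
// otherwise it keeps blocking the rangeDeleter on that shard.
noTimeout.close();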

Comment by Kay Agahd [ 18/Jun/14 ]

Randolph, ok, I understand. I thought the no-timeout cursors were created by the chunk migration process itself, in which case I wouldn't have any chance of knowing which cursors were created.
However, you are saying that the no-timeout cursors were created by our applications and that we have to kill them ourselves, right?

Since no-timeout cursors are not bound to a socket/connection, they will live forever if the application dies before reaching the end of the cursor's result set.
How should we deal with this?

Should I open a feature request to be able to delete inactive no-timeout cursors or is it planned already?

Does mongodb have an option to disallow no-timeout cursors?

Comment by Randolph Tan [ 18/Jun/14 ]

Hi,

The drivers should know the cursor ids since they use them to call getMore - so you can indirectly close a cursor by calling close on the cursor object (not all drivers provide this api, though). I am not familiar with your setup, but this option is realistically harder to do than just restarting the server. The only server you need to restart is the one waiting for open cursors, which in your case is offerStoreIT2.

Comment by Kay Agahd [ 18/Jun/14 ]

Randolph, what do you mean by "restart the server"? Which server(s) do I have to restart? The router(s), just one primary, the whole replSet (which one?), all primaries, all nodes of all replSets, or all three config servers?

Comment by Kay Agahd [ 18/Jun/14 ]

Randolph, how can I close a cursor when I don't know the cursor ID?

Comment by Randolph Tan [ 16/Jun/14 ]

Hi,

There is currently no direct way to know the cursor IDs. The work on SERVER-13648 will include showing the ids of the cursors the server is waiting for. In the meantime, what you can do to unblock it is either close/deplete the cursors (note that new cursors created after the cleanup started can be ignored) or restart the server (this will result in orphaned documents).
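
A hedged sketch of the close/deplete option in the mongo shell (illustrative only; again assuming the ticket's offerStore.offer collection):

var coll = db.getSiblingDB("offerStore").offer;
var cur = coll.find().addOption(DBQuery.Option.noTimeout);
// "Depleting" a cursor means iterating it to the end, after which the server
// frees it on its own; itcount() drains it client-side in a single call.
cur.itcount();
// Alternatively, close it without reading the remaining batches:
// cur.close();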

Comment by Kay Agahd [ 14/Jun/14 ]

siyuan.zhou@10gen.com, indeed, we see "rangeDeleter waiting for n cursors..." in our logs, where n is never below 10, so it's waiting forever. We also see the stats of open cursors (many of them are no-timeout cursors), but we don't see how to kill these cursors since we don't have their IDs.

So my question is still the same: how can we unblock the deletion from the previous migration?
Thanks!

Comment by Siyuan Zhou [ 11/Jun/14 ]

Hi kay.agahd@idealo.de,

If you are using v2.6.1, the logs from the primary should be enough to observe the open cursors blocking migration cleanup, e.g. "rangeDeleter waiting for XXX cursors in <XXX namespace> to finish". currentOp() cannot provide information about cursors, but serverStatus() returns the stats of open cursors, especially those with no timeout. We are working on improving the logs and stats of migration cleanup. You can follow SERVER-13648 for updates.
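
For reference, a sketch of reading those cursor stats from serverStatus() on the shard primary; the field names shown are the 2.6-era cursors section and are assumptions that may differ in other versions:

// Run against the primary of the shard that is waiting on deletes.
var cs = db.serverStatus().cursors;
printjson(cs);
// Typically includes counters such as:
//   cs.totalOpen       - all open cursors on this mongod
//   cs.totalNoTimeout  - cursors opened with the no-timeout option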

Comment by Kay Agahd [ 05/Jun/14 ]

Thank you very much, Siyuan, for coming back. Do you need the logs from both the primary and the secondaries, or only from the primary?

Shouldn't we be able to find all open cursors with high secs_running values by executing db.currentOp(), so that we could kill them? However, I couldn't find any such operations.

How to proceed? Just restarting all servers is not a viable solution.

Thanks for your help!

Comment by Siyuan Zhou [ 04/Jun/14 ]

Hey kay.agahd@idealo.de,

The chunk deletion on the FROM-shard is asynchronous. In your case, offerStoreIT2 has 81 chunks pending deletion. Moving chunks to this shard is not allowed while it is still waiting on those deletions, since the newly migrated data could be lost during the deletion from a previous migration. Usually the deletion is very fast, but an open cursor at the time of deletion can block it. If the cursor has no timeout, the deletion will be blocked until the cursor is closed or the mongod restarts.

To help us diagnose this issue, it would be very helpful if we could have the logs from offerStoreIT2 covering these 2 days.

Thanks,
Siyuan
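
A hedged way to look for the blocked range deleter from the shell, rather than grepping the log file, is the getLog command; the exact log wording ("rangeDeleter waiting for ... cursors") varies by version:

// Run against the primary of offerStoreIT2, the shard with pending deletes.
var res = db.adminCommand({ getLog: "global" });
res.log.forEach(function (line) {
    // rangeDeleter lines name the namespace and the cursors being waited on.
    if (/rangeDeleter/.test(line)) {
        print(line);
    }
});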

Comment by Kay Agahd [ 04/Jun/14 ]

Is there any progress on this issue?
