[SERVER-28632] couldn't move chunk when doing shardCollection with hashed sharding key Created: 05/Apr/17  Updated: 27/Oct/23  Resolved: 17/Apr/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.0.9
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Martin Wu Assignee: Mark Agarunov
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

When I run sh.shardCollection on the same collection from many processes at the same time, this error occurs almost every time.

How to reproduce:

➜  ~ cat shardCollection.js
sh.enableSharding("log")
sh.shardCollection("log.log", {db: "hashed"})
sh.shardCollection("log.log2", {db: "hashed"})
sh.shardCollection("log.log3", {db: "hashed"})
sh.shardCollection("log.log4", {db: "hashed"})
sh.shardCollection("log.log5", {db: "hashed"})
sh.shardCollection("log.log6", {db: "hashed"})
 
➜  ~ cat test_shardColl.sh 
mongo --port 27105  shardCollection.js &
mongo --port 27105  shardCollection.js &
mongo --port 27105  shardCollection.js &
mongo --port 27105  shardCollection.js &
mongo --port 27105  shardCollection.js &
mongo --port 27105  shardCollection.js &
mongo --port 27105  shardCollection.js &
mongo --port 27105  shardCollection.js &
mongo --port 27105  shardCollection.js &
mongo --port 27105  shardCollection.js &
mongo --port 27105  shardCollection.js &
mongo --port 27105  shardCollection.js &
mongo --port 27105  shardCollection.js &
mongo --port 27105  shardCollection.js &
mongo --port 27105  shardCollection.js &

Then just execute test_shardColl.sh:

bash test_shardColl.sh

When you call db.setLogLevel(5, "sharding"), you can see log entries like this:

2017-04-05T18:09:43.889+0800 W COMMAND  [conn20] couldn't move chunk ns: log.log5, shard: shard0001:localhost:27102, lastmod: 1|3||000000000000000000000000, min: { db: 1844674407370955160 }, max: { db: 5534023222112865480 } to shard shard0002:localhost:27103 while sharding collection log.log5. Reason: { ok: 0.0, errmsg: "migration already in progress" }

I suspect that something is wrong with ShardCollectionCmd::run() in mongo/s/commands_admin.cpp:

if (to == chunk->getShard())
    continue;



 Comments   
Comment by Martin Wu [ 24/Apr/17 ]

Hello mark.agarunov,

OK, thank you for your response. That's clear.

Martin

Comment by Mark Agarunov [ 14/Apr/17 ]

Hello coolxwu,

The driver is returning OK because the collection is successfully sharded. The error message is related to the balancing of chunks across the shards, but the collection itself is properly sharded, so the shardCollection command is successful even if the balancing/chunk migration is not. Additionally, note that chunk migrations do not take a global lock.

Thanks,
Mark
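
Since shardCollection can return OK while the initial chunk migrations fail, one way to see what actually happened is to count chunks per shard in the config database. The following is a minimal sketch, not an official procedure: countChunksPerShard is a hypothetical helper name, and it assumes the 3.0-era config.chunks layout in which each chunk document carries `ns` and `shard` fields. The namespace log.log5 is taken from the log line in the description.

```javascript
// Count chunks per shard from config.chunks-style documents.
// Assumption: each document has a string `shard` field (3.0-era schema).
// countChunksPerShard is a hypothetical helper, not part of the mongo shell.
function countChunksPerShard(chunkDocs) {
    var counts = {};
    chunkDocs.forEach(function (doc) {
        counts[doc.shard] = (counts[doc.shard] || 0) + 1;
    });
    return counts;
}

// This part only runs inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
    var chunks = db.getSiblingDB("config").chunks
        .find({ ns: "log.log5" })
        .toArray();
    // Prints an object mapping each shard name to its chunk count.
    printjson(countChunksPerShard(chunks));
}
```

If all chunks are reported on a single shard right after sharding, the initial distribution did not complete, and the balancer would be expected to even things out later. sh.status() shows the same distribution in a more verbose form.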

Comment by Martin Wu [ 14/Apr/17 ]

Hello mark.agarunov,

You understood correctly. My point is that shardCollection SHOULD NOT report success before the previous shardCollection command has completed under a global lock. When the Python driver returns "OK" after shardCollection, I assume that everything is OK. In fact, it is not.

Thanks.

Comment by Mark Agarunov [ 06/Apr/17 ]

Hello coolxwu,

I may be misunderstanding the behavior, my apologies. From what I can see looking at your script and output, it essentially causes multiple shardCollection commands to be executed on the same collection in parallel. As the initial chunk migration is not instantaneous, it appears that the error you're seeing is due to a shardCollection command being called on a collection before the previous shardCollection command has completed. If I am missing something in my understanding, please let me know.

Thanks,
Mark
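
To reduce the chance of issuing shardCollection for a namespace that is already sharded, a script could first consult config.collections and skip those namespaces. This is a sketch under stated assumptions: unshardedNamespaces is a hypothetical helper name, and it assumes 3.0-era config.collections documents whose `_id` is the namespace and which carry a boolean `dropped` flag. The check and the command are separate steps, so processes started at the same instant can still race; it narrows the window rather than closing it.

```javascript
// Return the namespaces from `wanted` that have no live entry in
// config.collections-style documents.
// Assumption: `_id` is the namespace and `dropped` marks dropped collections.
// unshardedNamespaces is a hypothetical helper, not part of the mongo shell.
function unshardedNamespaces(wanted, collectionDocs) {
    var sharded = {};
    collectionDocs.forEach(function (doc) {
        if (!doc.dropped) {
            sharded[doc._id] = true;
        }
    });
    return wanted.filter(function (ns) {
        return !Object.prototype.hasOwnProperty.call(sharded, ns);
    });
}

// This part only runs inside the mongo shell, where `db` and `sh` are defined.
if (typeof db !== "undefined") {
    var wanted = ["log.log", "log.log2", "log.log3",
                  "log.log4", "log.log5", "log.log6"];
    var existing = db.getSiblingDB("config").collections.find().toArray();
    unshardedNamespaces(wanted, existing).forEach(function (ns) {
        printjson(sh.shardCollection(ns, { db: "hashed" }));
    });
}
```

Running the script from a single process, rather than fifteen in parallel, avoids the overlap entirely.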

Comment by Martin Wu [ 06/Apr/17 ]

Hi mark.agarunov,
I know what `errmsg: "migration already in progress"` means, but it is strange that this state persisted for a very long time. Also, this is the initial chunk-moving step of shardCollection, so how could another migration be ahead of it?

Thanks.

Comment by Mark Agarunov [ 05/Apr/17 ]

Hello coolxwu,

Thank you for the report. Looking at the output you've provided, it appears that you are seeing this error because the chunk is still being moved due to the previously issued command. According to the logs:

errmsg: "migration already in progress" 

Thanks,
Mark
