[SERVER-12515] Unable to move hashed shard key chunks created by numInitialChunks Created: 28/Jan/14  Updated: 11/Jul/16  Resolved: 18/Feb/14

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.4.9, 2.5.1, 2.5.5
Fix Version/s: 2.4.10, 2.6.0-rc0

Type: Bug Priority: Major - P3
Reporter: Kevin Pulo Assignee: Randolph Tan
Resolution: Done Votes: 0
Labels: hashed, numInitialChunks
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File hash_shard_num_chunks_move1.js     File hash_shard_num_chunks_move2.js     File hash_shard_num_chunks_move3.js    
Issue Links:
Depends
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   
Issue Status as of March 31, 2014

ISSUE SUMMARY

When a collection is sharded on a hashed shard key with the numInitialChunks option, a bug in the sharding logic for hashed shard keys leaves some chunks unable to be moved with the moveChunk command immediately after the collection is created.

USER IMPACT

Because the affected chunks cannot be moved, this issue can lead to balancing failures and imbalanced data in a sharded collection with a hashed shard key.

SOLUTION

Chunk splits now set the correct lower bound in the cached metadata within the shard.

WORKAROUNDS

A restart of mongod on the primary nodes between the shardCollection and moveChunk commands clears out the chunk manager cache and resolves the issue.

AFFECTED VERSIONS

Versions 2.4.0 to 2.4.9 are affected by this bug.

PATCHES

The fix is included in the 2.4.10 production release and the 2.6.0-rc0 release candidate, which will evolve into the 2.6.0 production release.

Original Description

When sharding a collection with a hashed shard key, and specifying numInitialChunks, some of these initial chunks are unable to be moved immediately afterwards.

jstests are attached.
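
For readers without the attachments, the following is a minimal sketch of the kind of check involved. It is not one of the attached jstests. It assumes that sharding is already enabled on the database, that `admin` and `config` are handles to the admin and config databases obtained through a mongos, and that the shard key field is `x`. The helper name `tryMoveInitialChunks` and its parameters are hypothetical.

```javascript
// Hypothetical helper, not taken from the attached jstests.
// admin / config: handles to the "admin" and "config" databases via a mongos.
function tryMoveInitialChunks(admin, config, ns, numInitialChunks, shardNames) {
    // Shard on a hashed key and pre-split into numInitialChunks chunks.
    var res = admin.runCommand({
        shardCollection: ns,
        key: { x: "hashed" },
        numInitialChunks: numInitialChunks
    });
    if (!res.ok) {
        throw new Error("shardCollection failed: " + JSON.stringify(res));
    }

    var failures = [];
    config.chunks.find({ ns: ns }).forEach(function (chunk) {
        // Pick any shard other than the one currently holding the chunk.
        var target = shardNames.filter(function (s) {
            return s !== chunk.shard;
        })[0];
        // For hashed shard keys, moveChunk is addressed by chunk bounds.
        var move = admin.runCommand({
            moveChunk: ns,
            bounds: [chunk.min, chunk.max],
            to: target
        });
        if (!move.ok) {
            failures.push({ min: chunk.min, max: chunk.max, from: chunk.shard, result: move });
        }
    });
    return failures;
}
```

On an affected version, the returned array would be expected to hold entries like the error documents quoted below; on a fixed version it should be empty.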

In 2.4.9, the characterisation is:

  • Only chunks on the last shard are affected.
  • All but the final chunk are affected.
  • Before a successful chunk move, attempting to move problem chunks gives errors such as:

    {
            "cause" : {
                    "errmsg" : "exception: ranges differ, requested: { x: 0 } -> { x: 1152921504606846974 } existing: { x: 0 } -> { x: 8070450532247928818 }",
                    "code" : 13587,
                    "ok" : 0
            },
            "ok" : 0,
            "errmsg" : "move failed"
    }
    {
            "cause" : {
                    "errmsg" : "exception: ranges differ, requested: { x: 1152921504606846974 } -> { x: 2305843009213693948 } existing: { x: 1152921504606846974 } -> { x: MaxKey }",
                    "code" : 13587,
                    "ok" : 0
            },
            "ok" : 0,
            "errmsg" : "move failed"
    }
    {
            "cause" : {
                    "errmsg" : "exception: ranges differ, requested: { x: 2305843009213693948 } -> { x: 3458764513820540922 } existing: { x: 2305843009213693948 } -> { x: MaxKey }",
                    "code" : 13587,
                    "ok" : 0
            },
            "ok" : 0,
            "errmsg" : "move failed"
    }
    ...

  • After a successful chunk move, attempting to move a problem chunk gives a different error:

    { "ok" : 0, "errmsg" : "no chunk found with those upper and lower bounds" }
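
The bounds quoted in the 2.4.9 errors above are evenly spaced: each is an integer multiple of 1152921504606846974. A short check of that arithmetic follows. The values are too large for JavaScript numbers to represent exactly, so the sketch uses BigInt, which needs a modern JavaScript runtime rather than the 2.4 mongo shell. The inference that this spacing corresponds to 16 initial chunks is an assumption drawn from the arithmetic, not something stated in this ticket.

```javascript
// Bounds copied from the 2.4.9 error messages above.
const bounds = [
    1152921504606846974n,
    2305843009213693948n,
    3458764513820540922n,
    8070450532247928818n
];
const step = bounds[0];

// Express each bound as a multiple of the smallest one.
const multiples = bounds.map(function (b) {
    return { bound: b, multiple: b / step, exact: b % step === 0n };
});

// Assumption: split points are evenly spaced over the signed 64-bit range.
// floor(maxInt64 / 16) * 2 reproduces the step seen in the errors.
const maxInt64 = 9223372036854775807n;
const assumedChunks = 16n;
const assumedStep = (maxInt64 / assumedChunks) * 2n;

multiples.forEach(function (m) {
    console.log(m.bound.toString(), "=", m.multiple.toString(), "x step, exact:", m.exact);
});
console.log("step matches 16-chunk assumption:", assumedStep === step);
```

The multiples come out as 1, 2, 3 and 7, so the "existing" upper bound of 8070450532247928818 in the first error sits seven steps above zero rather than one.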

In 2.5.1+, the characterisation is:

  • All shards are affected.
  • All chunks are affected.
  • Attempting to move a chunk gives errors such as:

    {
            "cause" : {
                    "errmsg" : "exception: cannot remove chunk [{ x: 0 }, { x: 1152921504606846974 }), this shard does not contain the chunk and it overlaps [{ x: 0 }, { x: 8070450532247928818 })",
                    "code" : 16855,
                    "ok" : 0
            },
            "ok" : 0,
            "errmsg" : "move failed"
    }
    {
            "cause" : {
                    "errmsg" : "exception: cannot remove chunk [{ x: 1152921504606846974 }, { x: 2305843009213693948 }), this shard does not contain the chunk and it overlaps [{ x: 0 }, { x: 8070450532247928818 }), [{ x: 1152921504606846974 }, { x: MaxKey })",
                    "code" : 16855,
                    "ok" : 0
            },
            "ok" : 0,
            "errmsg" : "move failed"
    }
    {
            "cause" : {
                    "errmsg" : "exception: cannot remove chunk [{ x: 2305843009213693948 }, { x: 3458764513820540922 }), this shard does not contain the chunk and it overlaps [{ x: 1152921504606846974 }, { x: MaxKey }), [{ x: 2305843009213693948 }, { x: MaxKey })",
                    "code" : 16855,
                    "ok" : 0
            },
            "ok" : 0,
            "errmsg" : "move failed"
    }
    ...
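
The 16855 messages report that the requested chunk overlaps ranges the shard believes it holds without matching any of them exactly. The following sketch illustrates that kind of check on half-open ranges. It is an illustration only, not the server's implementation; `findOverlaps`, `hasExactMatch` and the `cached` array are hypothetical. MaxKey is modeled as `Infinity`, and bounds are written in units of 1152921504606846974 for readability.

```javascript
// Illustrative only: half-open ranges [min, max), with MaxKey modeled as Infinity.
function findOverlaps(cached, requested) {
    return cached.filter(function (r) {
        return r.min < requested.max && requested.min < r.max;
    });
}

function hasExactMatch(cached, requested) {
    return cached.some(function (r) {
        return r.min === requested.min && r.max === requested.max;
    });
}

// Cached ranges shaped like those named in the second error above,
// in units of the step: [0, 7) and [1, MaxKey).
var cached = [
    { min: 0, max: 7 },
    { min: 1, max: Infinity }
];
// The chunk being moved: [1, 2) in the same units.
var requested = { min: 1, max: 2 };

var overlaps = findOverlaps(cached, requested);
console.log("exact match:", hasExactMatch(cached, requested));
console.log("number of overlapping cached ranges:", overlaps.length);
```

With these inputs there is no exact match and both cached ranges overlap the request, which mirrors the two ranges listed in the second error message.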

The chunks look fine in config.chunks. Restarting the affected shard server between shardCollection and moveChunk allows the chunks to be moved successfully, so this is likely to be a bug in ChunkManager that causes it to get confused about chunk bounds. Specifically, it looks like the upper bound is not being set properly.



 Comments   
Comment by Björn Bullerdieck [X] [ 08/Apr/14 ]

I think there is still a threading problem when simultaneously creating several sharded collections (SERVER-13491). Tried in 2.4.10.

Comment by Randolph Tan [ 07/Apr/14 ]

Hi dmurphy,

We just corrected the error in the description. If you are using hashed shard keys, I highly recommend upgrading to 2.4.10.

Thanks!

Comment by David Murphy [ 07/Apr/14 ]

This is wrong: ALL moveChunks are affected in 2.4.6, resulting in a broken balancer until the max chunk shard is stepped down. This is because the max chunk thinks it owns all chunks.


Comment by Daniel Pasette (Inactive) [ 12/Mar/14 ]

dmurphy: here is the exact commit:
https://github.com/mongodb/mongo/commit/1ceeb84bd0b170bb367e8f78be53d2f075007cb5

It can be found on the v2.4 branch. 2.4.10 is still not a release candidate.

Comment by David Murphy [ 12/Mar/14 ]

Thanks Randolph,

Erik and I were planning to look at that today.

Is the branch currently the 2.4.10 or 2.4.9 base in the repo (2.4.10's candidate, I would assume, right?)

Comment by Randolph Tan [ 12/Mar/14 ]

dmurphy I just backported the fix to the v2.4 branch so you can just clone the repo and build the source on that branch.

Comment by Githook User [ 11/Mar/14 ]

Author: Randolph Tan (username: renctan, email: randolph@10gen.com)

Message: SERVER-12515 Unable to move hashed shard key chunks created by numInitialChunks

Backport fix for commit 3a08be3bf2a1a650c97543a448a8ea0c143a89b6
Branch: v2.4
https://github.com/mongodb/mongo/commit/1ceeb84bd0b170bb367e8f78be53d2f075007cb5

Comment by Daniel Pasette (Inactive) [ 05/Mar/14 ]

It hasn't been backported yet. When it is (and this is being worked on this week), it will be listed as a public comment on this ticket. Without having done the backport yet, I can't say exactly what the code will look like.

Comment by David Murphy [ 04/Mar/14 ]

I know this will be backported, as you have it listed as 2.4.10 in fixed version. I asked for the commit hash for the repo, so I can see what was done in 2.4.10. Please provide this so we can patch our binary affecting the customer while we wait for 2.4.10 to be released and verified stable.

If we don't have this I will need to just make a patch based on the 2.6 commit, which is obviously less optimal than to use a targeted commit for the 2.4.10 tag.

David

Comment by Randolph Tan [ 04/Mar/14 ]

It will be backported to 2.4.10.

Comment by David Murphy [ 03/Mar/14 ]

I see this was tagged with 2.4.10, was this still using commit 3a08be3bf2a1a650c97543a448a8ea0c143a89b6, or is there a new commit for the 2.4.10 version of this?

Comment by Githook User [ 18/Feb/14 ]

Author: Randolph Tan (username: renctan, email: randolph@10gen.com)

Message: SERVER-12515 Unable to move hashed shard key chunks created by numInitialChunks
Branch: master
https://github.com/mongodb/mongo/commit/3a08be3bf2a1a650c97543a448a8ea0c143a89b6
