Author: Randolph Tan (renctan) <randolph@10gen.com>
Message: SERVER-12638 Sharding chunks ranges overlap
(cherry picked from commit bc26f73ef697c844cde8e4561bbf9a9c2f4728be)
Branch: v2.6
https://github.com/mongodb/mongo/commit/337729f2d0b943e0557a1c709a3bc155bf374d51
|
|
Author: Randolph Tan (renctan) <randolph@10gen.com>
Message: SERVER-12638 Sharding chunks ranges overlap
Branch: master
https://github.com/mongodb/mongo/commit/bc26f73ef697c844cde8e4561bbf9a9c2f4728be
|
|
The issue comes from the way mongos performs the initial splits for an empty hashed collection. This is currently done in 3 stages: (1) split the hash space into N chunks, where N is the number of shards; (2) move the chunks to each shard; (3) do a finer split on the chunks. The issue comes about when a split point in step 3 also exists in step 1. This is not properly handled in the CollectionMetadata and makes the chunk have a range of min -> min. This is completely internal to the shard, but it can 'taint' the config servers when the chunk gets moved, because the information used in the applyOps for the moveChunk command comes from this bad chunk info.
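To illustrate the mechanism (this is only a hedged sketch in mongo-shell JavaScript, not the actual CollectionMetadata code): if a stage-3 split point equals a boundary that stage 1 already created, a naive split produces a degenerate chunk whose range is min -> min.
// Illustrative sketch only - not the server implementation.
// splitChunk() cuts the chunk [min, max) at the given split points.
function splitChunk(chunk, splitPoints) {
    var pieces = [];
    var lower = chunk.min;
    splitPoints.forEach(function (point) {
        // if point equals lower (a boundary that already exists),
        // this produces a degenerate chunk with min == max
        pieces.push({ min: lower, max: point });
        lower = point;
    });
    pieces.push({ min: lower, max: chunk.max });
    return pieces;
}
// Stage 1 already created a boundary at -4611686018427387900; stage 3
// asks to split on that same value again:
splitChunk({ min: NumberLong("-4611686018427387900"), max: NumberLong("0") },
           [ NumberLong("-4611686018427387900"), NumberLong("-2305843009213693950") ]);
// -> the first piece has min == max, i.e. the bad "min -> min" chunk described above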
|
|
With hashed sharding the initial pre-split is expected to move the empty chunks evenly around the cluster - this does not engage the "regular" balancer.
|
chunks:
    shard_002  2
    shard_001  2
    shard_003  2
    shard_004  2
|
Why doesn't it look like this after the initialization? That is what I was hoping for (for new collections).
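For reference, the actual per-shard chunk counts for a collection can be read straight from the config database and compared against the expected distribution above (illustrative mongo-shell snippet; the namespace is just an example from this ticket):
// count chunks per shard for one collection
db.getSiblingDB("config").chunks.aggregate([
    { $match: { ns: "ci_400000000000008.some_collection" } },
    { $group: { _id: "$shard", chunks: { $sum: 1 } } },
    { $sort: { _id: 1 } }
]);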
|
|
That is a very good point. Why did some of the chunks move even with the balancer disabled?
|
|
Omitting the numInitialChunks argument looks like an easy workaround. With this "fix", I cannot reproduce it.
|
|
But why are some chunks moved and some not, with the balancer deactivated?
mongos> sh.status()
--- Sharding Status ---
  sharding version: {
      "_id" : 1,
      "version" : 3,
      "minCompatibleVersion" : 3,
      "currentVersion" : 4,
      "clusterId" : ObjectId("53300f28512956601e55ce7f")
  }
  shards:
      { "_id" : "shard_001", "host" : "127.0.0.1:20117" }
      { "_id" : "shard_002", "host" : "127.0.0.1:20217" }
      { "_id" : "shard_003", "host" : "127.0.0.1:20317" }
      { "_id" : "shard_004", "host" : "127.0.0.1:20417" }
  databases:
      { "_id" : "admin", "partitioned" : false, "primary" : "config" }
      ...
      { "_id" : "ci_400000000000008", "partitioned" : true, "primary" : "shard_001" }
          ci_400000000000008.some_collection
              shard key: { "_id" : "hashed" }
              chunks:
                  shard_001  8
              { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-6917529027641081850") } on : shard_001 Timestamp(1, 4)
              { "_id" : NumberLong("-6917529027641081850") } -->> { "_id" : NumberLong("-4611686018427387900") } on : shard_001 Timestamp(1, 5)
              { "_id" : NumberLong("-4611686018427387900") } -->> { "_id" : NumberLong("-2305843009213693950") } on : shard_001 Timestamp(1, 6)
              { "_id" : NumberLong("-2305843009213693950") } -->> { "_id" : NumberLong(0) } on : shard_001 Timestamp(1, 7)
              { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("2305843009213693950") } on : shard_001 Timestamp(1, 8)
              { "_id" : NumberLong("2305843009213693950") } -->> { "_id" : NumberLong("4611686018427387900") } on : shard_001 Timestamp(1, 9)
              { "_id" : NumberLong("4611686018427387900") } -->> { "_id" : NumberLong("6917529027641081850") } on : shard_001 Timestamp(1, 10)
              { "_id" : NumberLong("6917529027641081850") } -->> { "_id" : { "$maxKey" : 1 } } on : shard_001 Timestamp(1, 11)
      { "_id" : "ci_400000000000001", "partitioned" : true, "primary" : "shard_001" }
          ci_400000000000001.some_collection
              shard key: { "_id" : "hashed" }
              chunks:
                  shard_002  1
                  shard_001  3
                  shard_003  2
                  shard_004  2
              { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-6917529027641081850") } on : shard_002 Timestamp(4, 0)
              { "_id" : NumberLong("-6917529027641081850") } -->> { "_id" : NumberLong("-4611686018427387900") } on : shard_001 Timestamp(4, 1)
              { "_id" : NumberLong("-4611686018427387900") } -->> { "_id" : NumberLong("-2305843009213693950") } on : shard_001 Timestamp(3, 4)
              { "_id" : NumberLong("-2305843009213693950") } -->> { "_id" : NumberLong(0) } on : shard_001 Timestamp(3, 5)
              { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("2305843009213693950") } on : shard_003 Timestamp(3, 6)
              { "_id" : NumberLong("2305843009213693950") } -->> { "_id" : NumberLong("4611686018427387900") } on : shard_003 Timestamp(3, 7)
              { "_id" : NumberLong("4611686018427387900") } -->> { "_id" : NumberLong("6917529027641081850") } on : shard_004 Timestamp(3, 8)
              { "_id" : NumberLong("6917529027641081850") } -->> { "_id" : { "$maxKey" : 1 } } on : shard_004 Timestamp(3, 9)
      { "_id" : "ci_400000000000006", "partitioned" : true, "primary" : "shard_001" }
          ci_400000000000006.some_collection
              shard key: { "_id" : "hashed" }
              chunks:
                  shard_003  1
                  shard_001  5
                  shard_002  2
              { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-6917529027641081850") } on : shard_003 Timestamp(3, 0)
              { "_id" : NumberLong("-6917529027641081850") } -->> { "_id" : NumberLong("-4611686018427387900") } on : shard_001 Timestamp(3, 1)
              { "_id" : NumberLong("-4611686018427387900") } -->> { "_id" : NumberLong("-2305843009213693950") } on : shard_002 Timestamp(2, 4)
              { "_id" : NumberLong("-2305843009213693950") } -->> { "_id" : NumberLong(0) } on : shard_002 Timestamp(2, 5)
              { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("2305843009213693950") } on : shard_001 Timestamp(2, 6)
              { "_id" : NumberLong("2305843009213693950") } -->> { "_id" : NumberLong("4611686018427387900") } on : shard_001 Timestamp(2, 7)
              { "_id" : NumberLong("4611686018427387900") } -->> { "_id" : NumberLong("6917529027641081850") } on : shard_001 Timestamp(2, 8)
              { "_id" : NumberLong("6917529027641081850") } -->> { "_id" : { "$maxKey" : 1 } } on : shard_001 Timestamp(2, 9)
      { "_id" : "ci_400000000000005", "partitioned" : true, "primary" : "shard_001" }
          ci_400000000000005.some_collection
              shard key: { "_id" : "hashed" }
              chunks:
                  shard_002  1
                  shard_001  7
              { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-6917529027641081850") } on : shard_002 Timestamp(2, 0)
              { "_id" : NumberLong("-6917529027641081850") } -->> { "_id" : NumberLong("-4611686018427387900") } on : shard_001 Timestamp(2, 1)
              { "_id" : NumberLong("-4611686018427387900") } -->> { "_id" : NumberLong("-2305843009213693950") } on : shard_001 Timestamp(1, 6)
              { "_id" : NumberLong("-2305843009213693950") } -->> { "_id" : NumberLong(0) } on : shard_001 Timestamp(1, 7)
              { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("2305843009213693950") } on : shard_001 Timestamp(1, 8)
              { "_id" : NumberLong("2305843009213693950") } -->> { "_id" : NumberLong("4611686018427387900") } on : shard_001 Timestamp(1, 9)
              { "_id" : NumberLong("4611686018427387900") } -->> { "_id" : NumberLong("6917529027641081850") } on : shard_001 Timestamp(1, 10)
              { "_id" : NumberLong("6917529027641081850") } -->> { "_id" : { "$maxKey" : 1 } } on : shard_001 Timestamp(1, 11)
|
|
|
@Shaun Verch
1. Deactivate the balancer.
2. Start 10 threads; each thread creates 1 collection, enables sharding on it, and inserts 100,000 entries:
   db.adminCommand({ enableSharding : dbName });
   db.adminCommand({ shardcollection : dbName + "." + collName, key : { _id : "hashed" }, numInitialChunks : 15 });
   fillCollection(100000)
3. Activate the balancer.
4. Check the config data (the config data has overlapping ranges).
But it looks like the 100,000 entries and deactivating the balancer are not important for this bug.
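For reference, the steps above roughly translate to the following mongo-shell sketch (hedged: the original used 10 parallel Java threads, so this sequential version may be less likely to hit the race; the names are modeled on the ones in this ticket):
sh.stopBalancer();                                   // 1. deactivate the balancer
for (var i = 1; i <= 10; i++) {
    var dbName = "ci_40000000000000" + i;
    var collName = "some_collection";
    db.adminCommand({ enableSharding : dbName });
    db.adminCommand({ shardcollection : dbName + "." + collName,
                      key : { _id : "hashed" }, numInitialChunks : 15 });
    var coll = db.getSiblingDB(dbName)[collName];
    for (var n = 0; n < 100000; n++) { coll.insert({ n : n }); }   // 2. fill the collection
}
sh.startBalancer();                                  // 3. activate the balancer
// 4. check config.chunks for overlapping ranges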
|
|
I probably should not have called it a race condition - a specific sequence of steps will reproduce this. Among those steps is a failure to migrate one of the initially split chunks - this is much more likely to happen when multiple dbs/collections are being sharded at approximately the same time.
The commit that fixed this issue corrected a mistake in the code that manifested differently depending on the sequence of events (and whether there were any failures that caused chunks to end up on particular shards).
So it was only a race condition in the sense that sharding of multiple dbs/collections was happening at approximately the same time. The more of them there are, the more likely it is that one will fail to complete the "normal" initial steps and create the sequence of events necessary to trigger this bug.
There are several ways you can work around this bug until 2.4.10 comes out - the simplest one would be to set up your scripts to enable sharding on dbs and shard collections on "hashed" keys one at a time, checking the previous one's success. The initial steps may take a little bit longer, but since this is splitting and balancing empty chunks, the whole thing is a matter of seconds.
Another way of avoiding the situation that can trigger this is, I think, by omitting the numInitialChunks argument - I haven't been able to trigger this bug when the default number of chunks (numShards*2) is used. Possibly that is because the default uses a regular split while a larger number of chunks uses a "multi-split" - I'm going to leave the rest of the debugging to the kernel engineers who work on sharding.
It's possible you may need to do both of the above to guarantee the bug can't be triggered.
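A hedged shell sketch of that workaround - shard one collection at a time, omit numInitialChunks so the default is used, and verify each command before starting the next (the namespaces are only examples):
var namespaces = [ "ci_400000000000001.some_collection",
                   "ci_400000000000002.some_collection" ];   // ...one entry per collection
namespaces.forEach(function (ns) {
    var dbName = ns.split(".")[0];
    // check success of every step before moving on to the next collection
    assert.commandWorked(db.adminCommand({ enableSharding : dbName }));
    assert.commandWorked(db.adminCommand({ shardCollection : ns,
                                           key : { _id : "hashed" } }));   // default numInitialChunks
});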
|
|
It seems like you have a root cause for the broken shard ranges, but the bug mentioned above does not address any race condition. Could you please elaborate on the specifics of the race condition and, if possible, point to the exact bug/commits that address it?
Race conditions are difficult to understand without context, so this would really help in understanding what series of events caused the problem.
|
|
It appears the shard ranges were broken as a side effect of SERVER-12515 (though it seems to involve a race condition, so it didn't always manifest unless the circumstances created the necessary sequence of conditions).
The good news is that I cannot reproduce this on either 2.4 master (the intended 2.4.10 release) or the 2.6.0 release candidate.
|
|
The trouble starts on line 4173 of repro24.out.
This run has 10 collections; 9 of them get this wrong chunk range, and the one that doesn't is collection5.
|
|
sverch I made a small modification to your script and ran it a few times with 2.6 and couldn't get it to fail, but the first time I tried it on 2.4.8 it failed in exactly the way that others saw. I'm going to attach the script and the output.
|
|
Björn Bullerdieck, just double checking - you checked the config DB before the balancing started and you said all was okay - were all the chunks created? So the only difference between the config before and after was that chunks were moved to different shards, and the ranges for some of them became incorrect in the process - is that right?
|
|
achauhan@brightcove.com it may be helpful to get a copy of your config database - would you be able to attach it to this ticket?
|
|
We were able to reproduce this behaviour. Yesterday, after we had gotten things to work quite well, I realized that we had the balancer turned off. I then enabled the balancer, and very soon after that the config on one of the collections got corrupted in the same way (overlapping chunks). I have a feeling that there is something funky going on with the balancer that is biting us. There were a lot of "chunk migrate aborted" messages in some of the mongod logs. We are running mongos with the --noAutoSplit option and numInitialChunks = 4*shardcount.
|
|
Hi Björn,
I'm trying to reproduce your test, but I'm not entirely clear on what you are doing. Do you mean to:
- Deactivate balancer
- Spawn 10 threads. In each thread, do the following 100000 times, for different values of "dbName" and "collName":
  - db.adminCommand({ enableSharding : dbName });
  - db.adminCommand({ shardcollection : dbName + "." + collName, key : { _id : "hashed" }, numInitialChunks : 15 });
- Activate balancer
- Check config data (config data has overlapping ranges)
I've attached a script to this ticket that I think is based on what you described using the mongo shell, but I haven't been able to reproduce it. Can you reproduce it creating fewer than "100,000" databases in each thread? Am I understanding that number correctly?
Thanks,
~Shaun Verch
|
|
Yes, I have a little Java program. I'll see if I can extract it.
But it is very simple:
create one MongoClient -> start 10 threads -> (this time: deactivate the balancer)
-> every thread: create a database (ci_40000000000000*) -> create a collection with enablesharding, _id:hashed, numInitialChunks:15 -> create 100,000 entries
=> config looks ok (but chunks are not distributed - sh.status())
-> activate the balancer
=> config looks bad
Same result without deactivating the balancer, but I think it is a balancer bug.
I can reproduce it easily.
Test system (1 developer machine, Win7 64-bit): 3 single-shard instances (no replicas), 3 config instances, 1 mongos
|
|
config dump and logs (20140321_1540.rar)
|
|
Björn Bullerdieck, do you have a script that can reproduce this behavior? If so, please attach it to the issue so we can try to reproduce it as well. If not, can you please upload a mongodump of the config database, and logs from your mongos instances that cover the period when you noticed this behavior?
|
|
The first try with 15 initial chunks was successful, but now...
A different number of chunks does not solve the problem (next try, with the balancer deactivated):
{ "_id" : "ci_400000000000006", "partitioned" : true, "primary" : "shard_001" }
|
ci_400000000000006.some_collection
|
shard key: { "_id" : "hashed" }
|
chunks:
|
shard_002 5
|
shard_003 5
|
shard_001 5
|
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-7993589098607472360") } on : shard_002 Timestamp(2, 0)
|
{ "_id" : NumberLong("-7993589098607472360") } -->> { "_id" : NumberLong("-6763806160360168920") } on : shard_003 Timestamp(3, 0)
|
{ "_id" : NumberLong("-6763806160360168920") } -->> { "_id" : NumberLong("-5534023222112865480") } on : shard_002 Timestamp(4, 0)
|
{ "_id" : NumberLong("-5534023222112865480") } -->> { "_id" : NumberLong("-4304240283865562040") } on : shard_003 Timestamp(5, 0)
|
{ "_id" : NumberLong("-4304240283865562040") } -->> { "_id" : NumberLong("-3074457345618258600") } on : shard_002 Timestamp(6, 0)
|
{ "_id" : NumberLong("-3074457345618258600") } -->> { "_id" : NumberLong("-1844674407370955160") } on : shard_003 Timestamp(7, 0)
|
{ "_id" : NumberLong("-1844674407370955160") } -->> { "_id" : NumberLong("-614891469123651720") } on : shard_002 Timestamp(8, 0)
|
{ "_id" : NumberLong("-614891469123651720") } -->> { "_id" : NumberLong("614891469123651720") } on : shard_003 Timestamp(9, 0)
|
{ "_id" : NumberLong("614891469123651720") } -->> { "_id" : NumberLong("1844674407370955160") } on : shard_002 Timestamp(10, 0)
|
{ "_id" : NumberLong("1844674407370955160") } -->> { "_id" : NumberLong("3074457345618258600") } on : shard_003 Timestamp(11, 0)
|
{ "_id" : NumberLong("3074457345618258600") } -->> { "_id" : NumberLong("7993589098607472360") } on : shard_001 Timestamp(11, 1)
|
{ "_id" : NumberLong("4304240283865562040") } -->> { "_id" : NumberLong("5534023222112865480") } on : shard_001 Timestamp(1, 14)
|
{ "_id" : NumberLong("5534023222112865480") } -->> { "_id" : NumberLong("6763806160360168920") } on : shard_001 Timestamp(1, 15)
|
{ "_id" : NumberLong("6763806160360168920") } -->> { "_id" : NumberLong("7993589098607472360") } on : shard_001 Timestamp(1, 16)
|
{ "_id" : NumberLong("7993589098607472360") } -->> { "_id" : { "$maxKey" : 1 } } on : shard_001 Timestamp(1, 17)
|
Interesting: always the same final number twice, and always the same chunk(s) in all sharded collections:
    { "_id" : NumberLong("3074457345618258600") } -->> { "_id" : NumberLong("7993589098607472360") } on : shard_001 Timestamp(11, 1)
    { "_id" : NumberLong("6763806160360168920") } -->> { "_id" : NumberLong("7993589098607472360") } on : shard_001 Timestamp(1, 16)
|
|
|
Is there a possible resolution or a workaround for this bug?
|
|
Same problem.
Operating system: Win7 64-bit
Affects version/s: 2.4.9
10 threads creating 10 sharded databases with sharded collections (hashed shard key, 8 chunks)
2014-03-20 15:17:48,621 INFO [main] de.webtrekk.mongodbvscouchbase.AbstractDatabaseHandler: startAccountThreads..
2014-03-20 15:17:48,926 INFO [pool-2-thread-7] ***.MongoDBUtils: Command: { "enablesharding" : "ci_400000000000007"}. Enabled sharding for database: { "serverUsed" : "***:27017" , "ok" : 1.0}
2014-03-20 15:17:49,059 INFO [pool-2-thread-5] ***.MongoDBUtils: Command: { "enablesharding" : "ci_400000000000005"}. Enabled sharding for database: { "serverUsed" : "***:27017" , "ok" : 1.0}
...
2014-03-20 15:17:54,480 INFO [pool-2-thread-3] ***.MongoDBUtils: Command: { "shardCollection" : "ci_400000000000003.some_collection" , "key" : { "_id" : "hashed"} , "numInitialChunks" : 8}. Enabled sharding for database: { "serverUsed" : "***:27017" , "collectionsharded" : "ci_400000000000003.some_collection" , "ok" : 1.0}
2014-03-20 15:17:54,481 INFO [pool-2-thread-7] ***.MongoDBUtils: Command: { "shardCollection" : "ci_400000000000007.some_collection" , "key" : { "_id" : "hashed"} , "numInitialChunks" : 8}. Enabled sharding for database: { "serverUsed" : "***:27017" , "collectionsharded" : "ci_400000000000007.some_collection" , "ok" : 1.0}
...
|
Only 2 of 10 are OK:
mongos> sh.status()
--- Sharding Status ---
  sharding version: {
      "_id" : 1,
      "version" : 3,
      "minCompatibleVersion" : 3,
      "currentVersion" : 4,
      "clusterId" : ObjectId("532af869d4e449358dc2f64d")
  }
  shards:
      { "_id" : "shard_001", "host" : "***:20117" }
      { "_id" : "shard_002", "host" : "***:20217" }
      { "_id" : "shard_003", "host" : "***:20317" }
  databases:
      { "_id" : "admin", "partitioned" : false, "primary" : "config" }
      { "_id" : "ci_400000000000007", "partitioned" : true, "primary" : "shard_001" }
          ci_400000000000007.some_collection
              shard key: { "_id" : "hashed" }
              chunks:
                  shard_002  3
                  shard_003  2
                  shard_001  3
              { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-6917529027641081850") } on : shard_002 Timestamp(2, 0)
              { "_id" : NumberLong("-6917529027641081850") } -->> { "_id" : NumberLong("-4611686018427387900") } on : shard_003 Timestamp(3, 0)
              { "_id" : NumberLong("-4611686018427387900") } -->> { "_id" : NumberLong("-2305843009213693950") } on : shard_002 Timestamp(4, 0)
              { "_id" : NumberLong("-2305843009213693950") } -->> { "_id" : NumberLong(0) } on : shard_003 Timestamp(5, 0)
              { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("2305843009213693950") } on : shard_002 Timestamp(6, 0)
              { "_id" : NumberLong("2305843009213693950") } -->> { "_id" : NumberLong("6917529027641081850") } on : shard_001 Timestamp(6, 1)
              { "_id" : NumberLong("4611686018427387900") } -->> { "_id" : NumberLong("6917529027641081850") } on : shard_001 Timestamp(1, 9)
              { "_id" : NumberLong("6917529027641081850") } -->> { "_id" : { "$maxKey" : 1 } } on : shard_001 Timestamp(1, 10)
|
The problem: both chunks end at the same max, 6917529027641081850:
    { "_id" : NumberLong("2305843009213693950") } -->> { "_id" : NumberLong("6917529027641081850") } on : shard_001 Timestamp(4, 1)
    { "_id" : NumberLong("4611686018427387900") } -->> { "_id" : NumberLong("6917529027641081850") } on : shard_001 Timestamp(2, 8)
|
more details:
shard-problem.txt
mongos1-failedShardConfig.log
|
|
Is it possible to send all the mongos logs and a mongodump of the config database?
|
|
In addition, here is the code that was used to create this collection:
db.adminCommand( { shardCollection: "stats_archive_monthly", key: { a: "hashed" }, numInitialChunks: 12 } )
db.stats_archive_monthly.ensureIndex( { "a": 1, "l_id": 1, "t": 1 } )
db.stats_archive_monthly.ensureIndex( { t: 1 }, { expireAfterSeconds: 2678400 + 604800 } )
|
|