Not all chunks are moved or created correctly! The config is unusable for some collections!
Test setup: 4 single shards (no replica sets), 3 config servers, 1 mongos.
Create 10 threads:
every thread creates 1 schema and 1 collection with sharding enabled,
always with a hashed shard key on _id and no numInitialChunks.
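For reference, a minimal mongo shell sketch of the per-thread setup (database and collection names are only illustrative, not the exact test code):

    // run against mongos: one database and one hashed-sharded collection per thread
    sh.enableSharding("ci_400000000000005")
    sh.shardCollection("ci_400000000000005.customer_informations", { "_id" : "hashed" })
    // note: no numInitialChunks option is passed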
Case 1 (with deactivated balancer):
Not all chunks are moved or created correctly!
{ "_id" : "ci_400000000000005", "partitioned" : true, "primary" : "shard_001" }
|
ci_400000000000005.customer_informations
|
shard key: { "_id" : "hashed" }
|
chunks:
|
shard_002 1
|
shard_003 1
|
shard_004 1
|
shard_001 1
|
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-4611686018427387900") } on : shard_002 Timestamp(2, 0)
|
{ "_id" : NumberLong("-4611686018427387900") } -->> { "_id" : NumberLong(0) } on : shard_003 Timestamp(3, 0)
|
{ "_id" : NumberLong(0) } -->> { "_id" : NumberLong("4611686018427387900") } on : shard_004 Timestamp(4, 0)
|
{ "_id" : NumberLong("4611686018427387900") } -->> { "_id" : { "$maxKey" : 1 } } on : shard_001 Timestamp(4, 1)
|
|
|
{ "_id" : "ci_400000000000019", "partitioned" : true, "primary" : "shard_004" }
|
ci_400000000000019.customer_informations
|
shard key: { "_id" : "hashed" }
|
chunks:
|
shard_004 8
|
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-6917529027641081850") } on : shard_004 Timestamp(1, 4)
|
{ "_id" : NumberLong("-6917529027641081850") } -->> { "_id" : NumberLong("-4611686018427387900") } on : shard_004 Timestamp(1, 5)
|
{ "_id" : NumberLong("-4611686018427387900") } -->> { "_id" : NumberLong("-2305843009213693950") } on : shard_004 Timestamp(1, 6)
|
{ "_id" : NumberLong("-2305843009213693950") } -->> { "_id" : NumberLong(0) } on : shard_004 Timestamp(1, 7)
|
{ "_id" : NumberLong(0) } -->> { "_id" : NumberLong("2305843009213693950") } on : shard_004 Timestamp(1, 8)
|
{ "_id" : NumberLong("2305843009213693950") } -->> { "_id" : NumberLong("4611686018427387900") } on : shard_004 Timestamp(1, 9)
|
{ "_id" : NumberLong("4611686018427387900") } -->> { "_id" : NumberLong("6917529027641081850") } on : shard_004 Timestamp(1, 10)
|
{ "_id" : NumberLong("6917529027641081850") } -->> { "_id" : { "$maxKey" : 1 } } on : shard_004 Timestamp(1, 11)
|
|
|
{ "_id" : "ci_400000000000025", "partitioned" : true, "primary" : "shard_003" }
|
ci_400000000000025.customer_informations
|
shard key: { "_id" : "hashed" }
|
chunks:
|
shard_001 2
|
shard_002 2
|
shard_004 2
|
shard_003 2
|
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-6917529027641081850") } on : shard_001 Timestamp(2, 0)
|
{ "_id" : NumberLong("-6917529027641081850") } -->> { "_id" : NumberLong("-4611686018427387900") } on : shard_002 Timestamp(3, 0)
|
{ "_id" : NumberLong("-4611686018427387900") } -->> { "_id" : NumberLong("-2305843009213693950") } on : shard_004 Timestamp(4, 0)
|
{ "_id" : NumberLong("-2305843009213693950") } -->> { "_id" : NumberLong(0) } on : shard_001 Timestamp(5, 0)
|
{ "_id" : NumberLong(0) } -->> { "_id" : NumberLong("2305843009213693950") } on : shard_002 Timestamp(6, 0)
|
{ "_id" : NumberLong("2305843009213693950") } -->> { "_id" : NumberLong("4611686018427387900") } on : shard_004 Timestamp(7, 0)
|
{ "_id" : NumberLong("4611686018427387900") } -->> { "_id" : NumberLong("6917529027641081850") } on : shard_003 Timestamp(7, 1)
|
{ "_id" : NumberLong("6917529027641081850") } -->> { "_id" : { "$maxKey" : 1 } } on : shard_003 Timestamp(1, 11)
|
|
{ "_id" : "ci_400000000000006", "partitioned" : true, "primary" : "shard_002" }
|
ci_400000000000006.customer_informations
|
shard key: { "_id" : "hashed" }
|
chunks:
|
shard_001 4
|
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-4611686018427387900") } on : shard_001 Timestamp(1, 0)
|
{ "_id" : NumberLong("-4611686018427387900") } -->> { "_id" : NumberLong(0) } on : shard_001 Timestamp(1, 1)
|
{ "_id" : NumberLong(0) } -->> { "_id" : NumberLong("4611686018427387900") } on : shard_001 Timestamp(1, 2)
|
{ "_id" : NumberLong("4611686018427387900") } -->> { "_id" : { "$maxKey" : 1 } } on : shard_001 Timestamp(1, 3)
|
Case 2 (with active balancer):
Not all chunks are moved or created correctly!
{ "_id" : "ci_400000000000018", "partitioned" : true, "primary" : "shard_003" }
|
ci_400000000000018.customer_informations
|
shard key: { "_id" : "hashed" }
|
chunks:
|
shard_003 4
|
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-4611686018427387900") } on : shard_003 Timestamp(1, 0)
|
{ "_id" : NumberLong("-4611686018427387900") } -->> { "_id" : NumberLong(0) } on : shard_003 Timestamp(1, 1)
|
{ "_id" : NumberLong(0) } -->> { "_id" : NumberLong("4611686018427387900") } on : shard_003 Timestamp(1, 2)
|
{ "_id" : NumberLong("4611686018427387900") } -->> { "_id" : { "$maxKey" : 1 } } on : shard_003 Timestamp(1, 3)
|
|
Fri Apr 04 19:52:29.467 [Balancer] moving chunk ns: ci_400000000000018.customer_informations moving ( ns: ci_400000000000018.customer_informations, shard: shard_003:127.0.0.1:20317, lastmod: 1|0||000000000000000000000000, min: { _id: MinKey }, max: { _id: -4611686018427387900 }) shard_003:127.0.0.1:20317 -> shard_001:127.0.0.1:20117
Fri Apr 04 19:52:29.552 [Balancer] moveChunk result: { who: { _id: "ci_400000000000018.customer_informations", process: "WT-INT34:20317:1396633596:41", state: 1, ts: ObjectId('533ef0092b8745cddb99e33e'), when: new Date(1396633609505), who: "WT-INT34:20317:1396633596:41:conn17:41", why: "migrate-{ _id: MinKey }" }, ok: 0.0, errmsg: "the collection metadata could not be locked with lock migrate-{ _id: MinKey }" }
Fri Apr 04 19:52:29.552 [Balancer] balancer move failed: { who: { _id: "ci_400000000000018.customer_informations", process: "WT-INT34:20317:1396633596:41", state: 1, ts: ObjectId('533ef0092b8745cddb99e33e'), when: new Date(1396633609505), who: "WT-INT34:20317:1396633596:41:conn17:41", why: "migrate-{ _id: MinKey }" }, ok: 0.0, errmsg: "the collection metadata could not be locked with lock migrate-{ _id: MinKey }" } from: shard_003 to: shard_001 chunk: min: { _id: MinKey } max: { _id: -4611686018427387900 }
See SERVER-12515 and SERVER-12638.
|
|
Hi Björn Bullerdieck,
we haven't heard back from you for some time, so I'm going to mark this ticket as resolved. If this is still an issue for you, please feel free to re-open the ticket and provide the config dump Asya requested above, so we can continue diagnosing this issue.
Regards,
Ramón.
|
|
Hi Björn,
Have you had a chance to review Asya's comment above? Are you able to provide the config dump she asked for? We would need it to continue trying to reproduce this issue.
Regards,
Thomas
|
|
Thank you for the log, that is helpful. Would it be possible to dump the contents of the config DB after this happens, zip it up, and attach it to this ticket? The logs lack some details that may be visible in the config DB. You can use mongodump -d config against mongos or any of the three config servers.
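For example (host, port, and output directory below are placeholders):

    mongodump --host <mongos-or-configsvr-host> --port <port> -d config -o config_dump
    # then compress the config_dump directory (e.g. into a .zip) and attach it to the ticket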
|
|
The problem is that the balancing process does not finish; it always fails.
20140430_1630_mongos1.log
Steps (a rough shell sketch follows below):
create shards
disable balancer
(try to drop databases)
create 10 databases + collections with sharding
activate balancer
(the loop starts at 2014-04-30T16:04:50.408+0200 [Balancer] distributed lock 'balancer/VM01-SHARD-TEST:27017:1398866406:41' acquired, ts : 53610302eeb9507e5adc91d7)
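Roughly, in mongo shell terms (an illustrative sketch of the steps above, not the actual test harness; names are made up):

    // assumes the shards have already been added (sh.addShard(...))
    sh.stopBalancer()                                      // disable balancer
    for (var i = 1; i <= 10; i++) {
        var dbName = "ci_" + i;                            // hypothetical name pattern
        // db.getSiblingDB(dbName).dropDatabase()          // the "(try to drop databases)" step
        sh.enableSharding(dbName)
        sh.shardCollection(dbName + ".customer_informations", { "_id" : "hashed" })
    }
    sh.startBalancer()                                     // activate balancer again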
|
|
Björn Bullerdieck, just because the collection is not pre-split/balanced does not make it corrupt. Sharding is still enabled. The collections will get balanced when the other balancing process finishes with the previous round.
The errors I've seen so far just show that the right thing is happening: you cannot drop a DB that is currently being manipulated by another process (in a sharded cluster this is not a single instantaneous action, so it has to be guarded against multiple threads trying to do conflicting things to it).
If by corrupt you mean you cannot insert or query data in it, can you provide the logs and the exact steps that are failing?
|
|
@Thomas
No, it doesn't resolve the problem.
@Asya
"Some collections are not usable after creation; the balancer can't balance."
Some collections only have 4 chunks, and these hit the "migration is already going on" problem (whereas 2.4.9 works fine).
Chunks are not moved, so there is no sharding, only the same log entries:
2014-04-30T12:42:04.373+0200 [mongosMain] db version v2.6.1-rc0
...
2014-04-30T13:37:50.439+0200 [Balancer] distributed lock 'balancer/VM01-SHARD-TEST:27017:1398854524:41' unlocked.
2014-04-30T13:37:56.586+0200 [Balancer] distributed lock 'balancer/VM01-SHARD-TEST:27017:1398854524:41' acquired, ts : 5360e0945e354a9328121f2f
2014-04-30T13:37:57.057+0200 [Balancer] ns: ci_400000000000002.test_collection going to move { _id: "ci_400000000000002.test_collection-_id_MinKey", lastmod: Timestamp 1000|0, lastmodEpoch: ObjectId('5360d3c75e354a9328121d81'), ns: "ci_400000000000002.test_collection", min: { _id: MinKey }, max: { _id: -4611686018427387900 }, shard: "shard_001" } from: shard_001 to: shard_002 tag []
2014-04-30T13:37:57.068+0200 [Balancer] ns: ci_400000000000003.test_collection going to move { _id: "ci_400000000000003.test_collection-_id_MinKey", lastmod: Timestamp 1000|0, lastmodEpoch: ObjectId('5360d3c85e354a9328121d88'), ns: "ci_400000000000003.test_collection", min: { _id: MinKey }, max: { _id: -4611686018427387900 }, shard: "shard_001" } from: shard_001 to: shard_002 tag []
2014-04-30T13:37:57.101+0200 [Balancer] moving chunk ns: ci_400000000000002.test_collection moving ( ns: ci_400000000000002.test_collection, shard: shard_001:VM01-SHARD-TEST:20117, lastmod: 1|0||000000000000000000000000, min: { _id: MinKey }, max: { _id: -4611686018427387900 }) shard_001:VM01-SHARD-TEST:20117 -> shard_002:VM01-SHARD-TEST:20217
2014-04-30T13:37:57.106+0200 [Balancer] moveChunk result: { who: { _id: "ci_400000000000002.test_collection", state: 1, who: "VM01-SHARD-TEST:20117:1398854615:41:conn15:41", process: "VM01-SHARD-TEST:20117:1398854615:41", when: new Date(1398857361521), why: "migrate-{ _id: MinKey }", ts: ObjectId('5360de9169d2c1e0a8c7c94c') }, ok: 0.0, errmsg: "the collection metadata could not be locked with lock migrate-{ _id: MinKey }" }
2014-04-30T13:37:57.107+0200 [Balancer] balancer move failed: { who: { _id: "ci_400000000000002.test_collection", state: 1, who: "VM01-SHARD-TEST:20117:1398854615:41:conn15:41", process: "VM01-SHARD-TEST:20117:1398854615:41", when: new Date(1398857361521), why: "migrate-{ _id: MinKey }", ts: ObjectId('5360de9169d2c1e0a8c7c94c') }, ok: 0.0, errmsg: "the collection metadata could not be locked with lock migrate-{ _id: MinKey }" } from: shard_001 to: shard_002 chunk: min: { _id: MinKey } max: { _id: -4611686018427387900 }
2014-04-30T13:37:57.107+0200 [Balancer] moving chunk ns: ci_400000000000003.test_collection moving ( ns: ci_400000000000003.test_collection, shard: shard_001:VM01-SHARD-TEST:20117, lastmod: 1|0||000000000000000000000000, min: { _id: MinKey }, max: { _id: -4611686018427387900 }) shard_001:VM01-SHARD-TEST:20117 -> shard_002:VM01-SHARD-TEST:20217
2014-04-30T13:37:57.432+0200 [Balancer] moveChunk result: { who: { _id: "ci_400000000000003.test_collection", state: 1, who: "VM01-SHARD-TEST:20117:1398854615:41:conn15:41", process: "VM01-SHARD-TEST:20117:1398854615:41", when: new Date(1398857362440), why: "migrate-{ _id: MinKey }", ts: ObjectId('5360de9269d2c1e0a8c7c94e') }, ok: 0.0, errmsg: "the collection metadata could not be locked with lock migrate-{ _id: MinKey }" }
2014-04-30T13:37:57.432+0200 [Balancer] balancer move failed: { who: { _id: "ci_400000000000003.test_collection", state: 1, who: "VM01-SHARD-TEST:20117:1398854615:41:conn15:41", process: "VM01-SHARD-TEST:20117:1398854615:41", when: new Date(1398857362440), why: "migrate-{ _id: MinKey }", ts: ObjectId('5360de9269d2c1e0a8c7c94e') }, ok: 0.0, errmsg: "the collection metadata could not be locked with lock migrate-{ _id: MinKey }" } from: shard_001 to: shard_002 chunk: min: { _id: MinKey } max: { _id: -4611686018427387900 }
2014-04-30T13:37:58.112+0200 [Balancer] distributed lock 'balancer/VM01-SHARD-TEST:27017:1398854524:41' unlocked.
2014-04-30T13:38:04.266+0200 [Balancer] distributed lock 'balancer/VM01-SHARD-TEST:27017:1398854524:41' acquired, ts : 5360e09c5e354a9328121f30
...

Trying to drop the collections returns:
mongos> db.dropDatabase()
{
    "code" : 13331,
    "ok" : 0,
    "errmsg" : "exception: collection's metadata is undergoing changes. Please try again."
}
|
|
|
Hi Björn,
Did you have a chance to look at Asya's questions? Can you repeat your test with version 2.6.1-rc0 and see if that resolves the problem?
Kind Regards,
Thomas
|
|
Could you clarify how the collections are corrupt?
The messages in the logs simply indicate that a migration won't be attempted while another migration is already going on.
When you create a database, it is normally placed on the shard that has the fewest DBs, but since you created all 10 DBs at approximately the same time, all 10 were placed on the same shard. You can avoid this by staggering your DB creations slightly (see the sketch below).
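A minimal sketch of what such staggering could look like (the database names and the delay are purely illustrative):

    // mongo shell, connected to mongos; sleep() takes milliseconds
    ["db_a", "db_b", "db_c"].forEach(function (name) {   // hypothetical database names
        sh.enableSharding(name);                         // enable sharding for the database
        sleep(2000);                                     // pause between creations
    });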
|
|
Why did all 10 databases get "primary" : "shard_001"? Is that a 2.6.0 bug/feature?
|
|
Sharding is not usable! Please try to fix this.
Tested now with 2.6.0: 20140409_curruptCreatedCollections.rar
Some collections are not usable after creation; the balancer can't balance.
Collections with only 4 chunks (4 replica sets) after enabling sharding with a hashed shard key are corrupt.
Abstract version:
{ "_id" : "ci_400000000000002", "partitioned" : true, "primary" : "shard_001" }
|
ci_400000000000002.customer_informations
|
shard key: { "_id" : "hashed" }
|
chunks:
|
shard_001 4
|
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-4611686018427387900") } on : shard_001 Timestamp(1, 0)
|
{ "_id" : NumberLong("-4611686018427387900") } -->> { "_id" : NumberLong(0) } on : shard_001 Timestamp(1, 1)
|
{ "_id" : NumberLong(0) } -->> { "_id" : NumberLong("4611686018427387900") } on : shard_001 Timestamp(1, 2)
|
{ "_id" : NumberLong("4611686018427387900") } -->> { "_id" : { "$maxKey" : 1 } } on : shard_001 Timestamp(1, 3)
|
2014-04-09T10:40:45.919+0200 [Balancer] ns: ci_400000000000002.customer_informations going to move { _id: "ci_400000000000002.customer_informations-_id_MinKey", lastmod: Timestamp 1000|0, lastmodEpoch: ObjectId('534505405d50c57670350bf3'), ns: "ci_400000000000002.customer_informations", min: { _id: MinKey }, max: { _id: -4611686018427387900 }, shard: "shard_001" } from: shard_001 to: shard_002 tag []
2014-04-09T10:40:46.182+0200 [Balancer] moving chunk ns: ci_400000000000002.customer_informations moving ( ns: ci_400000000000002.customer_informations, shard: shard_001:WT-INT34:20117, lastmod: 1|0||000000000000000000000000, min: { _id: MinKey }, max: { _id: -4611686018427387900 }) shard_001:WT-INT34:20117 -> shard_002:WT-INT34:20217
2014-04-09T10:40:46.188+0200 [Balancer] moveChunk result: { who: { _id: "ci_400000000000002.customer_informations", state: 1, who: "WT-INT34:20117:1397032274:41:conn4:18467", process: "WT-INT34:20117:1397032274:41", when: new Date(1397032277411), why: "migrate-{ _id: -4611686018427387900 }", ts: ObjectId('53450555809d2ab628e84ef4') }, ok: 0.0, errmsg: "the collection metadata could not be locked with lock migrate-{ _id: MinKey }" }
2014-04-09T10:40:46.188+0200 [Balancer] balancer move failed: { who: { _id: "ci_400000000000002.customer_informations", state: 1, who: "WT-INT34:20117:1397032274:41:conn4:18467", process: "WT-INT34:20117:1397032274:41", when: new Date(1397032277411), why: "migrate-{ _id: -4611686018427387900 }", ts: ObjectId('53450555809d2ab628e84ef4') }, ok: 0.0, errmsg: "the collection metadata could not be locked with lock migrate-{ _id: MinKey }" } from: shard_001 to: shard_002 chunk: min: { _id: MinKey } max: { _id: -4611686018427387900 }
|
|
|
to "3. more than one chunk range overlap"
sort on _id was a bad idea. Found no chunks ranges overlap now.
But if i use find() i get:
QueryFailure flag was too many retries of stale version info ( ns : ci_400000000000001.customer_informations, received : 0|0||000000000000000000000000, wanted : 8|79||533d925795b7b0fe7cf515f9, send ) (response was { "$err" : "too many retries of stale version info ( ns : ci_400000000000001.customer_informations, received : 0|0||000000000000000000000000, wanted : 8|79||533d925795b7b0fe7cf515f9, send )", "code" : 13388 }).
Type: MongoDB.Driver.MongoQueryException
Stack: at MongoDB.Driver.Internal.MongoReplyMessage`1.ReadFrom(BsonBuffer buffer, IBsonSerializationOptions serializationOptions)
   at MongoDB.Driver.Internal.MongoConnection.ReceiveMessage[TDocument](BsonBinaryReaderSettings readerSettings, IBsonSerializationOptions serializationOptions)
   at MongoDB.Driver.MongoCursorEnumerator`1.GetReply(MongoConnection connection, MongoRequestMessage message)
   at MongoDB.Driver.MongoCursorEnumerator`1.GetFirst()
   at MongoDB.Driver.MongoCursorEnumerator`1.MoveNext()
   at MangoUI.MUtil.Bson.ReadFrom(MongoCursor`1 cursor)
   at MangoUI.MCollections.Fetch(String db, String col, QueryDocument query, Int32 skip, Int32 limit)
   at MangoUI.ComBrowseBson.CDUVDoJdTSn6rXJwTMb8(Object , Object , Object , Object , Int32 skip, Int32 limit)
   at MangoUI.ComBrowseBson.FetchDocuments()
|
|
|
Maybe a different problem, but with overlapping chunk ranges. No logfiles here; I tried it on a different system.
Differences to SERVER-12638:
1. corrupt configs with many chunks
2. balancer freezes when trying to move chunks
chunks:
    shard_001  349
    shard_002  11
    shard_004  8
    shard_003  8
3. more than one chunk range overlap
see the chunks from the config in 20140407_CorruptConfigsWithManyChunks.txt
|
|
Logs and sh.status()
|
|
Differences to SERVER-12515:
1. not specifying numInitialChunks
2. corrupt configs
Expected chunks for all sharded databases:
chunks:
    shard_001  2
    shard_002  2
    shard_004  2
    shard_003  2

Case 1: activated balancer
some databases with an imbalance of chunks
the balancer cannot move chunks from 3 databases
see createCollectionsBalancerEnabled.rar
Case 2: deactivated balancer when activating sharding
some databases with an imbalance of chunks
the balancer cannot move chunks from 5 databases
see createCollectionsBalancerDisabled.rar
|
|
Hi Björn,
Can you describe how this is different to the issue tracked on SERVER-12638?
I don't see any chunk range overlaps or gaps in your sh.status() output. Are you concerned about the imbalance of chunks across the shards? I.e. some collections have all chunks on a single shard?
Could you attach the full logfile from when you started the 10 threads?
Thanks,
Thomas