Resolution: Duplicate
Major - P3
Affects Version/s: 3.4.4
Component/s: Sharding
Our mongos instances regularly stop refreshing chunks from the config servers for some collections. When such a mongos then tries to split a chunk that another mongos has already split, the split fails with "IncompatibleShardingMetadata: Unable to find chunk with the exact bounds".
Our MongoDB cluster details:
- Many shards + config replica set, each formed by 3 members (1 primary + 2 secondaries)
- 2 mongos
- Balancer is disabled
- Package version 3.4.4, OS: Debian 8 Jessie
- Servers: 6-core Xeon CPU, 64 GB RAM, ~3 TB SSD, ext4 file system
- ~ 40 collections in 1 DB
- Many writes and reads
Typical scenario (shard, collection and field names and values were replaced):
From mongos A logs :
2017-06-21T14:46:42.087+0200 I SHARDING [conn6] Refreshing chunks for collection stats.collectionName based on version 9743|18393||5320f5e96789f4d11460c4a0
2017-06-21T14:46:42.129+0200 I SHARDING [CatalogCacheLoader-1] Refresh for collection stats.collectionName took 42 ms and found version 9743|18393||5320f5e96789f4d11460c4a0
From mongos B logs :
2017-06-21T14:51:05.844+0200 I SHARDING [conn3103094] autosplitted stats.collectionName chunk: shard: shardName, lastmod: 9743|18367||5320f5e96789f4d11460c4a0, [{ _id: { d: 20170621, a: 78, c: 909090, d: 12345678 } }, { _id: { d: 20170621, a: 4444, b: 111111111, c: 222222, d: 3333333 } }) into 3 parts (splitThreshold 67108864) (migrate suggested, but no migrations allowed)
From mongos A logs :
2017-06-21T14:55:21.233+0200 I SHARDING [conn379] Split chunk { splitChunk: "stats.collectionName", configdb: "csReplSet/,,", from: "shardName", keyPattern: { _id: 1.0 }, shardVersion: [ Timestamp 9743000|149465, ObjectId('5320f5e96789f4d11460c4a0') ], min: { _id: { d: 20170621, a: 4444, b: 111111111, c: 222222, d: 3333333 } }, max: { _id: { d: 20170621, a: 4444, b: 121212121, c: 343434, d: 5656565 } }, splitKeys: [ { _id: { d: 20170621, a: 4444, b: 555555555, c: 666666, d: 7777777 } }, { _id: { d: 20170621, a: 4444, b: 888888888, c: 999, d: 000000 } } ] } failed :: caused by :: IncompatibleShardingMetadata: *Unable to find chunk with the exact bounds* [{ _id: { d: 20170621, a: 4444, b: 111111111, c: 222222, d: 3333333 } }, { _id: { d: 20170621, a: 4444, b: 121212121, c: 343434, d: 5656565 } }) at collection version 9743|18399||5320f5e96789f4d11460c4a0
We can see that between the chunk refresh and the split attempt on mongos A, the other mongos had already split that chunk, so the split attempt failed.
The problem is that sometimes a mongos suddenly stops refreshing a collection and does not resume until we restart it or force a refresh, which can be a long time. In those cases, after a few days the mongos attempts bigger and bigger splits:
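For illustration, the staleness visible in these excerpts can be checked mechanically. The sketch below parses versions in the "major|minor||epoch" form printed in the logs and compares the version mongos A found at 14:46:42 with the collection version reported in the error at 14:55:21. The helper names parseChunkVersion and isStale are hypothetical, and treating a different epoch as stale is an assumption of this sketch, not a description of the server's own logic.

```javascript
// Parse a chunk version as printed in the mongos logs: "major|minor||epoch".
function parseChunkVersion(text) {
  const m = /^(\d+)\|(\d+)\|\|([0-9a-f]{24})$/.exec(text.trim());
  if (!m) throw new Error("unrecognized chunk version: " + text);
  return { major: Number(m[1]), minor: Number(m[2]), epoch: m[3] };
}

// Return true when `cached` is older than `current`.
// Assumption of this sketch: a different epoch always counts as stale.
function isStale(cached, current) {
  const a = parseChunkVersion(cached);
  const b = parseChunkVersion(current);
  if (a.epoch !== b.epoch) return true;
  if (a.major !== b.major) return a.major < b.major;
  return a.minor < b.minor;
}

// Versions copied from the log excerpts: what mongos A found on refresh,
// and the collection version reported when its split failed.
const cachedOnMongosA = "9743|18393||5320f5e96789f4d11460c4a0";
const reportedInError = "9743|18399||5320f5e96789f4d11460c4a0";
const mongosAIsBehind = isStale(cachedOnMongosA, reportedInError); // true
```

Here the minor version moved from 18393 to 18399 while mongos A logged no further refresh for the collection.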
2017-06-20T11:39:15.052+0200 I SHARDING [conn2766148] warning: log line attempted (53kB) over max size (10kB), printing beginning and end ... Split chunk { splitChunk: "stats.collectionName", configdb: "csReplSet/,,", from: "shardName", keyPattern: { _id: 1.0 }, shardVersion: [ Timestamp 3000|18274, ObjectId('5667717d46b7ddcd61ef5459') ], min: { _id: { d: 20170611, a: 111111, b: 2222222, c: 333, d: 333 } }, max: { _id: MaxKey }, splitKeys: [ ...... VERY LONG KEYS LIST ...... ] } failed :: caused by :: IncompatibleShardingMetadata: Unable to find chunk with the exact bounds [{ _id: { d: 20170611, a: 111111, b: 2222222, c: 333, d: 333 } }, { _id: MaxKey }) at collection version 3|19540||5667717d46b7ddcd61ef5459
"_id.d" is the insert date: here it is 20170611, but the log entry is dated 2017-06-20. The difference is 9 days, that is, 9 days of failed split attempts. During this period we found no chunk refresh in the logs for the affected collection. These big split attempts slow our shards down a lot (long splitVector queries on the primary members), which is very troublesome for us.
So we have to regularly run db.adminCommand("flushRouterConfig") on the mongos instances to force a refresh.
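A rough sketch of how this workaround could be scripted, under stated assumptions: scan mongos log lines for collections whose last "Refreshing chunks" entry is older than some threshold, then issue flushRouterConfig on each mongos. The function names, the threshold and the host list are hypothetical. The connection factory is passed in as a parameter so the logic can be tried without a cluster. The sketch only sees collections that appear in the supplied log lines at least once.

```javascript
// Matches "Refreshing chunks" lines like the one in the mongos A excerpt.
var REFRESH_RE = /^(\S+)\s+I\s+SHARDING\s+\[[^\]]+\]\s+Refreshing chunks for collection (\S+) based on version/;

// Convert a log timestamp such as "2017-06-21T14:46:42.087+0200" to epoch ms.
// The offset is rewritten to "+02:00" so that Date.parse sees a standard form.
function toMillis(ts) {
  return Date.parse(ts.replace(/([+-]\d{2})(\d{2})$/, "$1:$2"));
}

// Return the namespaces whose most recent refresh in `lines`
// happened more than maxGapMs before nowMs.
function collectionsWithoutRecentRefresh(lines, nowMs, maxGapMs) {
  var last = {};
  lines.forEach(function (line) {
    var m = REFRESH_RE.exec(line);
    if (!m) return;
    var t = toMillis(m[1]);
    if (!(m[2] in last) || t > last[m[2]]) last[m[2]] = t;
  });
  return Object.keys(last).filter(function (ns) {
    return nowMs - last[ns] > maxGapMs;
  });
}

// Run flushRouterConfig against the admin database of every host.
// `connect` is injected; in the mongo shell it could be
//   function (h) { return new Mongo(h); }
function flushRouterConfigOnAll(hosts, connect) {
  var results = {};
  hosts.forEach(function (host) {
    results[host] = connect(host).getDB("admin").runCommand({ flushRouterConfig: 1 });
  });
  return results;
}
```

With the mongos A excerpt as input and "now" set to the time of the failed split (14:55:21), a 5-minute threshold flags stats.collectionName, since its last refresh was logged at 14:46:42.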
Thank you in advance for your help.
Best regards,
- duplicates
SERVER-28418 make the split command on mongod return a stale version error if the requested chunk bounds are not found
- Closed