[SERVER-29810] Mongos no longer refreshing chunks and attempting impossible splits Created: 23/Jun/17  Updated: 29/Jul/17  Resolved: 23/Jun/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.4.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Slawomir Lukiewski Assignee: Esha Maharishi (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-28418 make the split command on mongod retu... Closed
Operating System: ALL
Participants:

 Description   

Hello,

Our mongos routers regularly stop refreshing chunks from the config servers for some collections. When one of them then tries to split a chunk that was already split by another mongos, it produces "IncompatibleShardingMetadata: Unable to find chunk with the exact bounds".

Our MongoDB cluster details:

  • Many shards + a config server replica set, each consisting of 3 members (1 primary + 2 secondaries)
  • 2 mongos
  • Balancer is disabled
  • Package version 3.4.4, OS: Debian 8 Jessie
  • Servers: 6-core Xeon CPUs, 64 GB RAM, ~3 TB SSD, ext4 file system
  • ~ 40 collections in 1 DB
  • Many writes and reads

Typical scenario (shard, collection, and field names and values have been replaced):

From mongos A's logs:

2017-06-21T14:46:42.087+0200 I SHARDING [conn6] Refreshing chunks for collection stats.collectionName based on version 9743|18393||5320f5e96789f4d11460c4a0
2017-06-21T14:46:42.129+0200 I SHARDING [CatalogCacheLoader-1] Refresh for collection stats.collectionName took 42 ms and found version 9743|18393||5320f5e96789f4d11460c4a0

From mongos B's logs:

2017-06-21T14:51:05.844+0200 I SHARDING [conn3103094] autosplitted stats.collectionName chunk: shard: shardName, lastmod: 9743|18367||5320f5e96789f4d11460c4a0, [{ _id: { d: 20170621, a: 78, c: 909090, d: 12345678 } }, { _id: { d: 20170621, a: 4444, b: 111111111, c: 222222, d: 3333333 } }) into 3 parts (splitThreshold 67108864) (migrate suggested, but no migrations allowed)

From mongos A's logs:

2017-06-21T14:55:21.233+0200 I SHARDING [conn379] Split chunk { splitChunk: "stats.collectionName", configdb: "csReplSet/172.16.18.28:27025,172.16.18.3:27025,172.16.18.30:27025", from: "shardName", keyPattern: { _id: 1.0 }, shardVersion: [ Timestamp 9743000|149465, ObjectId('5320f5e96789f4d11460c4a0') ], min: { _id: { d: 20170621, a: 4444, b: 111111111, c: 222222, d: 3333333 } }, max: { _id: { d: 20170621, a: 4444, b: 121212121, c: 343434, d: 5656565 } }, splitKeys: [ { _id: { d: 20170621, a: 4444, b: 555555555, c: 666666, d: 7777777 } }, { _id: { d: 20170621, a: 4444, b: 888888888, c: 999, d: 000000 } } ] } failed :: caused by :: IncompatibleShardingMetadata: *Unable to find chunk with the exact bounds* [{ _id: { d: 20170621, a: 4444, b: 111111111, c: 222222, d: 3333333 } }, { _id: { d: 20170621, a: 4444, b: 121212121, c: 343434, d: 5656565 } }) at collection version 9743|18399||5320f5e96789f4d11460c4a0

We can see that between the chunk refresh and the split attempt on mongos A, the other mongos had already split that chunk, so the split attempt failed.
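
To confirm what happened, one can query the config database directly from a mongo shell connected to a mongos. The sketch below is illustrative only: the namespace comes from the logs above and the bound values are placeholders, so both would need to be adjusted for a real collection. It lists the chunks that now cover the range mongos A still believed was a single chunk; seeing several chunks there means another router has already performed the split.

db.getSiblingDB("config").chunks.find(
    { ns: "stats.collectionName",                  // sharded namespace from the logs above
      "min._id.d": 20170621, "min._id.a": 4444 }   // placeholder bounds of the stale chunk
).sort({ min: 1 }).forEach(function (c) {
    // Print each chunk's bounds and its version (lastmod).
    print(tojson(c.min) + " -> " + tojson(c.max) + "  lastmod: " + tojson(c.lastmod));
});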

The problem is that sometimes a mongos suddenly stops refreshing a collection until we restart it or force a refresh, so the stale state can persist for a long time. In those cases, after a few days the mongos attempts bigger and bigger splits:

2017-06-20T11:39:15.052+0200 I SHARDING [conn2766148] warning: log line attempted (53kB) over max size (10kB), printing beginning and end ... Split chunk { splitChunk: "stats.collectionName", configdb: "csReplSet/172.16.18.28:27025,172.16.18.3:27025,172.16.18.30:27025", from: "shardName", keyPattern: { _id: 1.0 }, shardVersion: [ Timestamp 3000|18274, ObjectId('5667717d46b7ddcd61ef5459') ], min: { _id: { d: 20170611, a: 111111, b: 2222222, c: 333, d: 333 } }, max: { _id: MaxKey }, splitKeys: [ ...... VERY LONG KEYS LIST ...... ] } failed :: caused by :: IncompatibleShardingMetadata: Unable to find chunk with the exact bounds [{ _id: { d: 20170611, a: 111111, b: 2222222, c: 333, d: 333 } }, { _id: MaxKey }) at collection version 3|19540||5667717d46b7ddcd61ef5459

"_id.d" is the insert date, here was 20170611 but as you can see the log entry date is 2017-06-20. The diff is 9 days, 9 days of failed split tries. During this period, we found no chunk refresh in logs for the concerned collection. Theses big split tries slows a lot our shards (long splitVector queries on primary members) which is very troublesome for us.

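To spot these expensive attempts while they run, one can look for long-running splitVector commands on a shard primary. This is only a rough sketch: the "query.splitVector" field path is an assumption about how a 3.4 mongod reports the command in currentOp output.

db.currentOp({ "query.splitVector": { $exists: true } }).inprog.forEach(function (op) {
    // Print the namespace and how long each splitVector command has been running.
    print(op.ns + "  secs_running: " + op.secs_running);
});
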
As a workaround, we regularly have to run db.adminCommand("flushRouterConfig") on each mongos to force a refresh.
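
For reference, the workaround can be scripted from a single mongo shell session. The router addresses below are hypothetical; the point is that each mongos keeps its own routing table cache, so the command must be run on every router.

// Hypothetical mongos addresses, to be replaced with the real routers.
var routers = ["mongos-a.example.net:27017", "mongos-b.example.net:27017"];

routers.forEach(function (host) {
    // Connect to the router and flush its cached routing metadata.
    var conn = new Mongo(host);
    var res = conn.getDB("admin").runCommand({ flushRouterConfig: 1 });
    print(host + ": " + tojson(res));
});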

Thank you in advance for your help.

Best regards,
Slawomir



 Comments   
Comment by Kaloian Manassiev [ 23/Jun/17 ]

We have confirmed that this is indeed the same problem as SERVER-28418, so I am closing this ticket as a duplicate.

Please follow SERVER-28418 for more information on when the fix gets released.

Best regards,
-Kal.

Comment by Slawomir Lukiewski [ 23/Jun/17 ]

Hi Kaloian,

Thanks for your answer!
Yes, this problem probably started happening only after we upgraded to 3.4.4. We spent only a few weeks on 3.4.3 (we were on 3.2.8 before that), so I am not 100% sure, but the chunk refresh optimization you mention seems to fit our case. On 3.2.8 we also experienced some IncompatibleShardingMetadata errors, but it was not a real problem because refreshes were more regular.

Yes, SERVER-28418 sounds like exactly the fix we need!
I look forward to the 3.4.6 release!

Best regards,
Slawomir

Comment by Kaloian Manassiev [ 23/Jun/17 ]

Hi slluk-sa,

Thank you for reporting this issue, and sorry for the inconvenience of having to manually refresh the routing cache.

Please correct me if I am wrong, but this problem should have started happening only after you upgraded to 3.4.4 - is that correct? In that version we optimized chunk refreshes, which were happening too frequently, and this is one of the use cases that regressed. I believe this is a duplicate of SERVER-28418, which we have fixed on the latest master branch and which is waiting to be backported for version 3.4.6.

We'll look into it and report the details here.

Thanks again for your report.

Best regards,
-Kal.
