[SERVER-29810] Mongos no more refreshing chunks and trying impossible splits Created: 23/Jun/17 Updated: 29/Jul/17 Resolved: 23/Jun/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.4.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Slawomir Lukiewski | Assignee: | Esha Maharishi (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
Hello, Our mongos instances regularly stop refreshing chunks from the config servers for some collections. When such a mongos then tries to split a chunk that another mongos has already split, it fails with "IncompatibleShardingMetadata: Unable to find chunk with the exact bounds". Our Mongo cluster details:
Typical scenario (shard, collection, and field names and values were replaced): From mongos A logs:
From mongos B logs:
From mongos A logs:
We can see that between the chunk refresh and the split attempt on mongos A, the other mongos had already split that chunk, so the split attempt failed. The problem is that a mongos sometimes suddenly stops refreshing a collection and stays that way for a long time, until we restart it or force a refresh. In those cases, after a few days, the mongos attempts bigger and bigger splits:
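To make the race above concrete, here is a toy model of it. This is only a sketch, not MongoDB source code: the in-memory chunk table, the numeric bounds, and the `splitChunk` function are all hypothetical. It assumes the split is accepted only when the requested bounds match an existing chunk exactly, which is what the error message suggests.

```javascript
// Toy model only: not MongoDB source code. All names here are hypothetical.
// "Authoritative" chunk table, standing in for what the config servers hold.
var configChunks = [{ min: 0, max: 100 }];

// Split the chunk whose bounds match exactly; report an error otherwise.
function splitChunk(chunks, min, max, splitPoint) {
  var idx = chunks.findIndex(function (c) {
    return c.min === min && c.max === max;
  });
  if (idx === -1) {
    return { ok: 0, errmsg: "Unable to find chunk with the exact bounds" };
  }
  // Replace the matched chunk with its two halves.
  chunks.splice(idx, 1, { min: min, max: splitPoint }, { min: splitPoint, max: max });
  return { ok: 1 };
}

// Mongos A caches the routing table, then mongos B splits the same chunk.
var cachedByA = configChunks.map(function (c) {
  return { min: c.min, max: c.max };
});
var resultB = splitChunk(configChunks, 0, 100, 50);

// Mongos A never refreshed, so it asks for a split with the stale bounds [0, 100).
var staleChunk = cachedByA[0];
var resultA = splitChunk(configChunks, staleChunk.min, staleChunk.max, 25);
```

In this model the request from mongos B succeeds and the one from mongos A fails. Mongos A would keep failing the same way until its cached table is refreshed.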
"_id.d" is the insert date, here 20170611, but as you can see the log entry is dated 2017-06-20. The difference is 9 days: 9 days of failed split attempts. During this period we found no chunk refresh in the logs for the affected collection. These big split attempts slow our shards down considerably (long splitVector queries on the primary members), which is very troublesome for us. So we regularly have to run db.adminCommand("flushRouterConfig") on the mongos to force a refresh. Thank you in advance for your help. Best regards, |
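The manual workaround could be wrapped as shown here, in mongo shell style JavaScript. This is a minimal sketch: the helper name `forceRoutingRefresh` is hypothetical, and it assumes it is given a shell database handle (such as `db`) connected to the affected mongos, since the routing cache is flushed per mongos.

```javascript
// Hypothetical helper; flushRouterConfig is the command named in the report.
// `database` is assumed to be a mongo shell database handle (for example `db`)
// from a session connected to the mongos whose cache should be refreshed.
function forceRoutingRefresh(database) {
  var res = database.adminCommand({ flushRouterConfig: 1 });
  // Fail loudly if the command did not report success.
  if (!res || res.ok !== 1) {
    throw new Error("flushRouterConfig failed: " + JSON.stringify(res));
  }
  return res;
}

// In a mongo shell session connected to a mongos:
//   forceRoutingRefresh(db);
```

Because each mongos keeps its own routing cache, the call would have to be repeated on every mongos that has stopped refreshing.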
| Comments |
| Comment by Kaloian Manassiev [ 23/Jun/17 ] |
|
We have confirmed that this is indeed the same problem as Please follow Best regards, |
| Comment by Slawomir Lukiewski [ 23/Jun/17 ] |
|
Hi Kaloian, Thanks for your answer ! Yes Best regards, |
| Comment by Kaloian Manassiev [ 23/Jun/17 ] |
|
Hi slluk-sa, Thank you for reporting this issue and sorry for the inconvenience it is causing you having to manually refresh the routing cache. Please correct me if I am wrong, but this problem should have started happening only after you upgraded to 3.4.4 - is that correct? In that version we optimized chunk refreshes which were happening too frequently and this is one of the use cases which got regressed. I believe this is a duplicate of We'll look into it and report the details here. Thanks again for your report. Best regards, |