[SERVER-74980] Excessive logging of "Refreshed cached collection" log messages introduced in 5.0 Created: 17/Mar/23 Updated: 29/Oct/23 Resolved: 12/May/23

| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 7.1.0-rc0, 7.0.0-rc1, 6.0.7, 5.0.19 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Ian Springer | Assignee: | Sergi Mateo Bellido |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sharding-wfbf-day |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Assigned Teams: | Sharding EMEA |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v7.0, v6.3, v6.0, v5.0 |
| Steps To Reproduce: | Run a 5.0.15 sharded cluster with lots of collections |
| Sprint: | Sharding EMEA 2023-04-03, Sharding EMEA 2023-04-17, Sharding EMEA 2023-05-01, Sharding EMEA 2023-05-15 |
| Description |
We recently upgraded one of our clusters from 4.4.16 to 5.0.15. Prior to upgrading, we saw about 50K "Refreshed cached collection" log messages per hour. Now we see about 50K per minute, an increase of roughly two orders of magnitude. Looking at the source code, it appears the lookupCollection method in catalog_cache.cpp was refactored in 5.0, introducing a regression where the "Refreshed cached collection" message is effectively always logged at level 0, versus 4.4, where it was only logged at level 0 on the very first call or when the version had changed. From 4.4 (https://github.com/mongodb/mongo/blob/v4.4/src/mongo/s/catalog_cache.cpp#L756-L774):
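The code blocks from the original report did not survive this export; what follows is a paraphrased sketch of the 4.4 log-level selection, not a verbatim copy of the linked source (the log id and attribute list are illustrative):

```cpp
// Sketch of the v4.4 logic: log at level 0 only on the very first lookup for
// the namespace (!existingRoutingInfo) or when the collection version actually
// changed; otherwise demote the message to debug level 1.
const int logLevel =
    (!existingRoutingInfo ||
     existingRoutingInfo->getVersion() != newRoutingInfo->getVersion())
        ? 0
        : 1;
LOGV2_FOR_CATALOG_REFRESH(24100 /* illustrative id */,
                          logLevel,
                          "Refreshed cached collection",
                          "namespace"_attr = nss,
                          "newVersion"_attr = newRoutingInfo->getVersion());
```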
From 5.0 (https://github.com/mongodb/mongo/blob/v5.0/src/mongo/s/catalog_cache.cpp#L667-L674):
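Again a paraphrased sketch rather than the verbatim source; the key difference is that the first operand of the log-level condition became isIncremental:

```cpp
// Sketch of the v5.0 logic: isIncremental is true on every lookup except the
// very first one for the namespace, so this condition selects level 0 on
// essentially every refresh, even when the version did not change.
LOGV2_FOR_CATALOG_REFRESH(
    4619901 /* illustrative id */,
    isIncremental || newComparableVersion != previousVersion ? 0 : 1,
    "Refreshed cached collection",
    "namespace"_attr = nss,
    "lookupSinceVersion"_attr = previousVersion,
    "newVersion"_attr = newComparableVersion);
```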
So the 4.4 logic was to log at level 0 if it's the very first call for the collection (!existingRoutingInfo) or if the version has changed. In 5.0, the logic changed to log at level 0 on any call other than the very first one (isIncremental), or when the version has changed. I believe this change in logic was unintentional, as it results in a huge amount of level 0 log spam. I think what was intended is:
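A sketch of the intended condition, using the same illustrative attribute names as above:

```cpp
// Negating isIncremental restores the 4.4 behavior: level 0 is reserved for
// the first lookup of a namespace or an actual version change.
LOGV2_FOR_CATALOG_REFRESH(
    4619901 /* illustrative id */,
    !isIncremental || newComparableVersion != previousVersion ? 0 : 1,
    "Refreshed cached collection",
    "namespace"_attr = nss,
    "lookupSinceVersion"_attr = previousVersion,
    "newVersion"_attr = newComparableVersion);
```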
Notice "isIncremental" was changed to "!isIncremental". |
| Comments |
| Comment by Githook User [ 22/May/23 ] |
Author: Sergi Mateo Bellido (sergi.mateo-bellido@mongodb.com, smateo)
Message: (cherry picked from commit dcd8c050fb807cd6a30f1c3f833f4be23c22fdcf)
| Comment by Githook User [ 22/May/23 ] |
Author: Sergi Mateo Bellido (sergi.mateo-bellido@mongodb.com, smateo)
Message: (cherry picked from commit dcd8c050fb807cd6a30f1c3f833f4be23c22fdcf)
| Comment by Githook User [ 15/May/23 ] |
Author: Sergi Mateo Bellido (sergi.mateo-bellido@mongodb.com, smateo)
Message: (cherry picked from commit dcd8c050fb807cd6a30f1c3f833f4be23c22fdcf)
| Comment by Githook User [ 11/May/23 ] |
Author: Sergi Mateo Bellido (sergi.mateo-bellido@mongodb.com, smateo)
Message:
| Comment by Matt Panton [ 10/May/23 ] |
Hello Scott, I am Matt, a Product Manager at MongoDB, and I would love to talk in more detail about your experience operating a sharded MongoDB cluster and to learn how you're using custom utilities to administer the cluster in a way that meets your needs. I've reached out to you directly via email. Thanks!
| Comment by Scott Glajch [ 09/May/23 ] |
First off, thank you so much for your attention here. I know I was very, um, direct in my comment above, but it's a huge pain that we have no way to mitigate without a fix from you.

Separately, I am curious about the wording "on steady state, i.e. the cluster is balanced." Is it the MongoDB team's expectation that a cluster will ever be considered "balanced"? After running MongoDB for almost a decade, I can say that I've never found a cluster to be balanced once it reaches a certain size. Once your collections start having thousands, or tens or hundreds of thousands, of chunks in them, there will always be some constant ongoing imbalance across the collections, and therefore the balancer will always have something to move during its rounds.

In fact, any time our clusters have had meaningful balancer work to do (e.g. we've added new shards, or one collection has a lot of data inserted into it before the balancer can "catch up"), a major gripe I have is that the balancer seems to have no logic regarding which chunks it decides to move; it simply moves whatever chunks it finds first, instead of prioritizing chunks with higher imbalances (either by number of chunks for larger collections, or by percentage of chunks for smaller collections). So instead we have our own utilities to turn off the balancer, do our own chunk moves manually, and turn it back on later (which runs about 30-50x faster than the built-in balancer, even with the wait-for-deletes and write-concern flags turned on).

I guess MongoDB balancing philosophy isn't in scope for this bug, but I was curious.
| Comment by Sergi Mateo Bellido [ 05/May/23 ] |
Hi sglajch@evergage.com, thank you for the extra information: on the config server, forced refreshes are triggered periodically as part of each balancing round. I have created a cluster with 1000 sharded collections and verified this behavior. Thank you!
| Comment by Scott Glajch [ 28/Apr/23 ] |
We have 50,000 entries of this log line every minute. Are you saying that we're generating 50,000 new collection versions every minute, or that something is telling the server to force a refresh that often? I think it's more likely that the primary config server (where we see these messages) just periodically refreshes its metadata and loops over every collection, and this log-level change means that we get one log line for every collection at each refresh interval, which is way too spammy.
| Comment by Sergi Mateo Bellido [ 24/Mar/23 ] |
In theory we are only executing that function when:

- it's the first time we look up the routing information for a namespace (there is no cached entry yet), or
- a refresh has been triggered, e.g. because a DDL operation (such as moveChunk) generated a new version or a forced refresh was requested.

Having those logs will help us understand the internal state of the cache and why we are performing those refreshes. Thank you!
| Comment by Ian Springer [ 23/Mar/23 ] |
Right, so since isIncremental is almost always true, you don't want to log the message every time it's true, right?
| Comment by Sergi Mateo Bellido [ 23/Mar/23 ] |
Hi ian.springer@salesforce.com, isIncremental in this case means whether we already have a previous version of the routing information for that namespace, so it is very likely that this condition is true. The other expression, newComparableVersion != previousVersion, just checks whether the new version differs from the previous one. So from a semantic point of view, the logging logic is very similar to the previous one.

The amount of logging that you are mentioning is huge. Would it be possible to see those logs (the ones about refreshes) to understand what's happening? I find it very surprising to generate 50K new versions per minute, since new versions are generated as part of DDL operations (e.g. moveChunk). Thank you!
| Comment by Yuan Fang [ 17/Mar/23 ] |
Hi ian.springer@salesforce.com, thank you for your report. Passing this along for the team to investigate. Please continue to watch for updates. Regards, Yuan