[SERVER-29149] SHARDING CatalogCacheLoader + Balancer filling up to 2gb of log in ~4h Created: 11/May/17  Updated: 24/Aug/17  Resolved: 17/Jul/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.4.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Stephane Marquis Assignee: Kaloian Manassiev
Resolution: Duplicate Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-28418 make the split command on mongod retu... Closed
Operating System: ALL
Steps To Reproduce:

Would really like to know :|

Participants:

 Description   

We have deployed a new production set-up using the same configuration that we have for the other one, however for this one the mongod configuration server keeps spamming us with these message non stop:

2017-05-11T17:12:18.303+0000 I SHARDING [Balancer] Refreshing chunks for collection (collection) based on version 1|0||590245312abfdf91c6d415fe

2017-05-11T17:12:18.303+0000 I SHARDING [CatalogCacheLoader-25] Refresh for collection (Collection) took 0 ms and found version 1|0||590245312abfdf91c6d415fe

This lead us to having to rotate and delete all log after a 1 hour period. it is started in quiet mode with verbosity set to 0. The chunk size have been modified to 128mb and it's been like that since we've set-up the server and I can't figure out why it's happening only on this 3 machine clusters

Thanks !
Stéphane



 Comments   
Comment by Kaloian Manassiev [ 17/Jul/17 ]

Thanks for the confirmation, oleg@evergage.com. I am going to close this ticket as duplicate of SERVER-28418.

smarquis, if this issue persists for you for any reason, please open a new ticket and include a snippet from the relevant logs so we can investigate further.

Best regards,
-Kal.

Comment by Oleg Rekutin [ 13/Jul/17 ]

kaloian.manassiev I can confirm that after the upgrade, the log volume has decreased from 12MB/min to 0.10MB/min on the primary config server that was affected.

You can consider this resolved.

Comment by Kaloian Manassiev [ 07/Jul/17 ]

Hi smarquis, oleg@evergage.com,

MongoDB 3.4.6 has been released. Would it be possible to upgrade to this version and let me know if you are still seeing abnormal log generation?

In the meantime we are looking for ways to reduce the amount of refresh information logging without sacrificing our ability to diagnose problems post factum.

Thank you very much for your help and patience.

Best regards,
-Kal.

Comment by Oleg Rekutin [ 26/Jun/17 ]

I think that SERVER-28418 is the root cause of massively spammy balancer logs. Can't wait for 3.4.6..

Comment by Oleg Rekutin [ 10/Jun/17 ]

If I stop the balancer, the heavy refreshing behavior stops and log spam goes away. I think this might be caused by SERVER-29423, so it is possible that you might see this only for clusters that have many collections that need to be balanced.

Comment by Stephane Marquis [ 12/May/17 ]

I've checked the last 3 rounds of log and the message are always :

2017-05-12T20:58:27.298+0000 I SHARDING [Balancer] Refreshing chunks for collection (Collection) based on version 3|1||5908c72e40f2ee4a9952df08
2017-05-12T20:58:27.299+0000 I SHARDING [CatalogCacheLoader-77] Refresh for collection (Collection) took 0 ms and found version 3|1||5908c72e40f2ee4a9952df08

The version are varying by collection but that's it :-/ i'm not seeing any error

Comment by Kaloian Manassiev [ 12/May/17 ]

I am wondering whether the extra logging on the 3.4.4 cluster could be due to some error. Is the message for different collections or always the same by any chance?

Comment by Stephane Marquis [ 12/May/17 ]

Weirdly enough, on our other cluster (3.4.2) there's no message being sent at the 10 seconds interval. I've validated the config server configuration file and there's nothing related to log verbosity or systemLog.quiet in there, they are started like that:

mongod 10988 26.3 53.6 40478940 35370896 ? Sl Apr16 9899:15 /usr/bin/mongod -f /etc/mongod.conf
mongod 11075 0.7 3.0 3068788 2024848 ? Sl Apr16 265:14 /usr/bin/mongod -f /etc/mongod_config.conf

There's not the quiet switch either, could it be related to one cluster being on CentOS 6 and the other one being on CentOS 7 ?

Thanks !

Comment by Kaloian Manassiev [ 12/May/17 ]

I believe you need to use systemLog.quiet in the config file not as a parameter on the command line.

Comment by Stephane Marquis [ 12/May/17 ]

Hi Kaloian,

Could there be a bug with the quiet switch ? All the nodes are started as follow:

mongod 7111 8.8 52.6 36635156 34640692 ? Sl May08 540:05 /usr/bin/mongod --quiet -f /etc/mongod.conf run
mongod 13080 1.3 0.1 592004 94944 ? Sl May10 42:54 /usr/bin/mongos --quiet -f /etc/mongos.conf
mongod 13143 0.8 0.6 1260896 413132 ? Sl May10 28:47 /usr/bin/mongod --quiet -f /etc/mongod_config.conf run

This cluster was a new one (no upgrade) while the other one have more collection but the log partition has ~60g of log so it wasn't an issue on it. I have to leave for a few but I'll check it when I come back to see if the same amount of log is getting outputted.

Comment by Kaloian Manassiev [ 12/May/17 ]

Hi Stephane,

Sorry that you are experiencing this problem. We will use this ticket to figure out a more efficient way to log refreshes on the config server and post an update.

For now, unfortunately there is no way to disable these messages without disabling all the logging on the node. However, I would strongly recommend against this because it severely limits the ability to diagnose problems.

In the mean time, can you please tell me whether this is a brand new cluster setup or you upgraded an existing cluster to 3.4.4? The reason I am asking is that I checked the code and the same logging is present in 3.4.2 as well. Is it that your older cluster just has a smaller number of sharded collections?

Best regards,
-Kal.

Comment by Stephane Marquis [ 12/May/17 ]

Hello Kaloian,

You are right our old cluster is 3.4.2 which is why we don't get them.

There is no error shown on this particular cluster we have a total of :

434 collections, unfortunately in almost all collection name there is sensitive information and going throught the log to change them all would take a lot of time. After further validation, once we clear all the log it last for a day (clean-up at 6 am and we get the warning that the log partition is full around 5h am) which would be ~1.7gb of log during that period (4mb * 434 collections ~1736mb) on a 2gb partition.

Is there any way we can turn them off ?

Thanks !

Comment by Kaloian Manassiev [ 12/May/17 ]

Hi smarquis,

The messages you have listed were introduced in version 3.4.4 as a result of making the metadata refresh happen asynchronously and block other operations. As a result we seem to have doubled the amount of log lines around each of the balancer's rounds (the first message was always there).

I suspect that your old cluster is running 3.4.2 and the new cluster is running 3.4.4 that's why you are seeing it only on one of the clusters.

The balancer round messages should show up once every 10 seconds and would be 2 per collection, so the logging impact per collection would be about 4MB per day. This is not great, but not anywhere close to 4GB.

Are you seeing them show up more often and what is the number of collections that you have? Are there any errors shown? Would it be possible to attach a snippet from one of these logs?

Best regards,
-Kal.

Generated at Thu Feb 08 04:20:01 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.