[SERVER-74076] The totals of query config server 's config.chunks collection very high Created: 16/Feb/23  Updated: 13/Jul/23  Resolved: 13/Jul/23

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 4.4.14
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: dongyu si Assignee: Yuan Fang
Resolution: Done Votes: 0
Labels: configsvr, shard
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OS: Linux 3.10.0-957.el7.x86_64 #1 SMP Mon Dec 7 11:30:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Version: 4.4.14


Attachments: PNG File Screenshot 2023-05-04 at 1.22.43 PM.png     PNG File image-2023-02-16-09-01-29-431.png     PNG File image-2023-02-16-09-05-53-558.png     PNG File image-2023-02-16-09-07-18-903.png     PNG File image-2023-02-26-13-35-41-142.png     PNG File image-2023-02-26-13-35-58-748.png     PNG File image-2023-02-26-13-39-43-841.png     PNG File image-2023-02-26-13-40-01-897.png     PNG File image-2023-02-26-13-40-15-644.png     PNG File image-2023-02-28-08-55-45-972.png    
Issue Links:
Related
Operating System: ALL
Participants:

 Description   

Hi, I have deployed a sharded cluster with 3 mongos, 3 config servers, and 3 shards, and I found that the total number of queries against the config server's config.chunks collection is positively correlated with the number of update operations. I expected the chunk metadata to be cached, but config.chunks appears to be queried on every operation. Could you tell me what the reason might be, or whether this is a bug?
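
One way to put a number on this symptom is to compare two counter snapshots for the collection. The sketch below assumes the snapshots have the shape of the `top` admin command's output (`totals.<namespace>.queries.count`); the ticket does not say which tool produced the original graphs, and the function name `query_rate` and the sample numbers are hypothetical.

```python
def query_rate(before, after, namespace="config.chunks", interval_s=60.0):
    """Return queries per second on `namespace` between two top-style snapshots."""
    start = before["totals"][namespace]["queries"]["count"]
    end = after["totals"][namespace]["queries"]["count"]
    return (end - start) / interval_s


# Hypothetical snapshots taken 60 seconds apart (the numbers are made up).
snap_a = {"totals": {"config.chunks": {"queries": {"time": 1000, "count": 12000}}}}
snap_b = {"totals": {"config.chunks": {"queries": {"time": 1900, "count": 15000}}}}

rate = query_rate(snap_a, snap_b, interval_s=60.0)
print(rate)  # 50.0
```

Sampling the same rate while the update workload is running and while it is stopped would show whether the two really move together.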

 



 Comments   
Comment by Yuan Fang [ 13/Jul/23 ]

We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Comment by Yuan Fang [ 20/Jun/23 ]

Hi 335612970@qq.com,

We still need additional information to diagnose the problem. If this is still an issue for you,  could you please provide the logs and FTDC of the mongos node through the Amphora link? Thank you.

Regards,
Yuan

Comment by Yuan Fang [ 22/May/23 ]

Hi 335612970@qq.com,

I apologize for the delayed response. I've reviewed the FTDC of the config servers and can see that the increased query flow resembles the first screenshot you provided (left section). I did not find anything abnormal in the data.

However, based on the first screenshot you provided (right section), I can see there is a significant workload of delete operations during the incident period. Delete operations change the shard versions, which may result in frequent queries to the config server to obtain the most up-to-date information. Therefore, it is puzzling why the queries to the config server are correlated with the delete+update workload shown in the screenshot. To gain a comprehensive understanding of the workload on the clusters, could you also please provide the logs and FTDC of the mongos node through the Amphora link? The mongos.log can help identify routing errors and understand the need for refreshing routing information.
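
The link between version changes and config server queries can be pictured with a toy model. This is only an illustration of a version-checked cache, not MongoDB's actual routing code: the class `ToyRouter`, the dict standing in for the config server, and the operation counts are all hypothetical, and the staleness check is simplified to a direct version comparison.

```python
class ToyRouter:
    """Toy router that caches a routing table and refreshes only when stale."""

    def __init__(self, config):
        self.config = config          # dict standing in for the config server
        self.cached_version = None    # version of the cached routing table
        self.config_queries = 0       # number of reads against the config store

    def _refresh(self):
        # Each refresh counts as one query against the config store.
        self.config_queries += 1
        self.cached_version = self.config["version"]

    def route(self):
        # Simplification: compare versions directly instead of learning about
        # staleness from a shard's response.
        if self.cached_version != self.config["version"]:
            self._refresh()
        return self.cached_version


config = {"version": 1}
router = ToyRouter(config)

# 100 operations with stable metadata: one initial load, then cache hits.
for _ in range(100):
    router.route()
stable_queries = router.config_queries

# 100 operations where the metadata version changes before every operation.
for _ in range(100):
    config["version"] += 1
    router.route()
churn_queries = router.config_queries - stable_queries

print(stable_queries, churn_queries)  # 1 100
```

In this model the cache only helps while the version stays put. If something in the workload keeps changing the version, the number of config reads tracks the number of operations, which is the pattern the mongos logs would help confirm or rule out.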

Additionally, apart from the observed performance issue related to the queries to the config server, have you noticed any other issues or concerns regarding the performance of the clusters?

Thank you sincerely for your cooperation and understanding.

Regards,
Yuan

Comment by dongyu si [ 28/Feb/23 ]

Hi Yuan Fang, 

I am sorry for uploading the files with the full path. I have uploaded them again using only the file name. Could you please check whether the files have been uploaded? Thank you very much. The following pictures show the upload results.

 

Comment by Yuan Fang [ 27/Feb/23 ]

Hi 335612970@qq.com,

Thank you for your efforts in uploading the requested datasets. Unfortunately, I am still unable to see the files on my end. However, the screenshots you provided have been helpful in explaining why these files were not actually uploaded. I apologize for any confusion caused by the instructions provided for uploading using the upload portal, but it appears that providing the full path of the file will not work as expected (I have tested uploading with the full path to the file on my end, and encountered a similar issue). I would suggest either:

  • navigating to the current location of the file and replacing <filename> with just the file name (e.g. "config_2.tgz") without the preceding path,
  • or alternatively, uploading the dataset directly to this Jira ticket if there is no sensitive data involved.

Please let me know if you made another attempt to upload, and thank you so much for your time!

Regards,
Yuan

Comment by dongyu si [ 26/Feb/23 ]

Hi Yuan Fang,

I have uploaded the logs and diagnostic.data again. The following pictures show the upload results.

Comment by Yuan Fang [ 24/Feb/23 ]

Hi 335612970@qq.com,

Unfortunately, I couldn't find any files in the folder linked to the upload portal. Could you try to upload them to the upload portal again? Please make sure that the files are fully uploaded; you should see the progress after initiating the command. Feel free to let me know if you encounter any issues while uploading.

Regards,
Yuan

Comment by dongyu si [ 22/Feb/23 ]

Hi Yuan Fang,

Thank you for your response. I have uploaded the logs and diagnostic.data of the five config server members. I appreciate your confirmation and cooperation.

Comment by Yuan Fang [ 21/Feb/23 ]

Hi 335612970@qq.com,

Thank you for your report. Based on what you've described, it seems like you're experiencing high query traffic to the config server, and you suspect that the routing table cache is not being used effectively, or may not be used at all. In order to investigate this further, we require more context. I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.

For each node in the replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to that link:

  • the mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)
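
The archiving step could be scripted roughly as follows. This is a minimal sketch using Python's standard `tarfile` module; the function name `archive_node` and every path in the demo are hypothetical placeholders, so substitute each node's real log file and dbpath.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_node(log_file, dbpath, out_file):
    """Bundle one node's mongod log and diagnostic.data directory into a .tgz."""
    log_file, dbpath, out_file = Path(log_file), Path(dbpath), Path(out_file)
    with tarfile.open(out_file, "w:gz") as tar:
        # Store entries under short names so the archive has no absolute paths.
        tar.add(log_file, arcname=log_file.name)
        tar.add(dbpath / "diagnostic.data", arcname="diagnostic.data")
    return out_file.name  # bare file name, without the preceding path


if __name__ == "__main__":
    # Demo against throwaway files; real runs would point at the node's paths.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "db" / "diagnostic.data").mkdir(parents=True)
        (root / "db" / "diagnostic.data" / "metrics.sample").write_bytes(b"\x00")
        (root / "mongod.log").write_text("sample log line\n")
        print(archive_node(root / "mongod.log", root / "db", root / "node_1.tgz"))
```

Running the function once per node yields one archive per replica set member, each covering that node's log and diagnostic data.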

Regards,
Yuan

Comment by dongyu si [ 17/Feb/23 ]

When our business service is stopped, the query ops on the config server drop to 40 ~ 60 ops/s, which seems normal. I found that our business service issues a high volume of upsert operations without the shard key. Could that be the reason for the high query ops on the config server? And if so, why are these lookups not served from the cache?
