[SERVER-40130] Improve multi-threading Created: 14/Mar/19 Updated: 16/May/19 Resolved: 20/Mar/19
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance, Storage |
| Affects Version/s: | 3.4.17 |
| Fix Version/s: | None |
| Type: | Improvement |
| Priority: | Major - P3 |
| Reporter: | Pichardie kévin |
| Assignee: | Eric Sedor |
| Resolution: | Duplicate |
| Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Participants: | |
| Description |
Hello, at Sendinblue we have been using MongoDB for a long time and have a few clusters running with big datasets. We currently struggle with our cluster using WiredTiger: startup is slow, and replication either cannot complete or takes very long. Here is some information about the sizing of the cluster, with some stats:
I have isolated one shard of a cluster to debug the bottleneck we encounter. I am troubleshooting a secondary, running on a Google Cloud instance with 16 vCPUs at 2.5 GHz and 96 GB of memory, which seems to freeze a lot while starting and replaying oplogs. We identified that the shard appears to run some processes single-threaded, or at least not efficiently multi-threaded.
Startup of the mongod instance takes a very long time, and server statistics show that only one or two vCPUs are effectively working. We found some related information here: https://jira.mongodb.org/browse/SERVER-27700?focusedCommentId=1480933&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-1480933 But since we already run a version that includes those improvements, we should not be struggling with replication ops or evictions. Here is the configuration we currently use:
We have tried some modifications of the WiredTiger configuration based on the document below and the comments in the previous JIRA link: https://source.wiredtiger.com/2.9.0/group__wt.html#gab435a7372679c74261cb62624d953300

Currently my configuration is:

```
net:
```

Is there any setting that would increase parallelism at startup and during the replication process? It seems that some processes are not multi-threaded:

```
shard1-:# ps -T -p 32107
```

During startup we clearly see that the server is stuck here, with one thread at 100% CPU and the rest doing almost nothing:

```
2019-03-14T17:38:24.611+0000 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=40960M,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),verbose=(recovery_progress),file_manager=(close_handle_minimum=10000,close_idle_time=3600,close_scan_interval=10)
```
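For context on the kind of experiment described above: raw WiredTiger options such as the eviction thread counts visible in the `wiredtiger_open` line are usually passed through mongod's engine passthrough setting. A minimal sketch, assuming one wanted to try more eviction threads (the values are illustrative only, not a recommendation from this ticket, and this kind of tuning is generally unsupported):

```
# Hypothetical mongod.conf fragment: pass a raw WiredTiger option through
# storage.wiredTiger.engineConfig.configString (example values only)
storage:
  wiredTiger:
    engineConfig:
      configString: "eviction=(threads_min=8,threads_max=8)"
```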
For reference, the commands we ran on the host:

```
ls
free -mh
```

This step takes 40 minutes and uses only 1 CPU. Can you help with this? I know you will need more info, which I can probably provide.
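A hedged sketch of how the single hot thread can be confirmed during startup (the PID 32107 above is from the reporter's host; substitute the actual mongod PID):

```
# List mongod's threads sorted by CPU usage; one thread near 100% with the
# rest idle matches the behavior described above (replace <pid> accordingly)
ps -T -p <pid> -o spid,pcpu,comm --sort=-pcpu | head -20

# Or watch live, with each thread shown as its own row
top -H -p <pid>
```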
Thanks in advance
| Comments |
| Comment by Pichardie kévin [ 21/Mar/19 ] |

Hello Eric,

Yes, but that case has not been updated since 2016 :s. Should I post an update on it? Could you also share your findings, so we keep a record of the bottleneck and can sum it up on that case?

Best regards, Kévin
| Comment by Eric Sedor [ 20/Mar/19 ] |

At this time I'm going to close this ticket as a duplicate. Can you please see
| Comment by Eric Sedor [ 20/Mar/19 ] |

kpichardie, with these logs we've been able to rule out some other possibilities and currently believe that this is a case of
| Comment by Eric Sedor [ 20/Mar/19 ] |

Thanks Kévin, this helps; we are taking a look and will let you know.
| Comment by Pichardie kévin [ 20/Mar/19 ] |

Hello Eric, I have retried with a normal stop and the time is mostly the same. See the log attached on the secure portal. Let me know if you need more info. Kévin
| Comment by Pichardie kévin [ 20/Mar/19 ] |

Hello Eric,

I stopped the process normally using systemd, but stopping also takes a long time, around 5 minutes. Since the default systemd stop timeout is 5 minutes, I do think mongod was killed before it finished shutting down. I will check.

I will try increasing the timeout and then measure the restart time again (see the sketch below).

Kévin
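A minimal sketch of raising the stop timeout via a systemd drop-in, assuming the unit is named mongod.service (the 1800-second value is an arbitrary example, not a value from this ticket):

```
# Create a drop-in that raises the stop timeout for mongod.service,
# giving mongod time to shut down cleanly instead of being killed
sudo mkdir -p /etc/systemd/system/mongod.service.d
sudo tee /etc/systemd/system/mongod.service.d/timeout.conf <<'EOF'
[Service]
TimeoutStopSec=1800
EOF
sudo systemctl daemon-reload
```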
| Comment by Eric Sedor [ 19/Mar/19 ] |

Hi Kévin, the bulk of the time spent during restart appears to involve the work necessary to recover from an unclean shutdown. Can you clarify how you are stopping the node?
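For reference, a clean shutdown can also be requested through the mongo shell rather than by signaling the process; a generic sketch (host and port are placeholders, not values from this ticket):

```
# Ask mongod to flush data and shut down cleanly (3.4-era mongo shell)
mongo --host localhost --port 27017 admin --eval 'db.shutdownServer()'
```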
| Comment by Pichardie kévin [ 19/Mar/19 ] |

Hello Eric, please find attached the log file from the restart.

Best regards, Kévin
| Comment by Eric Sedor [ 19/Mar/19 ] |

Of course Kévin; I've generated an uploader for you here
| Comment by Pichardie kévin [ 19/Mar/19 ] |

Hello Eric, can I have a link to the secure portal to upload the log information? Sorry, I missed it in my comment, but the node switched to SECONDARY right after STARTUP2:

2019-03-14T23:23:40.186+0000 I REPL [rsSync] transition to SECONDARY

I will try the restart with iostat, but we don't see any limitation on that side. I have uploaded the iostat output for my restart test; can you provide a secure link for the rest of the files? Best regards,
| Comment by Eric Sedor [ 18/Mar/19 ] |

kpichardie, we are tracking some known inefficiencies with high file counts in

It may also be helpful if you can repeat the collection of diagnostic data during the secondary's restart while also running the following (see the sketch after this comment):

This will allow us to get disk metrics for additional periods during the startup which are not available in the current diagnostic data. Thanks in advance!
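The exact command was lost from this export; given the reporter's later reference to iostat, a plausible equivalent for capturing disk metrics during the restart would be (flags and output path are assumptions):

```
# Record extended per-device disk statistics once per second,
# with timestamps, for the duration of the secondary's restart
iostat -xmt 1 > /tmp/iostat-restart.log
```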
| Comment by Pichardie kévin [ 18/Mar/19 ] |

Hello, I have uploaded the diagnostics from the tests I made last Thursday (the 14th) and over the weekend. Basically the problem impacts replication (we frequently have lag) and startup, which is slow. The problem is fairly constant: on replication we see several jumps in the delay, and one CPU is very busy while the others are not overloaded. During the 14th I was also trying some custom WiredTiger configuration, as explained previously, but with no success. The last restart with the classic configuration parameters took a very long time:

2019-03-14T17:38:24.599+0000 I CONTROL [main] ***** SERVER RESTARTED *****

I believe that some processes are single-threaded, which causes slow processing during startup and replication, as one CPU is saturated.
| Comment by Eric Sedor [ 14/Mar/19 ] |

Hello kpichardie, and thanks for the information so far. To best help us understand the behavior you're observing, can you please archive (tar or zip) the $dbpath/diagnostic.data directory (described here) for all nodes during a representative incident and attach it to this ticket? Timestamps for those incidents will help us target our examination of this data. A sketch of such an archive command follows.
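A minimal sketch of creating such an archive, assuming a dbPath of /var/lib/mongodb (substitute the storage.dbPath from your mongod.conf):

```
# Bundle the FTDC diagnostic data into a single archive for upload
tar -czf diagnostic-data-$(hostname).tar.gz -C /var/lib/mongodb diagnostic.data
```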