[SERVER-53899] TaskExecutor CPU fills up instantly Created: 20/Jan/21  Updated: 08/Jan/24  Resolved: 22/Jan/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Xin Wang Assignee: Bruce Lucas (Inactive)
Resolution: Duplicate Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File conns.png     PNG File image-2021-01-20-17-44-34-407.png     PNG File image-2021-01-21-00-16-26-376.png     PNG File image-2021-01-22-12-50-18-538.png     File mongos.diagnostic.data.tar.gz    
Issue Links:
Duplicate
duplicates SERVER-53853 Large buildup of mongos to mongod con... Closed
Related
related to SERVER-54504 Disable taskExecutorPoolSize for Linux Closed
Operating System: ALL
Steps To Reproduce:

6 mongos, each with taskExecutorPoolSize=4 and 8 cores

3 shards, each configured as PSSSSH (1 primary, 4 secondaries, 1 hidden)
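
For reference, a minimal sketch of what the mongos configuration for this setup might look like (assuming a YAML config file; the configDB hosts and port below are placeholders, not taken from this cluster):

# mongos.conf (illustrative sketch)
net:
  port: 27017
sharding:
  configDB: configRS/cfg1:27019,cfg2:27019,cfg3:27019   # placeholder config server hosts
setParameter:
  taskExecutorPoolSize: 4   # the pool count used in this test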

 

Use YCSB to put the cluster under load:

./bin/ycsb run mongodb -P workloads/custom -s -threads 48 -p mongodb.url=xxx

 

YCSB workload:

recordcount=100000
operationcount=100000000
workload=site.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0.7
updateproportion=0.1
insertproportion=0.2
scanproportion=0
maxscanlength=1000
readmodifywriteproportion=0
insertorder=order
writeallfields=false
mongodb.readPreference=secondary
mongodb.maxconnections=200
requestdistribution=uniform
fieldlength=30
fieldcount=10
hotspotdatafraction=0.2
hotspotopnfraction=0.8
maxexecutiontime=600
table=pressure

Participants:

 Description   

We have a sharded cluster with 6 mongos and 3 shards.

Each mongos uses 8 cores (limited by cgroup), and taskExecutorPoolSize is 4.

Each shard has 1 primary, 4 secondaries, and 1 hidden member.

 

We use YCSB to pressure test it with 48 threads.

Everything mostly goes OK, but a steep drop (in both CPU and opcounters) appears on mongos once or twice during a 10-minute pressure test.

When everything is running normally, the TaskExecutor threads' CPU usage looks like:
S 35.0 0.0 8:35.77 TaskExe.rPool-1
S 35.0 0.0 8:52.79 TaskExe.rPool-2
S 35.0 0.0 8:32.40 TaskExe.rPool-3
S 30.0 0.0 9:18.88 TaskExe.rPool-0

 

When the steep drop happens, one TaskExecutor's CPU fills up, like:

R 99.9 0.0 8:54.62 TaskExe.rPool-2
S 10.0 0.0 9:19.35 TaskExe.rPool-0
S 10.0 0.0 8:36.19 TaskExe.rPool-1
R 5.0 0.0 8:32.80 TaskExe.rPool-3

 

pstack result for the TaskExecutor thread whose CPU is filled up:

Thread 85 (Thread 0x7f01c6203700 (LWP 129527)):
#0 0x00007f01cedb86d0 in sha256_block_data_order_avx2 ()
#1 0x00007f01cedb9935 in SHA256_Update ()
#2 0x00007f01ced58817 in HMAC_Init_ex ()
#3 0x00007f01cec04e0f in mongo::SHA256BlockTraits::computeHmac( xxx )

 

perf top result when the CPU saturation happens: about 7.32% in OPENSSL_cleanse.
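
For anyone reproducing this, the per-thread CPU, stack, and profile data above can be gathered with standard Linux tools (a sketch; <mongos-pid> is a placeholder for the mongos process id):

top -H -p <mongos-pid>     # per-thread CPU, shows the TaskExecutorPool-* threads
pstack <mongos-pid>        # stack traces of all threads, as pasted above
perf top -p <mongos-pid>   # live profile; OPENSSL_cleanse was ~7.32% here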

 

Slow query logs also appear on mongos, while the secondaries' logs contain no slow queries. The reason is that the connection pool of the TaskExecutor whose CPU is filled up has many requests waiting to be sent; its connection pool stats log looks like:

Updating controller for host:port with State: { requests: 19, ready: 0, pending: 2, active: 1, isExpired: false }

The number of requests keeps growing, and pending is always 2. I think this is just a consequence of the TaskExecutor CPU saturation.
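
The same pool state can also be sampled directly on the mongos with the connPoolStats command (a sketch, run in the mongo shell; the per-host counters it reports, such as inUse / available / refreshing / created, roughly correspond to the active / ready / pending counters in the log line above):

// run against the mongos
db.adminCommand({ connPoolStats: 1 })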

 

There are 2 things worth mentioning:

  • when we use PSH for every shard, everything goes OK; the mongos CPU just fills up, but there is no steep drop
  • when we use taskExecutorPoolSize=8, YCSB with 48 threads mostly goes well (still an occasional steep drop), but 96 threads still has problems

 

How do I solve this problem?

 



 Comments   
Comment by Bruce Lucas (Inactive) [ 22/Jan/21 ]

Thank you for the update. I'll close this ticket as a duplicate of SERVER-53853 as you are already watching that ticket. Please feel free to re-open this or open a new ticket if SERVER-53853 doesn't resolve your problem.

Comment by Xin Wang [ 22/Jan/21 ]

I also think this is the same issue as SERVER-53853. In our initial pressure test, with taskExecutorPoolSize = 1, when the slow logs and CPU saturation happened we saw mongos connect to all the secondaries with continuously rising connection counts, which is the same as SERVER-53853. A small difference is that here all the secondaries show the rise, while in that ticket only one secondary shows rising connections.

 

As for the test with taskExecutorPoolSize = 1 and ShardingTaskExecutorPoolMaxSize & ShardingTaskExecutorPoolMinSize = 20, I have already completed it.

All 4 tests use taskExecutorPoolSize = 1 and ShardingTaskExecutorPoolMaxSize & ShardingTaskExecutorPoolMinSize = 20:

  • For the first test, the machine running YCSB has 8 cores (all the previous pressure tests also used 8 cores), and everything works well (threads: 48)
  • I found from the trend graph I provided earlier that performance did not improve as the number of threads increased, so I upgraded the YCSB machine to 16 cores for the other 3 tests (threads: 48 / 64 / 96), and we benefited from it

Summary of the YCSB results for all tests using ShardingTaskExecutorPoolMaxSize & ShardingTaskExecutorPoolMinSize = 20:

  • taskExecutorPoolSize = 4 with YCSB 8Core : Operations - 38324742 , AverageLatency(us) - 734 , 99thPercentileLatency(us) - 1877
  • taskExecutorPoolSize = 1 with YCSB 8Core : Operations - 38200525 , AverageLatency(us) - 736 , 99thPercentileLatency(us) - 1893
  • taskExecutorPoolSize = 1 with YCSB 16Core : Operations - 39668980 , AverageLatency(us) - 709 , 99thPercentileLatency(us) - 1856

As we can see, there is not much difference (maybe a small benefit from 16 cores).

 

I will pay close attention to SERVER-53853 and look forward to a solution to the problem.

Comment by Bruce Lucas (Inactive) [ 21/Jan/21 ]

No problem wangxin201492@gmail.com - I think your test with pool sizes gives us the confidence we needed that this ticket is the same issue.

Regarding SERVER-53853, it is currently under active investigation, and we'll update the ticket when we have more information.

Regarding settings in your production environment, we believe that setting taskExecutorPoolSize to a value larger than 1 is not useful starting in 4.2, and you might get better performance by leaving it at 1. Also I wonder if that would make the workaround of setting min and max pool size more feasible in your environment, as you would have one single pool instead of multiple smaller pools that might not individually be optimally sized.
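
A sketch of what that combination might look like in the mongos YAML config file (the parameter names are the real server parameters discussed in this ticket; the value 20 is only the one used in these tests and may need tuning):

setParameter:
  taskExecutorPoolSize: 1              # single executor pool, as recommended for 4.2+
  ShardingTaskExecutorPoolMinSize: 20
  ShardingTaskExecutorPoolMaxSize: 20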

Comment by Xin Wang [ 21/Jan/21 ]

Sorry, I have no environment to test it on 4.4

 

I see that SERVER-53853 is under investigation. Is there any idea of the root cause of the problem, or any further plan for it?

Setting ShardingTaskExecutorPoolMaxSize & ShardingTaskExecutorPoolMinSize to the same value works well for the pressure test, but it may not be applicable in the production environment~

Hoping for a proper solution on 4.2.

Comment by Xin Wang [ 20/Jan/21 ]

Thanks for your reply.

That's helpful for this issue. I followed your suggestion and set ShardingTaskExecutorPoolMaxSize and ShardingTaskExecutorPoolMinSize both to 20, and the sharded cluster runs smoothly!

 

The thread counts for the three tests shown in the picture are 48 / 64 / 96; there is no steep drop as described above.

 

I'll reply with more after experimenting further with ShardingTaskExecutorPoolMaxSize / ShardingTaskExecutorPoolMinSize / taskExecutorPoolSize. As for testing it on 4.4, that is somewhat difficult for me, but I'll try my best to do it.

 

Comment by Bruce Lucas (Inactive) [ 20/Jan/21 ]

This may be related to SERVER-53853 as there are several common elements:

Version is 4.2.10

The test is doing secondary reads:

mongodb.readPreference=secondary

There are slow queries on mongos, but not mongod, indicating a bottleneck in mongos.

A-B and C-D are two periods when queuing is occurring in mongos, indicated by "connections active", meaning that mongos would be seeing slow queries. During those periods, we see likely unnecessary connections being created to the secondaries: "totalInUse" is considerably larger than "connections active", suggesting that mongos incorrectly thinks some connections are in use when they can't be.

The perf CPU reported may be associated with establishing connections.

wangxin201492@gmail.com, would you be able to try one or both of the following to test this hypothesis?

  • Set ShardingTaskExecutorPoolMaxSize and ShardingTaskExecutorPoolMinSize both to the same value, perhaps 20 based on what I see in your test, although you may have to experiment with this a bit. We think that doing this mitigates SERVER-53853 (see the sketch after this list).
  • Try the test on 4.4, as we don't think it is impacted by SERVER-53853.
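
A sketch of how the first suggestion could be applied at mongos startup (the startup form is shown because whether these parameters can also be changed at runtime via setParameter depends on the server version; the config path is a placeholder and 20 is only a starting point, as noted above):

mongos --config /etc/mongos.conf \
       --setParameter ShardingTaskExecutorPoolMinSize=20 \
       --setParameter ShardingTaskExecutorPoolMaxSize=20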
Comment by Xin Wang [ 20/Jan/21 ]

The attachment "mongos.diagnostic.data.tar.gz" is the diagnostic data from one of the mongos nodes for the latest pressure test, between 2021-01-20 18:15 (+8) and 2021-01-20 18:25 (+8), and there are some slow logs in it (2021-01-20T18:22:07.532+0800).
