[SERVER-53899] TaskExecutor CPU fills up instantly Created: 20/Jan/21 Updated: 08/Jan/24 Resolved: 22/Jan/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Xin Wang | Assignee: | Bruce Lucas (Inactive) |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Steps To Reproduce: | 6 mongos, each limited to 8 cores and running with taskExecutorPoolSize=4; 3 shards, each a PSSSSH replica set (primary, 4 secondaries, 1 hidden).
Use YCSB to apply load: ./bin/ycsb run mongodb -P workloads/custom -s -threads 48 -p mongodb.url=xxx
YCSB workload: recordcount=100000 (a sketch of this setup follows below) |
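A minimal sketch of the reproduction setup for reference. The hostnames, ports, config-server connection string, and workload file path are placeholders; only taskExecutorPoolSize=4, the 48 client threads, and recordcount=100000 come from the report.

```
# Start each of the 6 mongos with 4 task-executor connection pools
# (config-server string and ports are placeholders).
mongos --configdb cfgRS/cfg1:27019,cfg2:27019,cfg3:27019 \
       --port 27017 \
       --setParameter taskExecutorPoolSize=4

# Drive load from YCSB with 48 client threads against the mongos tier.
./bin/ycsb run mongodb -P workloads/custom -s -threads 48 \
    -p mongodb.url="mongodb://mongos1:27017,mongos2:27017/ycsb" \
    -p recordcount=100000
```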
| Participants: | |
| Description |
|
We have a sharded cluster with 6 mongos and 3 shards. Each mongos uses 8 cores (limited via cgroup), and taskExecutorPoolSize is 4. Each shard has 1 primary, 4 secondaries, and 1 hidden member.
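For context, a rough sketch of how an 8-core CPU cap like this might be applied with cgroups; the cgroup name, mount point, and cgroup v1 layout are assumptions, not values from this report.

```
# Cap mongos at 8 cores' worth of CPU time (800ms per 100ms period), cgroup v1.
sudo mkdir -p /sys/fs/cgroup/cpu/mongos
echo 100000 | sudo tee /sys/fs/cgroup/cpu/mongos/cpu.cfs_period_us
echo 800000 | sudo tee /sys/fs/cgroup/cpu/mongos/cpu.cfs_quota_us
# Move the mongos process(es) into the cgroup (one PID per write).
for pid in $(pidof mongos); do
  echo "$pid" | sudo tee /sys/fs/cgroup/cpu/mongos/cgroup.procs
done
```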
We use YCSB to pressure test it with 48 threads. Normally everything is fine, but once or twice during a 10-minute pressure test a steep drop (in both CPU and opcounters) appears on mongos.
When everything is fine, the CPU usage looks like the attached screenshot.
When the steep drop happens, one TaskExecutor thread's CPU is saturated, e.g. (from top): R 99.9 0.0 8:54.62 TaskExe.rPool-2
pstack output for the TaskExecutor thread whose CPU is saturated (truncated): Thread 85 (Thread 0x7f01c6203700 (LWP 129527)):
perf top output while the CPU was saturated shows about 7.32% in OPENSSL_cleanse.
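For anyone reproducing this, the diagnostics quoted above can be gathered roughly as follows (a sketch; it assumes a single mongos process on the host).

```
# Show per-thread CPU; in the report, TaskExe.rPool-2 sits at ~99.9%.
top -H -p "$(pidof -s mongos)"

# Dump user-space stacks of all mongos threads (as in the pstack output above).
pstack "$(pidof -s mongos)"

# Sample on-CPU symbols; the report shows ~7.32% in OPENSSL_cleanse.
perf top -p "$(pidof -s mongos)"
```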
Slow-query log entries also appear on mongos, while the secondaries' logs contain no slow queries. The reason is that the connection pool of the saturated TaskExecutor has many requests queued to be sent; its pool stats log line is: Updating controller for host:port with State: { requests: 19, ready: 0, pending: 2, active: 1, isExpired: false }. The requests count keeps growing, and pending stays at 2. I think this is just a consequence of the TaskExecutor CPU being saturated.
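The per-host pool state quoted above can also be inspected on a live mongos with the connPoolStats command; a minimal sketch, with host and port as placeholders.

```
mongo --host mongos1 --port 27017 --quiet --eval '
  // Task-executor connection pool statistics (aggregated; "pools" and
  // "hosts" in the full output break the counts down further).
  var s = db.adminCommand({connPoolStats: 1});
  printjson({totalInUse: s.totalInUse,
             totalAvailable: s.totalAvailable,
             totalRefreshing: s.totalRefreshing});
'
```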
There are 2 things worth mentioning:
How do I solve this problem?
|
| Comments |
| Comment by Bruce Lucas (Inactive) [ 22/Jan/21 ] |
|
Thank you for the update. I'll close this ticket as a duplicate of the linked issue. |
| Comment by Xin Wang [ 22/Jan/21 ] |
|
I also think this is the same issue as the one you linked.
Regarding the test with taskExecutorPoolSize = 1 and ShardingTaskExecutorPoolMaxSize/ShardingTaskExecutorPoolMinSize = 20: I have already done it.
All 4 tests use taskExecutorPoolSize = 1 and ShardingTaskExecutorPoolMaxSize/ShardingTaskExecutorPoolMinSize = 20.
Summary of the YCSB results for all tests using ShardingTaskExecutorPoolMaxSize/ShardingTaskExecutorPoolMinSize = 20:
As we can see, there is not much difference (maybe a small benefit from 16 cores).
I will pay close attention to the linked ticket. |
| Comment by Bruce Lucas (Inactive) [ 21/Jan/21 ] |
|
No problem wangxin201492@gmail.com - I think your test with pool sizes gives us the confidence we needed that this ticket is the same issue. Regarding settings in your production environment, we believe that setting taskExecutorPoolSize to a value larger than 1 is not useful starting in 4.2, and you might get better performance by leaving it at 1. I also wonder if that would make the workaround of setting min and max pool size more feasible in your environment, as you would have one single pool instead of multiple smaller pools that might not individually be optimally sized. |
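A sketch of that recommendation as mongos startup parameters. The config-server string is a placeholder, and the pool size of 20 is just the value from the reporter's tests, not a general recommendation.

```
# Single task-executor pool (the default on 4.2+), with the connection pool
# pinned to a fixed size so connections are not created/destroyed under load.
mongos --configdb cfgRS/cfg1:27019,cfg2:27019,cfg3:27019 \
       --setParameter taskExecutorPoolSize=1 \
       --setParameter ShardingTaskExecutorPoolMinSize=20 \
       --setParameter ShardingTaskExecutorPoolMaxSize=20
```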
| Comment by Xin Wang [ 21/Jan/21 ] |
|
Sorry, I have no environment to test it on 4.4.
I found that setting ShardingTaskExecutorPoolMaxSize and ShardingTaskExecutorPoolMinSize to the same value works well for the pressure test, but it may not be applicable in the production environment. Hoping for a proper solution on 4.2. |
| Comment by Xin Wang [ 20/Jan/21 ] |
|
Thanks for your reply; that's helpful for this issue. I followed your suggestion and set both ShardingTaskExecutorPoolMaxSize and ShardingTaskExecutorPoolMinSize to 20, and the sharded cluster runs smoothly! The thread counts for the three tests shown in the picture are 48 / 64 / 96, and the steep drop mentioned above no longer appears.
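For completeness, one way to confirm the values a running mongos is actually using (a sketch; the host is a placeholder).

```
mongo --host mongos1 --port 27017 --quiet --eval '
  // Read back the relevant parameters from the running mongos.
  printjson(db.adminCommand({
    getParameter: 1,
    taskExecutorPoolSize: 1,
    ShardingTaskExecutorPoolMinSize: 1,
    ShardingTaskExecutorPoolMaxSize: 1
  }));
'
```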
I'll report back after experimenting more with ShardingTaskExecutorPoolMaxSize/ShardingTaskExecutorPoolMinSize/taskExecutorPoolSize. Testing on 4.4 is difficult for me, but I'll try my best to do it.
|
| Comment by Bruce Lucas (Inactive) [ 20/Jan/21 ] |
|
This may be related to the linked ticket. The version is 4.2.10, and the test is doing secondary reads:
There are slow queries on mongos, but not mongod, indicating a bottleneck in mongos.
A-B and C-D are two periods when queuing is occurring in mongos, indicated by "connections active", meaning that mongos would be seeing slow queries. During those periods we see likely unnecessary connections being created to the secondaries: "totalInUse" is considerably larger than "connections active", suggesting that mongos incorrectly thinks some connections are in use when they can't be. The CPU reported by perf may be associated with establishing those connections. wangxin201492@gmail.com, would you be able to try one or both of the following to test this hypothesis?
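A rough way to watch for this pattern on a live mongos (a sketch; the host and polling interval are assumptions): compare the task-executor pools' totalInUse against the number of active client operations over time.

```
# totalInUse climbing well above the active client connection count would
# match the hypothesis that mongos opens connections it cannot actually use.
while true; do
  mongo --host mongos1 --port 27017 --quiet --eval '
    var pool = db.adminCommand({connPoolStats: 1});
    var conn = db.serverStatus().connections;
    print(new Date().toISOString(),
          "totalInUse=" + pool.totalInUse,
          "activeConnections=" + conn.active);
  '
  sleep 1
done
```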
|
| Comment by Xin Wang [ 20/Jan/21 ] |
|
The attachment "mongos.diagnostic.data.tar.gz" contains diagnostic data from one mongos for the latest pressure test, between 2021-01-20 18:15 (+8) and 2021-01-20 18:25 (+8); there are some slow-query log entries in that window (e.g. 2021-01-20T18:22:07.532+0800). |