[SERVER-50904] mongos cost much sy% CPU, all sy% CPU is 100%。All requests from the client avalanche Created: 14/Sep/20  Updated: 03/May/21  Resolved: 03/May/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: y yz Assignee: Edwin Zhou
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File image-2020-09-14-15-01-45-113.png     PNG File image-2020-09-14-15-02-20-989.png     PNG File image-2020-09-14-15-03-04-428.png    
Operating System: ALL
Participants:

 Description   

[root@bjhtxx var]# mongos --version
mongos version v3.6.1
git version: 025d4f4fe61efd1fb6f0005be20cb45a004093d1
OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
allocator: tcmalloc
modules: none
build environment:
distmod: rhel62
distarch: x86_64
target_arch: x86_64

top shows as following:
top - 06:12:58 up 652 days, 16:14, 0 users, load average: 0.60, 0.39, 0.26
Tasks: 714 total, 1 running, 712 sleeping, 0 stopped, 1 zombie
Cpu(s): 2.5%us, 6.1%sy, 0.0%ni, 90.7%id, 0.0%wa, 0.0%hi, 0.6%si, 0.0%st
Mem: 131819804k total, 126707228k used, 5112576k free, 190928k buffers
Swap: 16777724k total, 0k used, 16777724k free, 61473624k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3249 root 20 0 2377m 423m 5744 S 323.2 0.3 270937:13 mongos
5900 root 20 0 2341m 316m 5732 S 62.6 0.2 263301:13 mongos
542 root 20 0 0 0 0 S 3.6 0.0 40:28.79 kswapd1
4 root 20 0 0 0 0 S 2.3 0.0 8869:27 ksoftirqd/0
33 root 20 0 0 0 0 S 2.0 0.0 4411:00 ksoftirqd/7
39893 root 20 0 2850m 34m 8084 S 2.0 0.0 262:43.09 osec_osta_agent
85 root 20 0 0 0 0 S 1.7 0.0 9155:36 ksoftirqd/20
113 root 20 0 0 0 0 S 1.7 0.0 4374:15 ksoftirqd/27
31563 tingfeng 20 0 2379m 43m 4188 S 1.7 0.0 1235:53 node_exporter
9 root 20 0 0 0 0 S 1.3 0.0 7210:33 ksoftirqd/1
13 root 20 0 0 0 0 S 1.3 0.0 6376:45 ksoftirqd/2
25 root 20 0 0 0 0 S 1.3 0.0 5072:24 ksoftirqd/5
89 root 20 0 0 0 0 S 1.3 0.0 7302:00 ksoftirqd/21
105 root 20 0 0 0 0 S 1.3 0.0 5126:01 ksoftirqd/25

top - 06:13:01 up 652 days, 16:14, 0 users, load average: 0.60, 0.39, 0.26
Tasks: 714 total, 1 running, 712 sleeping, 0 stopped, 1 zombie
Cpu(s): 0.7%us, 74.4%sy, 0.0%ni, 24.6%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 131819804k total, 126711196k used, 5108608k free, 190928k buffers
Swap: 16777724k total, 0k used, 16777724k free, 61479556k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3249 root 20 0 2377m 432m 5744 S 3323.0 0.3 270938:54 mongos
5900 root 20 0 2341m 318m 5732 S 777.9 0.2 263301:37 mongos
3002 mongodb 20 0 1568m 386m 5336 S 3.6 0.3 43861:47 mongos
39893 root 20 0 2850m 35m 8084 S 1.7 0.0 262:43.14 osec_osta_agent
31866 mongodb 20 0 62.1g 57g 8232 S 1.0 45.4 74503:48 mongod
24663 root 20 0 15544 1744 920 R 0.7 0.0 346:50.50 top
4 root 20 0 0 0 0 S 0.3 0.0 8869:27 ksoftirqd/0
9 root 20 0 0 0 0 S 0.3 0.0 7210:33 ksoftirqd/1
113 root 20 0 0 0 0 S 0.3 0.0 4374:15 ksoftirqd/27
117 root 20 0 0 0 0 S 0.3 0.0 3765:47 ksoftirqd/28
32913 root 20 0 125m 23m 6720 S 0.3 0.0 678:49.79 agent
1 root 20 0 19356 1336 1012 S 0.0 0.0 1:01.95 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.10 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 4:16.03 migration/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 stopper/0
6 root RT 0 0 0 0 S 0.0 0.0 1:01.38 watchdog/0

top - 06:13:04 up 652 days, 16:14, 0 users, load average: 6.48, 1.61, 0.65
Tasks: 714 total, 1 running, 712 sleeping, 0 stopped, 1 zombie
Cpu(s): 0.8%us, 95.6%sy, 0.0%ni, 3.0%id, 0.0%wa, 0.0%hi, 0.5%si, 0.0%st
Mem: 131819804k total, 126686300k used, 5133504k free, 190936k buffers
Swap: 16777724k total, 0k used, 16777724k free, 61481400k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3249 root 20 0 2377m 453m 5744 S 2719.9 0.4 270940:16 mongos
5900 root 20 0 2341m 318m 5732 S 1024.2 0.2 263302:08 mongos
82 root RT 0 0 0 0 S 2.6 0.0 0:56.15 watchdog/19
3002 mongodb 20 0 1568m 386m 5336 S 2.6 0.3 43861:47 mongos
31866 mongodb 20 0 62.1g 57g 8232 S 2.0 45.4 74503:48 mongod
39893 root 20 0 2850m 34m 8084 S 1.7 0.0 262:43.19 osec_osta_agent
31563 tingfeng 20 0 2379m 44m 4188 S 1.0 0.0 1235:53 node_exporter
45 root 20 0 0 0 0 S 0.7 0.0 208:16.39 ksoftirqd/10
49 root 20 0 0 0 0 S 0.7 0.0 144:14.48 ksoftirqd/11
77 root 20 0 0 0 0 S 0.7 0.0 5:53.06 ksoftirqd/18
153 root 20 0 0 0 0 S 0.7 0.0 8:00.48 ksoftirqd/37
23455 root 20 0 0 0 0 S 0.7 0.0 25:14.61 flush-8:0
24663 root 20 0 15544 1744 920 R 0.7 0.0 346:50.52 top
32913 root 20 0 125m 23m 6720 S 0.7 0.0 678:49.81 agent
4 root 20 0 0 0 0 S 0.3 0.0 8869:27 ksoftirqd/0
9 root 20 0 0 0 0 S 0.3 0.0 7210:33 ksoftirqd/1
13 root 20 0 0 0 0 S 0.3 0.0 6376:45 ksoftirqd/2
17 root 20 0 0 0 0 S 0.3 0.0 5378:13 ksoftirqd/3
21 root 20 0 0 0 0 S 0.3 0.0 4083:40 ksoftirqd/4
25 root 20 0 0 0 0 S 0.3 0.0 5072:24 ksoftirqd/5

top - 06:13:07 up 652 days, 16:14, 0 users, load average: 6.48, 1.61, 0.65
Tasks: 714 total, 1 running, 712 sleeping, 0 stopped, 1 zombie
Cpu(s): 0.6%us, 98.6%sy, 0.0%ni, 0.5%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 131819804k total, 126654344k used, 5165460k free, 190968k buffers
Swap: 16777724k total, 0k used, 16777724k free, 61483900k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3249 root 20 0 2377m 453m 5744 S 2821.5 0.4 270941:42 mongos
5900 root 20 0 2341m 318m 5732 S 963.5 0.2 263302:37 mongos
39893 root 20 0 2850m 34m 8084 S 1.3 0.0 262:43.23 osec_osta_agent
3002 mongodb 20 0 1568m 386m 5336 S 0.7 0.3 43861:47 mongos
24663 root 20 0 15544 1744 920 R 0.7 0.0 346:50.54 top
31866 mongodb 20 0 62.1g 57g 8232 S 0.7 45.4 74503:48 mongod
32913 root 20 0 125m 23m 6720 S 0.7 0.0 678:49.83 agent
35094 tingfeng 20 0 4452m 37m 3148 S 0.7 0.0 267:12.23 tingfeng-cli
4 root 20 0 0 0 0 S 0.3 0.0 8869:27 ksoftirqd/0
9 root 20 0 0 0 0 S 0.3 0.0 7210:33 ksoftirqd/1
65 root 20 0 0 0 0 S 0.3 0.0 8:52.01 ksoftirqd/15
69 root 20 0 0 0 0 S 0.3 0.0 6:57.76 ksoftirqd/16
73 root 20 0 0 0 0 S 0.3 0.0 6:10.23 ksoftirqd/17
121 root 20 0 0 0 0 S 0.3 0.0 3997:21 ksoftirqd/29
133 root 20 0 0 0 0 S 0.3 0.0 83:11.86 ksoftirqd/32
137 root 20 0 0 0 0 S 0.3 0.0 38:57.37 ksoftirqd/33
161 root 20 0 0 0 0 S 0.3 0.0 6:57.32 ksoftirqd/39
3924 root 20 0 18520 916 464 S 0.3 0.0 176:12.67 irqbalance
1 root 20 0 19356 1336 1012 S 0.0 0.0 1:01.95 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.10 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 4:16.03 migration/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 stopper/0
6 root RT 0 0 0 0 S 0.0 0.0 1:01.38 watchdog/0
7 root RT 0 0 0 0 S 0.0 0.0 3:08.17 migration/1

top - 06:13:10 up 652 days, 16:14, 0 users, load average: 10.36, 2.50, 0.94
Tasks: 714 total, 1 running, 712 sleeping, 0 stopped, 1 zombie
Cpu(s): 0.7%us, 98.2%sy, 0.0%ni, 0.8%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 131819804k total, 126639648k used, 5180156k free, 190976k buffers
Swap: 16777724k total, 0k used, 16777724k free, 61486464k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3249 root 20 0 2377m 454m 5744 S 3968.9 0.4 270943:42 mongos
5900 root 20 0 2341m 318m 5732 S 1654.0 0.2 263303:27 mongos
31866 mongodb 20 0 62.1g 57g 8232 S 1.6 45.4 74503:48 mongod
32913 root 20 0 125m 23m 6720 S 1.6 0.0 678:49.88 agent
39893 root 20 0 2850m 34m 8084 S 1.6 0.0 262:43.28 osec_osta_agent
3002 mongodb 20 0 1568m 386m 5336 S 1.0 0.3 43861:47 mongos
24663 root 20 0 15544 1744 920 R 1.0 0.0 346:50.57 top
31563 tingfeng 20 0 2379m 43m 4188 S 1.0 0.0 1235:53 node_exporter
13 root 20 0 0 0 0 S 0.3 0.0 6376:45 ksoftirqd/2
21 root 20 0 0 0 0 S 0.3 0.0 4083:40 ksoftirqd/4
29 root 20 0 0 0 0 S 0.3 0.0 2711:07 ksoftirqd/6
37 root 20 0 0 0 0 S 0.3 0.0 3768:46 ksoftirqd/8
57 root 20 0 0 0 0 S 0.3 0.0 32:41.22 ksoftirqd/13
89 root 20 0 0 0 0 S 0.3 0.0 7302:00 ksoftirqd/21
93 root 20 0 0 0 0 S 0.3 0.0 6423:23 ksoftirqd/22
105 root 20 0 0 0 0 S 0.3 0.0 5126:01 ksoftirqd/25
117 root 20 0 0 0 0 S 0.3 0.0 3765:47 ksoftirqd/28
125 root 20 0 0 0 0 S 0.3 0.0 205:44.81 ksoftirqd/30
10831 root 20 0 4054m 325m 2568 S 0.3 0.3 269:43.52 java
1 root 20 0 19356 1336 1012 S 0.0 0.0 1:01.95 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.10 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 4:16.03 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 8869:27 ksoftirqd/0

top - 06:13:13 up 652 days, 16:14, 0 users, load average: 14.42, 3.47, 1.27
Tasks: 714 total, 1 running, 712 sleeping, 0 stopped, 1 zombie
Cpu(s): 0.6%us, 98.3%sy, 0.0%ni, 0.7%id, 0.0%wa, 0.0%hi, 0.4%si, 0.0%st
Mem: 131819804k total, 126635428k used, 5184376k free, 190976k buffers
Swap: 16777724k total, 0k used, 16777724k free, 61489364k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3249 root 20 0 2377m 453m 5744 S 5460.0 0.4 270946:27 mongos
5900 root 20 0 2341m 318m 5732 S 2108.6 0.2 263304:31 mongos
31866 mongodb 20 0 62.1g 57g 8236 S 8.9 45.4 74503:49 mongod
3002 mongodb 20 0 1568m 386m 5336 S 5.0 0.3 43861:47 mongos
39893 root 20 0 2850m 34m 8084 S 2.0 0.0 262:43.34 osec_osta_agent
24663 root 20 0 15544 1744 920 R 0.7 0.0 346:50.59 top
32913 root 20 0 125m 23m 6720 S 0.7 0.0 678:49.90 agent
17 root 20 0 0 0 0 S 0.3 0.0 5378:13 ksoftirqd/3
29 root 20 0 0 0 0 S 0.3 0.0 2711:07 ksoftirqd/6
41 root 20 0 0 0 0 S 0.3 0.0 4026:59 ksoftirqd/9
49 root 20 0 0 0 0 S 0.3 0.0 144:14.49 ksoftirqd/11
53 root 20 0 0 0 0 S 0.3 0.0 70:02.14 ksoftirqd/12
61 root 20 0 0 0 0 S 0.3 0.0 15:21.65 ksoftirqd/14
77 root 20 0 0 0 0 S 0.3 0.0 5:53.07 ksoftirqd/18
81 root 20 0 0 0 0 S 0.3 0.0 5:40.61 ksoftirqd/19
97 root 20 0 0 0 0 S 0.3 0.0 5416:48 ksoftirqd/23
109 root 20 0 0 0 0 S 0.3 0.0 2723:55 ksoftirqd/26
149 root 20 0 0 0 0 S 0.3 0.0 8:44.48 ksoftirqd/36
157 root 20 0 0 0 0 S 0.3 0.0 7:17.41 ksoftirqd/38

top - 06:13:19 up 652 days, 16:14, 0 users, load average: 18.55, 4.51, 1.61
Tasks: 714 total, 1 running, 712 sleeping, 0 stopped, 1 zombie
Cpu(s): 0.7%us, 98.6%sy, 0.0%ni, 0.3%id, 0.0%wa, 0.0%hi, 0.4%si, 0.0%st
Mem: 131819804k total, 126641620k used, 5178184k free, 190984k buffers
Swap: 16777724k total, 0k used, 16777724k free, 61493904k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3249 root 20 0 2377m 453m 5744 S 8128.1 0.4 270950:34 mongos
5900 root 20 0 2341m 318m 5732 S 3047.9 0.2 263306:03 mongos
53 root 20 0 0 0 0 S 2.0 0.0 70:02.20 ksoftirqd/12
39893 root 20 0 2850m 34m 8084 S 1.6 0.0 262:43.44 osec_osta_agent
31563 tingfeng 20 0 2379m 43m 4188 S 1.0 0.0 1235:53 node_exporter
31866 mongodb 20 0 62.1g 57g 8236 S 1.0 45.4 74503:49 mongod
3002 mongodb 20 0 1568m 386m 5336 S 0.7 0.3 43861:47 mongos
24663 root 20 0 15544 1744 920 R 0.7 0.0 346:50.62 top
29241 tingfeng 20 0 4764m 44m 3460 S 0.7 0.0 293:15.35 ntp_exporter
9 root 20 0 0 0 0 S 0.3 0.0 7210:33 ksoftirqd/1
41 root 20 0 0 0 0 S 0.3 0.0 4026:59 ksoftirqd/9
45 root 20 0 0 0 0 S 0.3 0.0 208:16.40 ksoftirqd/10
65 root 20 0 0 0 0 S 0.3 0.0 8:52.02 ksoftirqd/15
69 root 20 0 0 0 0 S 0.3 0.0 6:57.77 ksoftirqd/16
97 root 20 0 0 0 0 S 0.3 0.0 5416:48 ksoftirqd/23
137 root 20 0 0 0 0 S 0.3 0.0 38:57.38 ksoftirqd/33

disk IO is very low, memory is enough

load and cpu monitor is :

net as following:

disk io as following:



 Comments   
Comment by Edwin Zhou [ 03/May/21 ]

Hi 1147952115@qq.com

We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information that we've requested above and we will reopen the ticket.

Best,
Edwin

Comment by Edwin Zhou [ 21/Apr/21 ]

Hi 1147952115@qq.com,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please attempt the procedure that Eric suggested, and archive (tar or zip) the following:

  • the mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)
    You may attach these to the same secure upload portal.

Thanks,
Edwin

Comment by Eric Sedor [ 29/Mar/21 ]

Hello again 1147952115@qq.com. I am very sorry to have lost track of this ticket.

We've had a chance to examine the diagnostic data you provided. We don't see evidence of a bug yet, as the CPU system time appears correlated with memory management activity by the operating system.

That said, we would be interested if this issue reproduced for you on MongoDB 4.0. Are you able to do this? If so, then for each node in an impacted replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to that link:

  • the mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)
Comment by y yz [ 25/Dec/20 ]

@Eric Sedor

I have send the file to you。because the mongos diagnostic have deleted,the sharding cluster always have the problem,so we Down the cluster。

but another repl cluster, Occasionally there are problems with high sys loads, I send one node mongod diagnostic to you, the Jitter time point about "2020/12/22-12:04" and "2020/12/22-12:24"。

Looking forward to you reply

Comment by Eric Sedor [ 15/Dec/20 ]

1147952115@qq.com you can upload a zipped or tarred diagnostic.data direcories to this ticket or to this secure upload portal. Please just be sure to include the diagnosic.data directory from the mongos experiencing problems as well as for each node in one of the shards of the sharded cluster.

Comment by y yz [ 15/Dec/20 ]

@Eric Sedor

v3.6.x all have the same problem。

Comment by y yz [ 15/Dec/20 ]

@Eric Sedor , can you tell me that how to push the $dbpath/diagnostic.data? and where to push?

Problems occur on a daily basis and are very serious。

Looking forward to reply

Comment by y yz [ 15/Dec/20 ]

can you tell me how to push the $dbpath/diagnostic.data?

Comment by Eric Sedor [ 16/Sep/20 ]

Hello,

For both the mongos and a representative shard, could you please archive (tar or zip) the $dbpath/diagnostic.data directories (the contents are described here) and attach them to this ticket?

Additionally, can you try again to attach corresponding log files? The previous link does not seem to have worked.

Sincerely,
Eric

Comment by y yz [ 14/Sep/20 ]

the storage node, mongod is ok, no slow log.

there have too much connections,connect and close frequncy:
[root@xx yyz]# grep "2020-09-11T06:13:0" mongodb_911061.log | grep "received client metadata from" |wc -l
8977
[root@xx yyz]# grep "2020-09-11T06:13:0" mongodb_911061.log | grep "end connection" |wc -l
13354
[root@xx y yz]#

mongos log as following:
[^mongodb_911061.log]

Comment by y yz [ 14/Sep/20 ]

@kaloian.manassiev
can you help me? thanks

Generated at Thu Feb 08 05:23:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.