|
Thanks 601290552@qq.com, as I no longer see queued operations or active read/write operations being stuck at 128, I will go ahead and close this ticket. If you need further assistance with tuning, I encourage you to ask our community by posting on the MongoDB Community Forums or on Stack Overflow with the mongodb tag.
|
|
I have uploaded the files to https://10gen-httpsupload.s3.amazonaws.com/upload_forms/d9451bbe-fdc6-46dd-8f9b-097a8024f9e6.html
|
|
Hi 601290552@qq.com, yes I have reviewed the uploaded data and it was very helpful.
Given the pattern and shape of the operations over time, could you please clarify whether this workload is generated by a load generator / load scripts or is a real production workload?
Now, with profiling disabled, we have better visibility into the next bottleneck: the WiredTiger eviction server.
To be more specific, the eviction server struggles to find pages to evict because the current workload simply overwhelms eviction (in other words, the cache is being held in a state that is not suitable for the running workload, or for the eviction server to find pages to evict). I believe the issue is with how the current write operations are being executed: there are so many of them that it is death by a thousand paper cuts. These many short write operations are thrashing the WiredTiger cache so fast (the entire cache is currently turned over within 10 minutes) that we eventually start having trouble identifying which data needs to stay in the cache and which data needs to be evicted (determining how many of the pages to keep in the queue while discarding the rest).
One thing that you might consider implementing immediately to improve performance is using [bulk() write operations|https://docs.mongodb.com/manual/reference/method/Bulk/index.html], for which the WiredTiger storage engine is optimized. Once this is implemented, we can review the fresh data to determine the impact.
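To illustrate the batching idea (a minimal sketch, not your application code: the document shape, field names, and the batch size of 1000 are made-up assumptions), grouping many small writes into bulk requests looks roughly like this; with PyMongo you would wrap each document in InsertOne/UpdateOne and pass each batch to collection.bulk_write():

```python
# Sketch: batch many small write operations into larger bulk requests.
# The batch size and document shape are illustrative. With PyMongo you
# would wrap each dict in InsertOne/UpdateOne and submit each batch via
# collection.bulk_write(batch) instead of one request per document.

def batched(ops, batch_size=1000):
    """Yield successive batches of at most batch_size operations."""
    for i in range(0, len(ops), batch_size):
        yield ops[i:i + batch_size]

# 25,000 tiny inserts become 25 bulk requests instead of 25,000 round trips.
ops = [{"no": n, "status": "lost"} for n in range(25000)]
batches = list(batched(ops))
print(len(batches))     # 25
print(len(batches[0]))  # 1000
```

The point is to replace tens of thousands of single-document round trips with a few dozen bulk requests, which is the write pattern the comment above says WiredTiger is optimized for.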
Thanks,
Dima
|
|
Dmitry Agranat:
Is there any result from it yet?
|
|
Hi,
I carried out the following steps:
- Enable readConcern:majority
- Leave profiling off
- Let it run for a day
I have now uploaded srvdb1ytodiagnostic.tar.gz to https://10gen-httpsupload.s3.amazonaws.com/upload_forms/d9451bbe-fdc6-46dd-8f9b-097a8024f9e6.html
|
|
Dmitry Agranat:
OK, I will take care of it.
|
|
601290552@qq.com, let's take a step back. As I've mentioned earlier, you do not need to disable readConcern:majority for PSS.
Next steps:
- Enable readConcern:majority
- Leave profiling off
- Let it run for a day
After a day, upload the fresh data from srvdb1.yto.cloud for us to review (no need to upload data from the rest of the servers).
|
|
I have now disabled profiling and disabled readConcern:majority for PSS.
But when the collection size grows larger than the cache size, MongoDB performance becomes slow.
13027 12969 12972 *0 805 1045|0 20.1% 74.6% 0 101G 80.6G 0|55 18|128 27.1m 54.1m 330 myReplicaSet10003 PRI Apr 22 16:51:23.174
7222 7246 7310 *0 687 888|0 20.1% 74.7% 0 101G 80.6G 0|168 6|128 15.5m 30.4m 330 myReplicaSet10003 PRI Apr 22 16:51:24.175
6158 6146 6150 *0 615 751|0 20.1% 74.7% 0 101G 80.6G 0|93 9|128 13.1m 25.7m 330 myReplicaSet10003 PRI Apr 22 16:51:25.174
16235 16235 16246 *0 607 607|0 20.1% 74.6% 0 101G 80.6G 8|95 4|128 33.2m 67.3m 330 myReplicaSet10003 PRI Apr 22 16:51:26.175
7237 7259 7235 *0 658 834|0 20.1% 74.7% 0 101G 80.6G 0|153 22|128 15.4m 30.3m 330 myReplicaSet10003 PRI Apr 22 16:51:27.173
6952 6922 6948 *0 669 871|0 20.1% 74.7% 0 101G 80.6G 0|169 6|128 14.8m 29.3m 330 myReplicaSet10003 PRI Apr 22 16:51:28.175
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
11225 11189 11087 *0 924 1110|0 20.1% 74.7% 0 101G 80.6G 0|17 9|17 23.5m 46.4m 330 myReplicaSet10003 PRI Apr 22 16:51:29.175
6028 6095 6174 *0 587 732|0 20.1% 74.7% 0 101G 80.6G 0|126 3|128 13.1m 26.1m 330 myReplicaSet10003 PRI Apr 22 16:51:30.174
6757 6741 6767 *0 686 877|0 20.1% 74.7% 0 101G 80.6G 0|163 6|125 14.4m 28.2m 330 myReplicaSet10003 PRI Apr 22 16:51:31.174
10319 10313 10271 *0 825 1011|0 20.1% 74.7% 0 101G 80.6G 0|87 5|128 21.7m 43.2m 330 myReplicaSet10003 PRI Apr 22 16:51:32.175
10705 10747 10788 *0 826 1057|0 20.1% 74.7% 0 101G 80.6G 0|172 1|128 22.7m 45.1m 330 myReplicaSet10003 PRI Apr 22 16:51:33.173
5583 5599 5481 *0 580 735|0 20.1% 74.8% 0 101G 80.6G 1|172 1|128 11.9m 23.3m 330 myReplicaSet10003 PRI Apr 22 16:51:34.174
5939 5901 6032 *0 731 939|0 20.1% 74.8% 0 101G 80.6G 0|172 1|128 13.0m 25.4m 330 myReplicaSet10003 PRI Apr 22 16:51:35.173
12048 12033 12020 *0 827 939|0 20.1% 74.8% 0 101G 80.6G 0|170 3|128 25.1m 50.0m 330 myReplicaSet10003 PRI Apr 22 16:51:36.177
4139 4144 4142 *0 417 508|0 20.1% 74.8% 0 101G 80.7G 0|172 3|128 8.84m 17.5m 330 myReplicaSet10003 PRI Apr 22 16:51:37.173
23168 23139 23059 *0 776 856|0 20.1% 74.7% 0 101G 80.7G 0|0 3|2 47.2m 95.5m 330 myReplicaSet10003 PRI Apr 22 16:51:38.175
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
24463 24478 24469 *0 866 990|0 20.1% 74.5% 0 101G 80.7G 0|10 2|128 50.2m 101m 330 myReplicaSet10003 PRI Apr 22 16:51:39.174
25418 25464 25514 *0 832 955|0 20.1% 74.4% 0 101G 80.7G 0|76 8|128 52.1m 105m 330 myReplicaSet10003 PRI Apr 22 16:51:40.179
25251 25156 25174 *0 769 858|0 20.1% 74.3% 0 101G 80.7G 0|103 9|128 51.5m 104m 330 myReplicaSet10003 PRI Apr 22 16:51:41.177
21833 21878 21773 *0 660 832|0 20.1% 74.2% 0 101G 80.7G 0|0 6|111 44.6m 90.4m 330 myReplicaSet10003 PRI Apr 22 16:51:42.178
22951 23009 23112 *0 761 812|0 20.1% 74.1% 0 101G 80.7G 3|141 2|128 47.1m 95.0m 330 myReplicaSet10003 PRI Apr 22 16:51:43.179
24354 24242 24248 *0 782 865|0 20.1% 74.0% 0 101G 80.7G 0|46 6|125 49.7m 100m 330 myReplicaSet10003 PRI Apr 22 16:51:44.178
24045 24092 24029 *0 811 989|0 20.1% 73.9% 0 101G 80.7G 0|5 3|4 49.2m 99.7m 330 myReplicaSet10003 PRI Apr 22 16:51:45.175
23638 23691 23754 *0 789 946|0 20.1% 73.8% 0 101G 80.7G 0|112 11|127 48.6m 97.4m 330 myReplicaSet10003 PRI Apr 22 16:51:46.178
25115 25078 24981 *0 815 881|0 20.1% 73.7% 0 101G 80.7G 0|0 7|51 51.2m 104m 330 myReplicaSet10003 PRI Apr 22 16:51:47.174
26252 26254 26275 *0 828 906|0 20.1% 73.6% 0 101G 80.7G 0|0 14|90 53.7m 108m 330 myReplicaSet10003 PRI Apr 22 16:51:48.178
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
25256 25184 25229 *0 830 891|0 20.1% 73.5% 0 101G 80.7G 0|0 5|8 51.6m 104m 330 myReplicaSet10003 PRI Apr 22 16:51:49.174
24670 24747 24712 *0 859 1044|0 20.1% 73.4% 0 101G 80.7G 3|111 2|127 50.7m 102m 330 myReplicaSet10003 PRI Apr 22 16:51:50.176
21882 21854 21766 *0 762 853|0 20.1% 73.4% 0 101G 80.7G 0|2 7|1 44.7m 90.4m 330 myReplicaSet10003 PRI Apr 22 16:51:51.173
20324 20338 20328 *0 693 857|0 20.1% 73.3% 0 101G 80.7G 0|57 6|10 41.7m 83.6m 330 myReplicaSet10003 PRI Apr 22 16:51:52.175
20351 20419 20523 *0 697 835|0 20.1% 73.2% 0 101G 80.7G 0|114 5|128 41.9m 84.3m 330 myReplicaSet10003 PRI Apr 22 16:51:53.173
14293 14246 14168 *0 477 574|0 20.1% 73.2% 0 101G 80.7G 0|22 19|128 29.1m 58.2m 330 myReplicaSet10003 PRI Apr 22 16:51:54.175
18930 18904 19000 *0 506 546|0 20.1% 73.1% 0 101G 80.7G 0|88 1|128 38.6m 78.9m 330 myReplicaSet10003 PRI Apr 22 16:51:55.175
20093 20202 20237 *0 698 797|0 20.2% 73.1% 0 101G 80.7G 0|133 4|128 41.3m 83.2m 330 myReplicaSet10003 PRI Apr 22 16:51:56.175
19442 19380 19381 *0 604 712|0 20.2% 73.0% 0 101G 80.7G 0|148 5|128 39.7m 79.8m 330 myReplicaSet10003 PRI Apr 22 16:51:57.174
19581 19642 19480 *0 658 787|0 20.2% 72.9% 0 101G 80.7G 0|108 36|128 40.0m 81.3m 330 myReplicaSet10003 PRI Apr 22 16:51:58.176
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
19595 19500 19506 *0 687 810|0 20.2% 72.9% 0 101G 80.7G 23|0 5|128 40.0m 80.8m 330 myReplicaSet10003 PRI Apr 22 16:51:59.178
19993 20051 20235 *0 745 876|0 20.2% 72.8% 0 101G 80.7G 0|154 6|128 41.3m 82.9m 330 myReplicaSet10003 PRI Apr 22 16:52:00.175
19920 19886 19841 *0 742 891|0 20.2% 72.8% 0 101G 80.7G 0|138 3|128 40.8m 82.6m 330 myReplicaSet10003 PRI Apr 22 16:52:01.174
19765 19791 19827 *0 755 908|0 20.2% 72.7% 0 101G 80.7G 0|121 15|128 40.6m 81.0m 330 myReplicaSet10003 PRI Apr 22 16:52:02.175
19753 19731 19666 *0 685 803|0 20.2% 72.7% 0 101G 80.7G 7|63 9|128 40.4m 82.1m 331 myReplicaSet10003 PRI Apr 22 16:52:03.175
18967 18983 18996 *0 775 918|0 20.1% 72.6% 0 101G 80.7G 5|98 5|128 39.1m 78.7m 331 myReplicaSet10003 PRI Apr 22 16:52:04.177
18620 18567 18601 *0 685 874|0 20.1% 72.6% 0 101G 80.7G 1|166 2|128 38.2m 77.0m 331 myReplicaSet10003 PRI Apr 22 16:52:05.174
17357 17397 17283 *0 691 812|0 20.2% 72.5% 0 101G 80.7G 0|142 33|128 35.6m 71.8m 331 myReplicaSet10003 PRI Apr 22 16:52:06.175
3953 3957 3944 *0 150 175|0 20.1% 72.5% 0 101G 80.7G 4|0 4|30 8.03m 16.6m 331 myReplicaSet10003 PRI Apr 22 16:52:07.174
18969 18956 19018 *0 736 874|0 20.2% 72.5% 0 101G 80.7G 3|50 3|121 39.0m 77.7m 331 myReplicaSet10003 PRI Apr 22 16:52:08.175
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
18841 18845 18802 *0 805 935|0 20.1% 72.4% 0 101G 80.7G 0|11 68|128 38.8m 78.4m 331 myReplicaSet10003 PRI Apr 22 16:52:09.173
|
|
|
Hello,
Was profiling disabled?
Yes, I disabled it.
|
|
Hi 601290552@qq.com,
Did you have a chance to review my last comment? Was profiling disabled? Just posting the last part for your convenience:
Please disable profiling, let the workload run for a day or two and upload the fresh data from srvdb1.yto.cloud for us to review (no need to upload data from the rest of the servers). This might not solve all the observed issues but hopefully will remove one of the main bottlenecks.
Thanks,
Dima
|
|
Hello?
After changing PSA to PSS and enabling readConcern:majority, why is the queue so long?
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
*0 *0 *0 *0 0 12|0 22.9% 63.8% 0 107G 90.3G 0|439 42|128 2.35k 826k 634 myReplicaSet10003 PRI Apr 21 18:57:04.627
*0 *0 *0 *0 0 12|0 22.9% 63.7% 0 107G 90.3G 0|439 42|128 2.34k 825k 634 myReplicaSet10003 PRI Apr 21 18:57:05.628
*0 *0 *0 *0 0 10|0 22.9% 63.7% 0 107G 90.3G 0|439 42|128 2.12k 825k 634 myReplicaSet10003 PRI Apr 21 18:57:06.627
*0 *0 *0 *0 0 10|0 22.9% 63.7% 0 107G 90.3G 0|439 42|128 2.27k 824k 634 myReplicaSet10003 PRI Apr 21 18:57:07.627
*0 *0 *0 *0 0 10|0 22.9% 63.7% 0 107G 90.3G 0|439 42|128 2.27k 824k 634 myReplicaSet10003 PRI Apr 21 18:57:08.628
*0 *0 *0 *0 0 17|0 22.9% 63.7% 0 107G 90.3G 0|439 43|128 3.36k 830k 635 myReplicaSet10003 PRI Apr 21 18:57:09.627
*0 *0 *0 *0 0 10|0 22.9% 63.7% 0 107G 90.3G 0|439 43|128 2.27k 825k 632 myReplicaSet10003 PRI Apr 21 18:57:10.627
*0 1 *0 *0 0 20|0 22.9% 63.7% 0 107G 90.3G 0|439 44|128 4.31k 867k 634 myReplicaSet10003 PRI Apr 21 18:57:11.627
*0 *0 *0 *0 0 12|0 22.9% 63.7% 0 107G 90.3G 0|439 44|128 2.60k 828k 635 myReplicaSet10003 PRI Apr 21 18:57:12.627
*0 *0 *0 *0 0 16|0 22.9% 63.7% 0 107G 90.3G 0|439 44|128 3.39k 866k 635 myReplicaSet10003 PRI Apr 21 18:57:13.627
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
*0 *0 *0 *0 0 11|0 22.9% 63.7% 0 107G 90.3G 0|439 44|128 2.35k 827k 635 myReplicaSet10003 PRI Apr 21 18:57:14.627
*0 *0 *0 *0 0 12|0 22.9% 63.7% 0 107G 90.3G 0|439 44|128 2.27k 828k 635 myReplicaSet10003 PRI Apr 21 18:57:15.627
*0 *0 *0 *0 0 9|0 22.9% 63.7% 0 107G 90.3G 0|439 44|128 2.11k 825k 635 myReplicaSet10003 PRI Apr 21 18:57:16.628
*0 *0 *0 *0 0 11|0 22.9% 63.7% 0 107G 90.3G 0|439 44|128 2.27k 827k 635 myReplicaSet10003 PRI Apr 21 18:57:17.627
*0 *0 *0 *0 0 10|0 22.9% 63.7% 0 107G 90.3G 0|439 44|128 2.27k 826k 635 myReplicaSet10003 PRI Apr 21 18:57:18.627
*0 *0 *0 *0 0 12|0 22.9% 63.7% 0 107G 90.3G 0|439 44|128 2.43k 828k 635 myReplicaSet10003 PRI Apr 21 18:57:19.626
*0 *0 *0 *0 0 10|0 22.8% 63.7% 0 107G 90.3G 0|439 44|128 2.27k 826k 635 myReplicaSet10003 PRI Apr 21 18:57:20.627
*0 *0 *0 *0 0 11|0 22.8% 63.7% 0 107G 90.3G 0|439 44|128 2.19k 827k 635 myReplicaSet10003 PRI Apr 21 18:57:21.627
*0 *0 *0 *0 0 11|0 22.8% 63.7% 0 107G 90.3G 0|439 44|128 2.26k 827k 635 myReplicaSet10003 PRI Apr 21 18:57:22.628
*0 *0 *0 *0 0 16|0 22.8% 63.7% 0 107G 90.3G 0|439 44|128 3.39k 866k 635 myReplicaSet10003 PRI Apr 21 18:57:23.627
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
*0 *0 *0 *0 0 12|0 22.8% 63.7% 0 107G 90.3G 0|439 44|128 2.40k 828k 635 myReplicaSet10003 PRI Apr 21 18:57:24.627
*0 *0 *0 *0 0 13|0 22.8% 63.7% 0 107G 90.3G 0|439 44|128 2.32k 828k 635 myReplicaSet10003 PRI Apr 21 18:57:25.627
*0 *0 *0 *0 0 9|0 22.8% 63.7% 0 107G 90.3G 0|439 44|128 2.11k 825k 635 myReplicaSet10003 PRI Apr 21 18:57:26.628
*0 *0 *0 *0 0 11|0 22.8% 63.7% 0 107G 90.3G 0|439 44|128 2.27k 827k 635 myReplicaSet10003 PRI Apr 21 18:57:27.627
*0 *0 *0 *0 0 11|0 22.8% 63.7% 0 107G 90.3G 0|439 44|128 2.27k 827k 635 myReplicaSet10003 PRI Apr 21 18:57:28.627
*0 *0 *0 *0 0 12|0 22.8% 63.7% 0 107G 90.3G 0|439 44|128 2.43k 828k 635 myReplicaSet10003 PRI Apr 21 18:57:29.626
*0 *0 *0 *0 0 10|0 22.8% 63.6% 0 107G 90.3G 0|439 44|128 2.27k 826k 635 myReplicaSet10003 PRI Apr 21 18:57:30.628
*0 *0 *0 *0 0 11|0 22.8% 63.6% 0 107G 90.3G 0|439 44|128 2.19k 827k 635 myReplicaSet10003 PRI Apr 21 18:57:31.628
*0 *0 *0 *0 0 15|0 22.8% 63.6% 0 107G 90.3G 0|439 45|128 2.96k 831k 635 myReplicaSet10003 PRI Apr 21 18:57:32.627
*0 *0 *0 *0 0 26|0 22.8% 63.6% 0 107G 90.3G 0|439 45|128 5.77k 917k 636 myReplicaSet10003 PRI Apr 21 18:57:33.627
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
*0 *0 *0 *0 0 12|0 22.8% 63.6% 0 107G 90.3G 0|439 45|128 2.35k 829k 636 myReplicaSet10003 PRI Apr 21 18:57:34.626
*0 *0 *0 *0 0 11|0 22.8% 63.6% 0 107G 90.3G 0|439 45|128 2.26k 827k 636 myReplicaSet10003 PRI Apr 21 18:57:35.627
*0 *0 *0 *0 0 10|0 22.8% 63.6% 0 107G 90.3G 0|439 45|128 2.11k 827k 636 myReplicaSet10003 PRI Apr 21 18:57:36.627
*0 *0 *0 *0 0 10|0 22.8% 63.6% 0 107G 90.3G 0|439 45|128 2.27k 827k 636 myReplicaSet10003 PRI Apr 21 18:57:37.628
*0 *0 *0 *0 0 11|0 22.8% 63.6% 0 107G 90.3G 0|439 45|128 2.27k 828k 636 myReplicaSet10003 PRI Apr 21 18:57:38.627
*0 *0 *0 *0 0 12|0 22.8% 63.6% 0 107G 90.3G 0|439 45|128 2.48k 828k 636 myReplicaSet10003 PRI Apr 21 18:57:39.627
*0 *0 *0 *0 0 10|0 22.8% 63.6% 0 107G 90.3G 0|439 45|128 2.27k 827k 636 myReplicaSet10003 PRI Apr 21 18:57:40.627
*0 *0 *0 *0 0 11|0 22.8% 63.6% 0 107G 90.3G 0|439 45|128 2.24k 828k 636 myReplicaSet10003 PRI Apr 21 18:57:41.628
*0 *0 *0 *0 0 12|0 22.8% 63.6% 0 107G 90.3G 0|439 45|128 2.27k 829k 636 myReplicaSet10003 PRI Apr 21 18:57:42.627
*0 *0 *0 *0 0 15|0 22.8% 63.6% 0 107G 90.3G 0|439 45|128 3.38k 866k 636 myReplicaSet10003 PRI Apr 21 18:57:43.628
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
*0 *0 *0 *0 0 12|0 22.8% 63.6% 0 107G 90.3G 0|439 45|128 2.35k 829k 636 myReplicaSet10003 PRI Apr 21 18:57:44.626
*0 *0 *0 *0 0 11|0 22.7% 63.6% 0 107G 90.3G 0|439 45|128 2.26k 827k 636 myReplicaSet10003 PRI Apr 21 18:57:45.628
*0 *0 *0 *0 0 9|0 22.7% 63.6% 0 107G 90.3G 0|439 45|128 2.11k 827k 636 myReplicaSet10003 PRI Apr 21 18:57:46.628
*0 *0 *0 *0 0 10|0 22.7% 63.6% 1 107G 90.3G 0|439 45|128 2.27k 827k 636 myReplicaSet10003 PRI Apr 21 18:57:47.628
*0 19 5 *0 0 24|0 22.7% 63.5% 0 107G 90.3G 0|471 6|128 36.9k 1.01m 632 myReplicaSet10003 PRI Apr 21 18:57:48.628
*0 *0 *0 *0 0 13|0 22.7% 63.5% 0 107G 90.3G 0|471 7|128 2.65k 889k 631 myReplicaSet10003 PRI Apr 21 18:57:49.627
*0 *0 *0 *0 0 10|0 22.7% 63.5% 0 107G 90.3G 0|471 7|128 2.27k 815k 631 myReplicaSet10003 PRI Apr 21 18:57:50.628
*0 *0 *0 *0 0 11|0 22.7% 63.5% 0 107G 90.3G 0|471 7|128 2.19k 817k 631 myReplicaSet10003 PRI Apr 21 18:57:51.627
*0 *0 *0 *0 0 11|0 22.7% 63.5% 0 107G 90.3G 0|471 7|128 2.27k 817k 631 myReplicaSet10003 PRI Apr 21 18:57:52.628
|
|
|
I have now converted PSA to PSS. The queue:
[root@srvdb1 mongod10003]# /var/lib/mongodb-mms-automation/mongodb-linux-x86_64-4.0.16/bin/mongostat --host 127.0.0.1:10003
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
25856 25826 25891 *0 157 304|0 20.3% 65.6% 0 107G 91.2G 0|342 5|128 52.2m 74.5m 632 myReplicaSet10003 PRI Apr 21 18:38:38.481
23212 23123 22885 *0 191 392|0 20.0% 65.3% 0 107G 91.2G 0|0 12|30 46.5m 63.7m 632 myReplicaSet10003 PRI Apr 21 18:38:39.476
35826 35804 35800 *0 247 448|0 20.2% 65.5% 0 107G 91.2G 0|0 1|0 72.3m 98.6m 632 myReplicaSet10003 PRI Apr 21 18:38:40.475
26407 26515 26700 *0 254 482|0 20.3% 65.6% 0 107G 91.2G 3|219 5|128 53.8m 72.3m 632 myReplicaSet10003 PRI Apr 21 18:38:41.476
28443 28442 28285 *0 184 243|0 20.2% 65.6% 0 107G 91.2G 0|0 3|14 57.1m 78.9m 632 myReplicaSet10003 PRI Apr 21 18:38:42.475
30288 30404 30463 *0 322 626|0 20.3% 65.6% 0 107G 91.2G 0|295 6|128 61.6m 83.2m 632 myReplicaSet10003 PRI Apr 21 18:38:43.478
30380 30270 30158 *0 289 525|0 20.3% 65.6% 0 107G 91.2G 4|0 3|71 61.1m 83.6m 632 myReplicaSet10003 PRI Apr 21 18:38:44.475
16017 16107 16311 *0 188 362|0 19.8% 65.4% 0 107G 91.2G 0|372 2|128 32.8m 43.6m 632 myReplicaSet10003 PRI Apr 21 18:38:45.475
32029 31990 33204 *0 253 489|0 20.0% 65.3% 0 107G 91.2G 0|344 4|126 64.7m 89.4m 632 myReplicaSet10003 PRI Apr 21 18:38:46.476
35638 35403 35323 *0 356 654|0 20.2% 65.6% 0 107G 91.2G 0|0 11|43 71.7m 98.1m 632 myReplicaSet10003 PRI Apr 21 18:38:47.477
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
24123 24311 24392 *0 311 640|0 20.3% 65.7% 0 107G 91.2G 0|261 3|128 49.4m 66.8m 632 myReplicaSet10003 PRI Apr 21 18:38:48.475
26948 27041 27038 *0 250 406|0 20.2% 65.6% 0 107G 91.2G 0|326 3|128 54.6m 74.6m 632 myReplicaSet10003 PRI Apr 21 18:38:49.476
30815 30843 30787 *0 313 595|0 20.3% 65.7% 0 107G 91.2G 0|464 4|128 62.4m 84.8m 632 myReplicaSet10003 PRI Apr 21 18:38:50.475
34782 34697 34583 *0 363 707|0 20.2% 65.6% 0 107G 91.2G 1|126 6|126 70.1m 95.8m 632 myReplicaSet10003 PRI Apr 21 18:38:51.476
21161 21121 21266 *0 220 451|0 20.3% 65.7% 0 107G 91.2G 0|270 2|128 43.0m 58.4m 632 myReplicaSet10003 PRI Apr 21 18:38:52.476
33797 33766 33682 *0 356 742|0 20.3% 65.7% 0 107G 91.2G 0|66 5|126 68.3m 93.1m 633 myReplicaSet10003 PRI Apr 21 18:38:53.476
25138 25234 25290 *0 304 583|0 20.3% 65.7% 0 107G 91.2G 0|342 1|128 51.3m 69.7m 633 myReplicaSet10003 PRI Apr 21 18:38:54.474
24853 24841 24894 *0 302 589|0 20.3% 65.7% 0 107G 91.2G 0|326 2|128 50.4m 68.2m 633 myReplicaSet10003 PRI Apr 21 18:38:55.474
27201 27338 27144 *0 327 719|0 20.3% 65.7% 0 107G 91.2G 0|472 1|128 55.3m 75.8m 633 myReplicaSet10003 PRI Apr 21 18:38:56.474
28100 27988 27999 *0 299 564|0 20.3% 65.7% 0 107G 91.2G 0|182 5|128 56.7m 77.2m 633 myReplicaSet10003 PRI Apr 21 18:38:57.476
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
30744 30697 30771 *0 257 499|0 20.2% 65.6% 0 107G 91.2G 0|122 6|128 62.1m 84.9m 633 myReplicaSet10003 PRI Apr 21 18:38:58.475
23751 23757 23867 *0 284 578|0 20.3% 65.7% 0 107G 91.2G 0|365 4|128 48.4m 65.5m 633 myReplicaSet10003 PRI Apr 21 18:38:59.477
30903 30961 30817 *0 217 366|0 20.2% 65.6% 0 107G 91.2G 0|311 1|128 62.3m 85.6m 633 myReplicaSet10003 PRI Apr 21 18:39:00.474
25081 24904 24948 *0 347 674|0 20.2% 65.7% 0 107G 91.2G 15|0 2|128 50.6m 68.6m 633 myReplicaSet10003 PRI Apr 21 18:39:01.478
|
|
|
However, the most impacting activity is related to the truncation of the profiling table
Is this information from the FTDC data?
|
|
Thank you. Why does the newly added myShard_3 replica set receive only a few connections? myShard_3's ops are only 3000, and its port is 57017.
|
|
601290552@qq.com, I had a look at the provided data and I see several issues, mainly around chunk migration activity.
However, the most impacting activity is related to the truncation of the profiling table:
- This is from srvdb1.yto.cloud server after the readConcern:majority was disabled
- Between A-B and C-D we see most of the queued write operations (as you've reported), which also correlates with internal metadata locking
- The most interesting evidence here is the cursor truncate calls metric, which stands for truncations per second on a capped collection. The table we use to store profiling data is a capped collection
Please disable profiling, let the workload run for a day or two and upload the fresh data from srvdb1.yto.cloud for us to review (no need to upload data from the rest of the servers). This might not solve all the observed issues but hopefully will remove one of the main bottlenecks.
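For reference, a minimal sketch of what "profiling disabled" looks like in the mongod configuration file (the runtime equivalent is db.setProfilingLevel(0) from the mongo shell):

```yaml
# Sketch: profiler turned off in the mongod config file.
# This mirrors the operationProfiling block visible in the
# automation-mongod.conf quoted elsewhere in this ticket.
operationProfiling:
  mode: "off"
```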
Thanks,
Dima
|
|
Hi 601290552@qq.com, thanks for pointing out the shard name where you have disabled the readConcern:majority. I am looking at the data now and will update you as I progress.
To your questions:
If I convert PSA to PSS, do I need to disable readConcern:majority for PSS?
No, you do not need to disable readConcern:majority for the PSS architecture.
|
|
If I convert PSA to PSS, do I need to disable readConcern:majority for PSS? Also, after I added the new myShard_3 to the cluster, there are only a few connections to the new replica set. How can I rebalance connections to the new replica set?
|
|
[root@srvdb5 mongod27019]# more automation-mongod.conf
# THIS FILE IS MAINTAINED BY http://10.5.130.57:8080 . DO NOT MODIFY AS IT WILL BE OVERWRITTEN.
# To make changes to your MongoDB deployment, please visit http://10.5.130.57:8080 . Your Group ID is 5e6f4cfb9d1f143e469e89df .
net:
bindIp: 0.0.0.0
maxIncomingConnections: 65535
port: 27008
serviceExecutor: adaptive
operationProfiling:
mode: slowOp
slowOpThresholdMs: 100
processManagement:
fork: "true"
replication:
enableMajorityReadConcern: false
oplogSizeMB: 81920
replSetName: myShard_2
sharding:
clusterRole: shardsvr
storage:
dbPath: /disk_data/test/mongod27019
engine: wiredTiger
wiredTiger:
collectionConfig: {}
engineConfig:
cacheSizeGB: 100
directoryForIndexes: true
indexConfig: {}
systemLog:
destination: file
path: /disk_data/test/mongod27019/mongodb.log
[root@srvdb5 mongod27019]#
|
|
Please see the readconcern.png attachment; I disabled this parameter for the cluster.
|
|
"storageEngine" : {
        "name" : "wiredTiger",
        "supportsCommittedReads" : false,
        "supportsSnapshotReadConcern" : true,
        "readOnly" : false,
        "persistent" : true
},
|
|
Hi 601290552@qq.com, thank you for uploading all the requested information.
Based on the uploaded logs, you still have PSA architecture with readConcern:majority enabled:
2020-04-20T13:30:32.680+0800 I CONTROL [conn1517] Replica Set Member State: ARBITER
|
As mentioned in this comment, this is not a recommended configuration. You need to either convert it to PSS or disable readConcern:majority.
Also, running with database profiling enabled under heavy load is not recommended, as it might impact the cluster's performance.
Thanks,
Dima
|
|
mongos> sh.status()
--- Sharding Status ---
sharding version:
{
"_id" : 1,
"minCompatibleVersion" : 5,
"currentVersion" : 6,
"clusterId" : ObjectId("5e7c6d892a7805d574a0820b")
}
shards:
{ "_id" : "myShard_1", "host" : "myShard_1/srvdb1.yyto.cloud:27007,srvdb2.yyto.cloud:27007", "state" : 1 }
{ "_id" : "myShard_2", "host" : "myShard_2/srvdb2.yyto.cloud:27008,srvdb5.yyto.cloud:27008", "state" : 1 }
{ "_id" : "myShard_3", "host" : "myShard_3/srvdb2.yyto.cloud:57017,srvdb6.yyto.cloud:57017", "state" : 1 }
active mongoses:
"4.0.16" : 3
autosplit:
Currently enabled: yes
balancer:
Currently enabled: no
Currently running: no
Failed balancer rounds in last 5 attempts: 0
Migration Results for the last 24 hours:
815 : Success
96 : Failed with error 'aborted', from myShard_2 to myShard_3
5 : Failed with error 'aborted', from myShard_1 to myShard_3
databases:
{ "_id" : "config", "primary" : "config", "partitioned" : true }
config.system.sessions
shard key:
{ "_id" : 1 }
unique: false
balancing: true
chunks:
myShard_1 1
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : { "$maxKey" : 1 } } on : myShard_1 Timestamp(2, 0)
{ "_id" : "expMonitordb", "primary" : "myShard_1", "partitioned" : true, "version" : { "uuid" : UUID("af7418f5-b931-47c7-b397-776395dcec0f"), "lastMod" : 1 } }
expMonitordb.lostMonitor
shard key:
{ "no" : "hashed" }
unique: false
balancing: true
chunks:
myShard_1 17909
myShard_2 18791
myShard_3 902
too many chunks to print, use verbose if you want to force print
|
|
I have finished uploading the files: mongoftdc.tar.gz
|
|
Hello,
I will upload the files to you, but the compressed file is 2.5 GB, so it is slow.
I have now added a new replica set to the MongoDB cluster, so the cluster has three shards. But the new replica set receives only a few connections, so its ops are just 2000/s while the other nodes do 30000 ops.
I have now stopped the rebalance, but it is still just 2000 ops.
|
|
Hi 601290552@qq.com,
Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) from all members of this replica set and upload them to this support uploader location?
Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.
Thanks,
Dima
|
|
Hello,
With a three-member PSS architecture for the sharded cluster, will it still have the same cache-pressure issue? Do I still need to disable readConcern:majority? Thanks.
|
|
Hi 601290552@qq.com,
Having a secondary node running but occasionally lagging might still have the same issue of cache pressure.
The impact of disabling readConcern:majority is documented under the same link I posted in my last comment, here it is for convenience.
Thanks,
Dima
|
|
If I disable readConcern:majority, does readConcern become local?
If I change it to false, will it cause any problems for the MongoDB cluster?
|
|
With a three-member PSA architecture, the cache pressure will increase if any data bearing node is down. To prevent the storage cache pressure from immobilizing a deployment with a PSA architecture, you can disable read concern by setting either:
In my environment the secondary node is running, but there is still cache pressure on the primary.
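For context, the setting the quoted documentation refers to is the same one visible in the automation-mongod.conf in this ticket; a minimal sketch of the relevant fragment (it applies to PSA deployments and must be set on every data-bearing member):

```yaml
# Sketch: disabling "majority" read concern on a PSA replica set member
# to relieve storage cache pressure when a data-bearing node is down.
replication:
  enableMajorityReadConcern: false
```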
|
|
Hello,
I checked my MongoDB cluster and its readConcern is majority. But another MongoDB cluster is PSS and its readConcern is also majority. How should I handle this?
If I change readConcern:majority to false, will it affect my MongoDB cluster?
|
|
Hi 601290552@qq.com,
Thank you for the report. Could you please clarify what MongoDB bug you are reporting here?
In your comment, you've mentioned that this replica set uses arbiters, and (I assume) readConcern:majority is enabled. This is not a recommended configuration. Please see our documentation on how to address this issue.
Thanks,
Dima
|
|
[root@srvdb5 log]# mloginfo --queries mongodb.log
source: mongodb.log
host: unknown
start: 2020 Apr 18 17:55:17.289
end: 2020 Apr 18 17:56:28.866
date format: iso8601-local
length: 247052
binary: unknown
version: >= 3.0 (iso8601 format, level, component)
storage: unknown
QUERIES
namespace operation pattern count min (ms) max (ms) mean (ms) 95%-ile (ms) sum (ms)
expMonitordb.lostMonitor update None 81732 100 1353 198.16910145353106 n/a 16196757
local.oplog.rs find {"ts": 1} 2 2015 58705 30360.0 n/a 60720
local.oplog.rs find None 1 372 372 372.0 n/a 372
expMonitordb.lostMonitor find {"no": 1} 1 110 110 110.0 n/a 110
[root@srvdb5 log]#
|
Generated at Thu Feb 08 05:14:50 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.