[SERVER-27045] Over 40% regression on mongodb using YCSB workload - follow on to server-26674 Created: 15/Nov/16 Updated: 23/Jan/17 Resolved: 05/Dec/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance, WiredTiger |
| Affects Version/s: | 3.4.0-rc3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Lilian Romero | Assignee: | Michael Cahill (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Steps To Reproduce: | System used: Power with 256GB of memory. |
| Sprint: | Storage 2016-12-12 |
| Participants: | |
| Description |
|
We opened ticket SERVER-26674 for the earlier regression; this ticket is a follow-on.
The vmstat output with version 3.4.0-rc3 shows highs and lows, whereas with 3.3.8 it is very even. |
| Comments |
| Comment by Jim Van Fleet [ 29/Nov/16 ] |
|
Well, 5% is almost within measurement variance. I ran again today and got profile, etc. data – I did not see the symptoms from earlier (widely varying CPU usage and calls to eviction). Throughput was down a bit because of the data collection, but I think we are OK for now. One thought, though, is to make the parameters we used the defaults. Perhaps OK to close. |
| Comment by Alexander Gorrod [ 28/Nov/16 ] |
|
Thanks for the new data jvf; from the numbers you posted, the most recent result is 5% slower than the numbers achieved on the 3.3.8 development release. I have analyzed the diagnostic data you uploaded, and throughput is very stable on this most recent run. I could not diagnose any further bottlenecks related to WiredTiger cache management. There have been a lot of changes that improve correctness and functionality between 3.3.8 and 3.4.0 – as such, it would be difficult to isolate the causes of this performance difference. Are you happy for this ticket to be closed now, with the knowledge that MongoDB will continue to improve performance in the future, and that YCSB is one of the benchmarks we use to track that improvement? |
| Comment by Jim Van Fleet [ 22/Nov/16 ] |
|
Diagnostic data |
| Comment by Jim Van Fleet [ 22/Nov/16 ] |
|
Ran with those changes; throughput improved a little, to 286722.60 ops/sec. Diagnostic data attached. |
| Comment by Michael Cahill (Inactive) [ 22/Nov/16 ] |
|
jvf, thanks for the additional information. A couple of things I noticed about the workload:
You should be able to restore the 3.3.8 behavior with the following settings:
Note that these settings will result in checkpoints taking a long time to complete, leading to additional disk requirements for journal data and slower recovery after an unclean shutdown. For this workload, they should avoid eviction entirely, because the dirty working set will stay in cache. |
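The settings themselves did not survive this export. As a rough sketch only: the tunables under discussion in this thread are WiredTiger's dirty-cache eviction limits, passed to mongod via the engine config string. The parameter values below are assumptions for illustration, not the ones actually posted in the comment:

```shell
# Hypothetical reconstruction – the actual settings were lost in export.
# Raising the dirty-cache limits back toward 3.3.8-era behavior lets far
# more of the cache stay dirty before eviction kicks in, at the cost of
# long checkpoints and slower crash recovery, as the comment notes.
mongod --dbpath /data/db \
    --wiredTigerEngineConfigString="eviction_dirty_target=80,eviction_dirty_trigger=95"
```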
| Comment by Jim Van Fleet [ 21/Nov/16 ] |
|
Ran today – same parms, same-ish result: 262966.39 ops/sec. Diagnostic data attached. |
| Comment by Michael Cahill (Inactive) [ 20/Nov/16 ] |
|
jvf, thanks, it's good to hear that this setting helped. Can you upload the diagnostic.data from that run so I can see where the remaining 13% is going? |
| Comment by Jim Van Fleet [ 18/Nov/16 ] |
|
I ran with that change and the result improved to 262908.22 ops/sec, which is still about 13% down from 3.3.8. You are certainly on the right track! – just a little bit more. Can you suggest different trigger percentages to experiment with? Other parameters? |
| Comment by Michael Cahill (Inactive) [ 18/Nov/16 ] |
|
lilianr@us.ibm.com, thank you for the data; I have been looking into this today. This reduction in throughput is caused by some changes to limit how much of the WiredTiger cache is permitted to be dirty by default. In the 3.3.8 runs, much of the cache is dirty, leading to variable memory use and checkpoints taking a very long time (over 10 minutes each). The 3.4.0-rc4 runs here are behaving better in terms of memory use and checkpoints completing, but the stalls are unexpected and unacceptable. If you are able to run further tests with 3.4.0 release candidates, can you please try starting MongoDB with the following setting:
That should largely restore the 3.3.8 behavior; if you can run that test, please let me know how the results look. I have reproduced this effect and will work to improve the behavior with the default settings. I will let you know when I have more information. |
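The setting referenced here was also dropped by the export. A hedged guess at its shape, assuming it was a single WiredTiger engine-config override passed at startup (the parameter and value are illustrative, not taken from the ticket):

```shell
# Illustrative only – the exact setting from the comment is missing.
# A single engine-config override of this form can be passed to mongod:
mongod --dbpath /data/db \
    --wiredTigerEngineConfigString="eviction_dirty_trigger=95"
```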
| Comment by Lilian Romero [ 16/Nov/16 ] |
|
The config file used is the default. |
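For reference, a "default" mongod config file of this era amounted to only a few fields; a minimal sketch, with paths and log destination as assumptions rather than values taken from the ticket:

```yaml
# Minimal default-style mongod.conf (paths are illustrative assumptions).
storage:
  dbPath: /data/db
  engine: wiredTiger
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
net:
  port: 27017
```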
| Comment by Lilian Romero [ 16/Nov/16 ] |
|
The attachment contains the diagnostic data for version 3.3.8 |
| Comment by Lilian Romero [ 15/Nov/16 ] |
|
The attachment contains the diagnostic data for 3.4.0-rc3 |
| Comment by Daniel Pasette (Inactive) [ 15/Nov/16 ] |
|
Hi Lilian, thanks for the report. A couple of questions:
|