[SERVER-25328] Performance issues with WiredTiger - huge latency peaks Created: 29/Jul/16 Updated: 06/Dec/17 Resolved: 07/Nov/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | None |
| Fix Version/s: | 3.4.0-rc1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Piotr Bochynski | Assignee: | Xiangyu Yao (Inactive) |
| Resolution: | Done | Votes: | 4 |
| Labels: | 3.7BackgroundTask, RF |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
|
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Storage 2017-11-13 |
| Participants: | |
| Description |
|
We store a very large number of collections (thousands) in our databases. We are planning to migrate our MongoDB deployments from MMAPv1 to WiredTiger, but before doing so we ran a series of performance tests on MongoDB 3.2.7 and 3.2.8. We created a test dataset with a large number of collections (30,000) and wrote a test that performs only read-by-id operations. The results showed latency peaks (please see the attached screenshot 1). The test was executed on the following hardware configurations:
We observed similar performance characteristics for both configurations. After reading the documentation and tuning the WiredTiger configuration, we discovered that the peaks are probably caused by the periodic flushing of memory to disk (fsync). We tried setting the syncdelay option to zero (which is not recommended) and noticed that performance improved, but the peaks were still there (please see attached screenshot 2). To reproduce the problem, please use the attached zip file, which contains the following:
Steps to reproduce: We also ran the above tests on MongoDB 3.2.1. We conducted multiple tests, both locally and on our AWS machines, and the performance is fine; there are no peaks. The results can be seen in screenshot 3. |
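As an illustration only (the reporter's actual test is in the attached zip), here is a minimal PyMongo sketch of this kind of read-by-id workload against many collections. The connection URI, the perftest database name, the coll_<n> naming, the document counts, and the commented-out syncdelay override are assumptions made for the example, not details from the report:

```python
# Illustrative sketch only -- the actual reproduction is in the attached zip.
# Assumes a local mongod running WiredTiger and a database named "perftest"
# holding many small collections (names and sizes are made up here).
import random
import time

from pymongo import MongoClient

NUM_COLLECTIONS = 30_000          # matches the collection count in the report
DOCS_PER_COLLECTION = 10          # arbitrary; just enough data to read
QUERIES = 100_000

client = MongoClient("mongodb://localhost:27017")
db = client["perftest"]

# Optionally mimic the syncdelay=0 experiment from the report. This disables
# the periodic fsync schedule and is NOT recommended in production.
# client.admin.command({"setParameter": 1, "syncdelay": 0})

# One-time data load: many collections, each with a few documents keyed by _id.
for c in range(NUM_COLLECTIONS):
    coll = db[f"coll_{c}"]
    if coll.estimated_document_count() == 0:
        coll.insert_many([{"_id": i, "payload": "x" * 64}
                          for i in range(DOCS_PER_COLLECTION)])

# Read-by-id workload: pick a random collection and _id, record per-query latency.
latencies = []
for _ in range(QUERIES):
    coll = db[f"coll_{random.randrange(NUM_COLLECTIONS)}"]
    start = time.perf_counter()
    coll.find_one({"_id": random.randrange(DOCS_PER_COLLECTION)})
    latencies.append(time.perf_counter() - start)

latencies.sort()
print("p50 %.2f ms, p99 %.2f ms, max %.2f ms" % (
    latencies[len(latencies) // 2] * 1000,
    latencies[int(len(latencies) * 0.99)] * 1000,
    latencies[-1] * 1000))
```

Recording per-query latency (rather than only aggregate throughput) is what makes the periodic peaks described above visible.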
| Comments |
| Comment by Xiangyu Yao (Inactive) [ 03/Nov/17 ] |
|
I retested this workload with mongod-3.2.8 on my local Linux machine and verified that there are latency peaks around checkpoints. I then retested this workload with our latest mongod-3.6 and got the results shown in the following screenshot. It seems the latency peaks are gone now. I also tested it on mongod-3.2.12 and mongod-3.4.10 and verified that they don't have this issue. |
| Comment by Alexander Gorrod [ 12/Aug/16 ] |
|
The issue identified here should be improved for query-only workloads by the change outlined in
| Comment by Piotr Bochynski [ 05/Aug/16 ] |
|
Hi Thomas, Thank you, |
| Comment by Kelsey Schubert [ 03/Aug/16 ] |
|
I'd like to quickly explain what we're observing in this simulation. When a checkpoint begins, it blocks access to the tables' "access points" so it can get started. As there are a lot of tables in this dataset, this is somewhat time-consuming. This lock does not block existing connections from accessing tables they have already touched, but it can cause delays when a connection to MongoDB tries to read from a table it has not yet accessed. In our work with your test application, we saw that on subsequent executions of the suite the maximum access time was under 15 ms. You can confirm this behavior by rerunning the test without restarting the mongod instance. I am moving this ticket to the WiredTiger team's backlog while we discuss next steps. Thank you again for the excellent report; the reproduction greatly helped our investigation. Best regards,
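A minimal sketch of the re-run check suggested in this comment, assuming the perftest database and coll_<n> collections from the sketch in the description; the key point is that both passes run against the same, still-running mongod process:

```python
# Sketch of the suggested check: run the same read pass twice against the SAME
# mongod (no restart in between) and compare the worst per-query time.
# Assumes the "perftest" database and coll_<n> naming from the earlier sketch.
import random
import time

from pymongo import MongoClient

NUM_COLLECTIONS = 30_000
QUERIES = 50_000

client = MongoClient("mongodb://localhost:27017")
db = client["perftest"]


def max_latency_ms(queries: int) -> float:
    """Worst observed find-by-_id latency, in milliseconds."""
    worst = 0.0
    for _ in range(queries):
        coll = db[f"coll_{random.randrange(NUM_COLLECTIONS)}"]
        start = time.perf_counter()
        coll.find_one({"_id": 0})
        worst = max(worst, time.perf_counter() - start)
    return worst * 1000


# First pass: pays the cost of touching tables this connection has not yet
# accessed, which is where the delays described above show up.
print("first pass max:  %.1f ms" % max_latency_ms(QUERIES))

# Second pass against the same, still-running mongod: the tables have already
# been touched, so the maximum should stay low (under ~15 ms in the tests above).
print("second pass max: %.1f ms" % max_latency_ms(QUERIES))
```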
| Comment by Kelsey Schubert [ 29/Jul/16 ] |
|
Thank you for the very detailed bug report with clear reproduction steps. We are investigating this issue and will update this ticket when we know more. Best regards, |
| Comment by Piotr Bochynski [ 29/Jul/16 ] |
|
Tests to reproduce the issue. |