[SERVER-25328] Performance issues with WiredTiger - huge latency peaks Created: 29/Jul/16  Updated: 06/Dec/17  Resolved: 07/Nov/17

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: None
Fix Version/s: 3.4.0-rc1

Type: Bug Priority: Major - P3
Reporter: Piotr Bochynski Assignee: Xiangyu Yao (Inactive)
Resolution: Done Votes: 4
Labels: 3.7BackgroundTask, RF
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive SERVER-25328.zip     PNG File Screenshot-mongo3.2.12.png     PNG File Screenshot-mongo3.2.8.png     PNG File Screenshot-mongo3.4.10.png     PNG File Screenshot-mongo3.6.png     PNG File Screenshot1mongo3.2.7.png     PNG File Screenshot2mongo3.2.7syncdelay0.png     PNG File Screenshot3mongo3.2.1.png     PNG File checkpoints.png     Zip Archive diagnostic-data-SERVER-25328.zip    
Issue Links:
Related
is related to WT-2831 Skip creating a checkpoint if there h... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Storage 2017-11-13
Participants:

 Description   

We are storing a huge number of collections (thousands) in our databases. We are planning to migrate our MongoDB storage engine from MMAPv1 to WiredTiger, but before doing that we ran a set of performance tests on MongoDB 3.2.7 and 3.2.8. We created a test dataset with a large number of collections (30,000) and wrote a test that performs only read-by-id operations. The results showed latency peaks (please see the attached screenshot 1). The test was executed on the following hardware configurations:

  • a replica set with 3 nodes deployed on Amazon EC2 with SSD drives using the XFS file system
  • a single MongoDB instance running on a MacBook Pro

We observed similar performance characteristics for both configurations. After reading the docs and tuning the WiredTiger configuration, we discovered that the peaks are probably caused by the periodic flush of memory to disk (fsync). We tried setting the syncdelay option to zero (which is actually not recommended) and noticed that performance was better, but the peaks were still there (please see attached screenshot 2).
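For reference, a minimal sketch of how the syncdelay option can be set, either at runtime from the mongo shell or at startup (the value shown is only for illustration):

    // Change the checkpoint interval at runtime; syncdelay: 0 disables the
    // periodic fsync-triggered checkpoint and is not recommended for production.
    db.adminCommand({ setParameter: 1, syncdelay: 0 })

    // Equivalent startup option:
    //   mongod --syncdelay 0
    // or in the YAML config file:
    //   storage:
    //     syncPeriodSecs: 0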

In order to reproduce the problem, please use the attached zip file, which contains the following:

  • a Mongo shell script to easily load the test data
  • a minimal REST service that makes calls to Mongo
  • a Gatling simulation that makes calls to the REST service

Steps to reproduce:
1. Load the test data with the MongoDB shell script
2. Run REST service
3. Run Gatling simulation
4. Notice the recurring peaks on the latency charts (a minimal illustrative sketch of the load and read workload follows below)

For more detailed instructions on how to run it, please see README.txt in the zip.
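For illustration only, here is a minimal mongo shell sketch of this kind of workload. It is not the script from the attached zip; the collection names, document counts, and payload field are made up:

    // Hypothetical loader: creates many small collections, each with a few documents.
    var numCollections = 30000;     // matches the scale described above
    var docsPerCollection = 10;
    for (var c = 0; c < numCollections; c++) {
        var docs = [];
        for (var i = 0; i < docsPerCollection; i++) {
            docs.push({ _id: i, payload: new Array(101).join("x") });
        }
        db.getCollection("test_coll_" + c).insertMany(docs);
    }

    // Hypothetical read-by-id loop: queries random collections by _id only (no writes),
    // mirroring the read-only load that the REST service and Gatling generate.
    for (var n = 0; n < 100000; n++) {
        var c = Math.floor(Math.random() * numCollections);
        db.getCollection("test_coll_" + c).findOne({ _id: Math.floor(Math.random() * docsPerCollection) });
    }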

We also ran the above tests on Mongo 3.2.1, conducting multiple runs both locally and on our AWS machines, and the performance was fine: there were no peaks. The results can be seen in screenshot 3.



 Comments   
Comment by Xiangyu Yao (Inactive) [ 03/Nov/17 ]

I retested this workload with mongod-3.2.8 on my local Linux machine and verified that there are latency peaks coinciding with checkpoints.

I then retested this workload with our latest mongod-3.6 and got the results shown in the attached Screenshot-mongo3.6.png. It seems the latency peaks are gone now.

I also tested it on mongod-3.2.12 and mongod-3.4.10 and verified that they don't have this issue.
mongod-3.2.12: see the attached Screenshot-mongo3.2.12.png

mongod-3.4.10: see the attached Screenshot-mongo3.4.10.png

Comment by Alexander Gorrod [ 12/Aug/16 ]

The issue identified here should be improved for query-only workloads by the change outlined in WT-2831.

Comment by Piotr Bochynski [ 05/Aug/16 ]

Hi Thomas,
We ran the tests several times (without restarting mongod) and we always observed peaks during the background flush with mongo 3.2.7. The interesting thing is that the test performs only read operations (no writes at all). The same test executed on mongo 3.2.1 gives much better results (10 ms vs 100 ms).
So right now we are stuck on version 3.2.1 and cannot upgrade Mongo.

Thank you,
Piotr

Comment by Kelsey Schubert [ 03/Aug/16 ]

Hi p.bochynski@gmail.com,

I'd like to quickly explain what we're observing in this simulation. When a checkpoint begins, it blocks access to the tables' "access points" so it can get started. Because there are a lot of tables in this dataset, this step is somewhat time consuming.

This lock does not block existing connections from accessing tables they have already touched, but it can cause delays when a connection to MongoDB tries to gather data from a table it has not yet accessed. In our work with your test application, we saw that on subsequent executions of the suite the maximum access time was under 15 ms.

You can confirm this behavior by rerunning the test without restarting the mongod instance.
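If it helps, checkpoint activity can also be watched from the mongo shell while the test runs; a sketch is below, where the statistic names are those exposed under serverStatus().wiredTiger.transaction and may vary slightly between versions:

    // Poll checkpoint-related WiredTiger statistics once per second.
    while (true) {
        var txn = db.serverStatus().wiredTiger.transaction;
        print(new Date().toISOString() +
              " checkpoints: " + txn["transaction checkpoints"] +
              " most recent (ms): " + txn["transaction checkpoint most recent time (msecs)"] +
              " max (ms): " + txn["transaction checkpoint max time (msecs)"]);
        sleep(1000);    // mongo shell helper
    }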

I am moving this ticket to the WiredTiger team's backlog while we discuss next steps. Thank you again for the excellent report; the reproduction greatly helped our investigation.

Best regards,
Thomas

Comment by Kelsey Schubert [ 29/Jul/16 ]

Hi p.bochynski@gmail.com,

Thank you for the very detailed bug report with clear reproduction steps. We are investigating this issue and will update this ticket when we know more.

Best regards,
Thomas

Comment by Piotr Bochynski [ 29/Jul/16 ]

Tests to reproduce the issue.
