[SERVER-25070] Significant delays when large collections are responsible for full WiredTiger cache Created: 14/Jul/16 Updated: 01/Feb/17 Resolved: 30/Jan/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance, WiredTiger |
| Affects Version/s: | 3.2.7, 3.2.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Chad Kreimendahl | Assignee: | Kelsey Schubert |
| Resolution: | Done | Votes: | 3 |
| Labels: | RF |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
We are unable to recreate this problem in 3.2.5. It appears to have existed since 3.2.7.

One of our database systems, which houses approximately 90 databases, most of which are less than 2GB in size, encounters an extreme degradation in performance when a particularly large collection (12GB) is loaded into the cache. The cacheSizeGB on this system is 32GB. The problem only occurs if the WiredTiger cache is full AND the large collection appears to be loaded into the cache (speculation on the last part). In this scenario, requests that typically take 1ms begin to take 50-500ms, and the slow request log blossoms quickly.

We can replicate the above scenario by performing a simple query on the large collection when the WiredTiger cache is at or near its limit. We've also been able to verify that the slowness does not happen when the database is otherwise taxed with a full cache: we can execute a very long-running map/reduce on other data and not impact performance. Only when the giant collections are loaded do we see the issue. |
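For reference, a minimal mongo shell sketch of this reproduction, with hypothetical database, collection, and field names, might look like the following: it checks cache occupancy against the configured limit and then issues a simple query against the large collection.

```javascript
// Hypothetical reproduction sketch -- the database ("bigDatabase"), collection
// ("largeCollection"), and field names are assumptions, not from the ticket.
// Check how full the WiredTiger cache is relative to its configured maximum.
var cache  = db.serverStatus().wiredTiger.cache;
var usedGB = cache["bytes currently in the cache"] / Math.pow(1024, 3);
var maxGB  = cache["maximum bytes configured"] / Math.pow(1024, 3);
print("WiredTiger cache: " + usedGB.toFixed(1) + " GB used of " + maxGB.toFixed(1) + " GB");

// Uncompressed data size of the large collection on this node.
var big = db.getSiblingDB("bigDatabase").largeCollection;
print("collection size (GB): " + (big.stats().size / Math.pow(1024, 3)).toFixed(1));

// With the cache at or near its limit, a simple query against the large
// collection is enough to trigger the 50-500ms latencies described above.
big.find({ someField: "someValue" }).limit(10).toArray();
```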
| Comments |
| Comment by Chad Kreimendahl [ 31/Jan/17 ] |
|
That works. We've not seen the problem since 3.2.10. |
| Comment by Kelsey Schubert [ 30/Jan/17 ] |
|
Hi sallgeud, We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket. Regards, |
| Comment by Kelsey Schubert [ 27/Sep/16 ] |
|
Hi sallgeud, I wrote a very detailed comment on a linked ticket; in particular, I think this issue may be corrected by the third improvement that I noted there. If this is the case, it is likely that it would also improve the behavior reported here. I would recommend upgrading to MongoDB 3.2.10 when it is released to confirm whether it fixes the issue you are observing. Please note that MongoDB 3.2.10-rc2 was released today if you would like to test it, and we expect that MongoDB 3.2.10 will be released early next week. Thank you, |
| Comment by Chad Kreimendahl [ 03/Aug/16 ] |
|
The workload basically stays the same. We can cause the problem by running that mapreduce on the massive collection (8GB+). The mapreduce is just an aggregation and validation of data, so it's only reads, no writes. |
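A minimal sketch of the kind of read-only job described here, with hypothetical field names and inline output so that nothing is written back, is shown below.

```javascript
// Hypothetical sketch of a read-only map/reduce over the large collection
// (collection and field names are assumptions). Running this against a
// multi-GB collection pulls its pages through the WiredTiger cache while
// performing only reads.
var mapFn = function () {
    // Tally documents per status and flag records missing a required field.
    emit(this.status, { count: 1, missing: this.requiredField ? 0 : 1 });
};

var reduceFn = function (key, values) {
    var out = { count: 0, missing: 0 };
    values.forEach(function (v) {
        out.count += v.count;
        out.missing += v.missing;
    });
    return out;
};

db.largeCollection.mapReduce(mapFn, reduceFn, { out: { inline: 1 } });
```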
| Comment by Kelsey Schubert [ 03/Aug/16 ] |
|
Hi sallgeud, In the two hours preceding the restart (13:40), we're seeing a significant change in the load. To help us continue to investigate this issue, would you please provide additional details about your workload?
Thank you, |
| Comment by Chad Kreimendahl [ 03/Aug/16 ] |
|
From our perspective, the slow queries happen before the delay in the secondaries. What's strange to us is that the secondaries would fall behind at all when 90%+ of the commands running are queries and getmores. Are the secondaries lagging because the primary won't respond quickly enough? |
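A quick way to sanity-check both observations, assuming shell access to the primary, is sketched below: the cumulative opcounters confirm the read-heavy mix, and the replica set helper reports how far each secondary is behind.

```javascript
// Sketch, run in the mongo shell against the primary.
// Opcounters are cumulative since startup, so this is only a rough ratio.
var oc = db.serverStatus().opcounters;
var total = oc.query + oc.getmore + oc.insert + oc.update + oc["delete"] + oc.command;
print("query+getmore share of ops: " +
      (100 * (oc.query + oc.getmore) / total).toFixed(1) + "%");

// Per-secondary replication lag relative to the primary's latest write.
rs.printSlaveReplicationInfo();
```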
| Comment by Kelsey Schubert [ 02/Aug/16 ] |
|
Hi sallgeud, We are continuing to investigate this issue, but I can provide some preliminary findings. We believe that a known issue, tracked in a linked ticket, is a factor in the behavior you are seeing. We are currently examining whether there are other issues that may be impacting performance in addition to that one. Thank you for your help, |
| Comment by Chad Kreimendahl [ 02/Aug/16 ] |
|
Were you able to glean anything of value from the uploaded diagnostics? |
| Comment by Daniel Pasette (Inactive) [ 15/Jul/16 ] |
|
Hi Chad, thanks for the report. Would you be able to upload the diagnostic.data for the impacted server? |
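In case it helps, a short sketch, with an illustrative path, of confirming that diagnostic data capture (FTDC) is enabled and locating the directory to upload:

```javascript
// Confirm full-time diagnostic data capture (FTDC) is enabled (3.2+).
db.adminCommand({ getParameter: 1, diagnosticDataCollectionEnabled: 1 });

// The diagnostic.data directory lives under the server's dbPath. The path
// below is illustrative; getCmdLineOpts shows the configured value (when
// dbPath is set in the config file).
db.adminCommand({ getCmdLineOpts: 1 }).parsed.storage.dbPath;  // e.g. /var/lib/mongodb
// Files to upload: <dbPath>/diagnostic.data/metrics.*
```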
| Comment by Chad Kreimendahl [ 14/Jul/16 ] |
|
This appears to relate to a linked ticket. Also, please replace my 3.2.6 with 3.2.5. I verified that our system ran 3.2.5 without issues, but the upgrade to 3.2.7 is where they began. |