[SERVER-25070] Significant delays when large collections are responsible for full WiredTiger cache Created: 14/Jul/16  Updated: 01/Feb/17  Resolved: 30/Jan/17

Status: Closed
Project: Core Server
Component/s: Performance, WiredTiger
Affects Version/s: 3.2.7, 3.2.8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Chad Kreimendahl Assignee: Kelsey Schubert
Resolution: Done Votes: 3
Labels: RF
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File During-DiagnosticData1.png    
Issue Links:
Duplicate
is duplicated by SERVER-25663 Odd connection timeouts and rejection... Closed
is duplicated by SERVER-25760 Mongodump taking extraordinarily long... Closed
Operating System: ALL
Participants:

 Description   

We are unable to recreate this problem in 3.2.5. It appears to have existed since 3.2.7.

One of our database systems, which houses approximately 90 databases, most of which are less than 2GB in size, encounters an extreme degradation in performance when a particularly large collection (12GB) is loaded into the cache. The cacheSizeGB on this system is 32GB.
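
For reference, the configured cache limit and current usage can be read from serverStatus() in the mongo shell. This is only a minimal sketch; the wiredTiger.cache statistic names used below are the standard ones exposed by the server:

    // Run against the affected mongod in the mongo shell.
    // Reads the configured WiredTiger cache limit and current usage (bytes).
    var cache = db.serverStatus().wiredTiger.cache;
    var limitGB = cache["maximum bytes configured"] / (1024 * 1024 * 1024);  // should reflect cacheSizeGB = 32
    var usedGB  = cache["bytes currently in the cache"] / (1024 * 1024 * 1024);
    print("cache used: " + usedGB.toFixed(2) + " GB of " + limitGB.toFixed(2) + " GB");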

The problem only occurs when the WiredTiger cache is full AND the large collection appears to be loaded into the cache (the second condition is speculation on our part). In this scenario, requests that typically take 1ms begin to take 50-500ms, and the slow query log fills up rapidly.

We can replicate the above scenario by performing a simple query on the large collection when the WiredTiger cache is at or near its cache limit.
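
A rough sketch of that reproduction, assuming a hypothetical large collection bigdb.events (the names are illustrative, not our actual namespace): with the cache already at or near its limit, force the large collection through the cache with a full scan and watch latency on the small databases:

    // Hypothetical namespace; substitute the actual 12GB collection.
    var big = db.getSiblingDB("bigdb").events;
    var start = new Date();
    var n = big.find().itcount();  // unindexed full scan pulls the collection through the cache
    print("scanned " + n + " docs in " + (new Date() - start) + " ms");
    // While this runs, previously ~1ms queries on the small databases climb to 50-500ms
    // and start appearing in the slow query log.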

We've also been able to verify that the slowness does not happen when the database is otherwise taxed with a full cache. So we can execute a very long running map/reduce on other data and not impact performance. Only when the giant collections are loaded do we see the issue.



 Comments   
Comment by Chad Kreimendahl [ 31/Jan/17 ]

That works. We've not seen the problem since 3.2.10.

Comment by Kelsey Schubert [ 30/Jan/17 ]

Hi sallgeud,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,
Thomas

Comment by Kelsey Schubert [ 27/Sep/16 ]

Hi sallgeud,

I wrote a very detailed comment on SERVER-26055, which describes some of the fixes to eviction in WiredTiger that will be included in the upcoming release, MongoDB 3.2.10.

In particular, I think this issue may be corrected by the third improvement that I noted. If this is the case, it is likely that it would also improve the behavior in SERVER-25760 and SERVER-25663.

I would recommend upgrading to MongoDB 3.2.10 when it is released to confirm whether it fixes the issue you are observing. Please note that MongoDB 3.2.10-rc2 was released today if you would like to test it, and we expect that MongoDB 3.2.10 will be released early next week.

Thank you,
Thomas

Comment by Chad Kreimendahl [ 03/Aug/16 ]

The workload basically stays the same. We can cause the problem by running that mapreduce on the massive collection (8GB+). The mapreduce is just an aggregation and validation of data, so it's only reads, no writes.
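
As a rough illustration only (the map/reduce functions, field, and namespace below are hypothetical, not our actual job): a read-only map-reduce of this kind can use inline output, so nothing is written back to a collection:

    // Hypothetical read-only map-reduce over the large collection.
    // out: { inline: 1 } returns results to the client rather than writing a collection.
    var mapFn = function () { emit(this.accountId, 1); };                 // field name is illustrative
    var reduceFn = function (key, values) { return Array.sum(values); };  // Array.sum is a mongo shell helper
    var res = db.getSiblingDB("bigdb").events.mapReduce(mapFn, reduceFn, { out: { inline: 1 } });
    printjson(res.results.slice(0, 5));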

Comment by Kelsey Schubert [ 03/Aug/16 ]

Hi sallgeud,

In the last 2 hours preceding the restart (13:40), we're seeing a significant change in the load. To help us continue to investigate this issue, would you please provide additional details about your workload?

  • Can you clarify the change in the workload around this time?
  • In your description, you mention executing a long running map-reduce, is this when it starts? Does the map-reduce operation write to a collection?
  • Would you please provide the logs covering this time period? I have created a secure upload portal for you to use here.

Thank you,
Thomas

Comment by Chad Kreimendahl [ 03/Aug/16 ]

From our perspective the slow queries happen before the delay in the secondaries. What's strange to us is that the secondaries would fall behind at all when 90%+ of the commands running are queries and getmores. Are the secondaries getting lagged because the primary won't respond quickly enough?

Comment by Kelsey Schubert [ 02/Aug/16 ]

Hi sallgeud,

We are continuing to investigate this issue. However, I can provide some preliminary findings. We believe that SERVER-23622 is contributing to the behavior you are observing. There are two lagging replica set members, which causes the volume of the oplog in cache to grow. This is likely a positive feedback loop – cache pressure causes slow operations on the primary, which causes the secondaries tailing the oplog to lag, leading to more cache pressure on the primary and further slowness.
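
One way to watch for this loop from the mongo shell (a sketch; the per-member fields come from rs.status() and the per-collection cache figure from the wiredTiger section of collStats):

    // Approximate replication lag of each secondary relative to the primary.
    var status = rs.status();
    var primary = status.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
    status.members.forEach(function (m) {
        if (m.stateStr === "SECONDARY") {
            print(m.name + " lag: " + ((primary.optimeDate - m.optimeDate) / 1000) + " s");
        }
    });
    // How much of the oplog is currently resident in the WiredTiger cache (bytes).
    var oplogCache = db.getSiblingDB("local").oplog.rs.stats().wiredTiger.cache;
    print("oplog bytes in cache: " + oplogCache["bytes currently in the cache"]);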

We currently are examining whether there are other issues that may be impacting the performance you are seeing in addition to SERVER-23622 and we will update you when we know more.

Thank you for your help,
Thomas

Comment by Chad Kreimendahl [ 02/Aug/16 ]

Were you able to glean anything of value from the uploaded diagnostics?

Comment by Daniel Pasette (Inactive) [ 15/Jul/16 ]

Hi Chad, thanks for the report. Would you be able to upload the diagnostic.data for the impacted server?

Comment by Chad Kreimendahl [ 14/Jul/16 ]

This appears to be related to SERVER-24775.

Also, please replace my 3.2.6 with 3.2.5. I verified that our system ran 3.2.5 without issues; the problems began with the upgrade to 3.2.7.
