[SERVER-29024] Collection drop slows/stalls down database operations Created: 30/Apr/17  Updated: 21/Jun/17  Resolved: 23/May/17

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.4.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Aristarkh Zagorodnikov Assignee: Alexander Gorrod
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Single-socket physical machine
XFS on a SATA SSD
Ubuntu 14.04 x64, 4.4 (HWE) kernel
MongoDB 3.4.4


Attachments: PNG File Screen Shot 2017-05-01 at 3.26.51 pm.png     PNG File Screen Shot 2017-05-01 at 3.27.47 pm.png     File diagnostic.data-db5.tar     File diagnostic.data.tar     PNG File event-diagnostics.png     File mongodb-db2.log.gz     File mongodb-drive2-db5.log.gz    
Issue Links:
Duplicate
duplicates SERVER-27347 Only close idle cached cursors on the... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

Dropping a collection that is "active" (I'm guessing this means not yet checkpointed), possibly when the drop is concurrent with an active checkpoint, leads to a visible slowdown (queries that took tens of milliseconds start taking hundreds and thousands of milliseconds) and bumps up the checkpoint time. We have been hit by this since we upgraded to 3.4. The upgrade to 3.4.4 (WT-3207) helped with "inactive" collections (we create/drop collections that are grouped per day), but dropping "active" collections still leads to latency spikes, and since the system is sized for latency N, when latency becomes 100*N it just can't cope with the load.
The worst part of it is that the entire database instance is affected, not just the collection or database in question.

SERVER-22199 + WT-2333 and WT-2334 seem to be related.

Attaching the subject log. The WT checkpoint time, which averaged 3.5 sec for the hour before/after the event, spiked to 25 sec (we've seen 45 sec checkpoints on previous days).
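
For reference, a minimal reproduction sketch of the pattern described above (the connection URI, database, and collection names here are made up, not taken from our deployment): it times reads against an unrelated collection while dropping a recently written per-day collection.

```python
# Sketch only: hypothetical names/URI; times a read on an unrelated collection
# immediately before and after dropping a recently written ("active") collection.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical instance
db = client["appdata"]                             # hypothetical database

# Unrelated "hot" collection whose read latency we monitor.
hot = db["hot_collection"]
hot.insert_one({"_id": 1, "payload": "x"})

# A per-day collection that has recently been written to ("active").
daily = db["events_2017_05_04"]
daily.insert_many([{"n": i} for i in range(10000)])

def timed_read():
    start = time.perf_counter()
    hot.find_one({"_id": 1})
    return (time.perf_counter() - start) * 1000.0  # milliseconds

print("before drop: %.1f ms" % timed_read())
db.drop_collection("events_2017_05_04")            # the drop under suspicion
print("right after drop: %.1f ms" % timed_read())  # expect a spike on affected builds
```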



 Comments   
Comment by Alexander Gorrod [ 23/May/17 ]

This time even a drop of an inactive collection took >7 seconds, stalling other queries to different collections/databases for ~5 seconds. Please take a look, because this doesn't look like a cursor problem to me; we usually don't have long-running queries on this instance.

It seems as though we haven't done a good job of explaining why some drops cause a temporary slowdown across all databases. The reason is that there is a cache of references to collections that is used to avoid the overhead of opening and closing those references for each operation. When a drop happens, that cache becomes invalidated, and it can take a long time to re-populate. The work in SERVER-27347 aims to mitigate the cache invalidation when a drop is done, which should also mitigate the symptoms you are reporting.
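
To illustrate the mechanism, here is a toy model (this is not the actual MongoDB/WiredTiger code): a shared cache of open handles is swept when any collection is dropped, so the next operation on every collection pays the re-open cost again.

```python
# Toy model of the behaviour described above (not MongoDB/WiredTiger code): a shared
# cache of open collection handles is swept when any collection is dropped, so the
# next operation on *every* collection pays the handle re-open cost again.
import time

OPEN_COST_S = 0.01  # pretend opening a handle costs 10 ms

class HandleCache:
    def __init__(self):
        self._handles = {}

    def get(self, name):
        if name not in self._handles:   # cache miss: pay the open cost
            time.sleep(OPEN_COST_S)
            self._handles[name] = object()
        return self._handles[name]

    def invalidate_all(self):           # what a drop effectively does to the cache
        self._handles.clear()

cache = HandleCache()

def query_ms(name):
    start = time.perf_counter()
    cache.get(name)
    return (time.perf_counter() - start) * 1000.0

query_ms("a"); query_ms("b")                            # prime the cache
print("warm:", query_ms("a"), query_ms("b"))            # fast: handles are cached
cache.invalidate_all()                                  # a drop sweeps the cache
print("after drop:", query_ms("a"), query_ms("b"))      # every collection slow again
```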

Comment by Aristarkh Zagorodnikov [ 04/May/17 ]

Added more logs and diagnostic.data from another instance. This time even a drop of an inactive collection (at 2017-05-04T03:01:55.739+0300) took >7 seconds, stalling other queries to different collections/databases for ~5 seconds. Please take a look, because this doesn't look like a cursor problem to me; we usually don't have long-running queries on this instance.
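
A small filter I find handy for this kind of check (a sketch; the duration threshold and the attachment filename are assumptions): it pulls slow operations out of the attached mongod log around the time of the drop, so stalls on other collections/databases become visible.

```python
# Sketch: scan a gzipped mongod log for operations slower than a threshold.
# mongod appends the op duration (e.g. "7000ms") to the end of slow-op log lines.
import gzip
import re

THRESHOLD_MS = 1000
duration = re.compile(r"(\d+)ms$")

with gzip.open("mongodb-drive2-db5.log.gz", "rt") as log:  # attachment name assumed
    for line in log:
        line = line.rstrip()
        m = duration.search(line)
        if m and int(m.group(1)) >= THRESHOLD_MS:
            print(line)
```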

Comment by Alexander Gorrod [ 01/May/17 ]

It looks to me as though this is an instance where SERVER-27347 would help. The drop call causes MongoDB to flush the cursor cache, so operations that follow shortly after the drop need to pay the cost of opening new cursors rather than retrieving one from the cache. That causes performance to slow down, and the application consequently has more operations running in parallel than normal, which requires more cursors to be opened, since each connection uses its own cursor. The particular statistic that indicates this problem is "ss wt session open cursor count": you can see that it drops very low at the same time as the drop, then grows again afterwards.

I believe the longer checkpoint time is a consequence of the cursor cache flush/repopulate.
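
If you want to watch that statistic on a live instance, something like the following works (a sketch; the exact serverStatus field layout can vary by MongoDB/WiredTiger version):

```python
# Sketch: sample the WiredTiger open cursor count from serverStatus around a drop
# to see it fall and then climb back. Hypothetical connection URI.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

def open_cursor_count():
    status = client.admin.command("serverStatus")
    # In 3.4-era builds this lives under wiredTiger.session; adjust if your
    # version reports it elsewhere.
    return status["wiredTiger"]["session"]["open cursor count"]

for _ in range(10):
    print(open_cursor_count())
    time.sleep(1)
```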

Pictures below (the attached screenshots) are from the two instances I saw in the diagnostic data:

This is the shorter of the two drop-outs; there was also a change in workload prior to the drop, which meant there was additional load on the server.

This is the longer of the two drop-outs, and it can be seen that re-opening the cursors takes quite a long time after the cache is flushed.

Comment by Aristarkh Zagorodnikov [ 01/May/17 ]

I cut out some less recent metrics to fit the 150MB limit, but I believe it should be enough.

Comment by Kelsey Schubert [ 01/May/17 ]

Hi onyxmaster,

Thank you for reporting this issue. To help us investigate, would you please upload an archive of the diagnostic.data of the affected mongod?

Thanks again,
Thomas
