[SERVER-29024] Collection drop slows/stalls down database operations Created: 30/Apr/17 Updated: 21/Jun/17 Resolved: 23/May/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.4.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Aristarkh Zagorodnikov | Assignee: | Alexander Gorrod |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Single-socket physical machine |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Participants: | |||||||||
| Description |
|
Dropping collection that is "active" (I'm guessing that used means not checkpointed), possibly if it's concurrent with an active checkpoint, leads to a visible slowdown (queries that took tens of milliseconds start taking hundreds an thousands of milliseconds) and bumps up the checkpoint time. We were hit by this since we upgraded to 3.4, and while the upgrade to 3.4.4 (
Attaching the subject log. The WT checkpoint time which is 3.5 sec average for an hour before/after the event spiked to 25 sec (we've seen 45 sec checkpoints in previous days). |
| Comments |
| Comment by Alexander Gorrod [ 23/May/17 ] |
It seems as though we haven't done a good job of explaining why some drops cause a temporary slowdown across all databases. The reason is that there is a cache of references to collections that is used to avoid the overhead of opening and closing those references for each operation. When a drop happens that cache becomes invalidated, and it can take a long time to re-populate the cache. The work in |
| Comment by Aristarkh Zagorodnikov [ 04/May/17 ] |
|
Added more logs and diagnostic.data from another instance. This time even drop of an inactive collection (at 2017-05-04T03:01:55.739+0300) took >7 seconds, stalling other queries to different collections/databases for ~5 seconds. Please take a look, because this doesn't look like a cursor problem to me, we usually don't have long-running queries for this instance. |
| Comment by Alexander Gorrod [ 01/May/17 ] |
|
It looks to me as though this is an instance where I believe the longer checkpoint time is a consequence of the cursor cache flush/repopulate. Pictures below from the two instances I saw in the diagnostic data: This is the shorter of the two drop outs - there was also a change in workload prior to the drop that means there was additional load on the server.
and This is the longer of the two drop outs, and it can be seen that re-opening the cursors takes quite a long time after the cache is flushed.
|
| Comment by Aristarkh Zagorodnikov [ 01/May/17 ] |
|
Did cut out some less recent metrics to fit the 150MB limt, but I believe it would be enough. |
| Comment by Kelsey Schubert [ 01/May/17 ] |
|
Hi onyxmaster, Thank you for reporting this issue. To help us investigate, would you please upload an archive of the diagnostic.data of the affected mongod? Thanks again, |