[SERVER-30426] dropDatabase very slow due to repeated calls to closeAllCursors Created: 31/Jul/17  Updated: 30/Oct/23  Resolved: 21/May/18

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 3.2.15, 3.4.6
Fix Version/s: 4.0.0-rc0

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Backlog - Storage Execution Team
Resolution: Fixed Votes: 29
Labels: 3.7BackgroundTask
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-33122 add option to limit WiredTiger cursor... Closed
Duplicate
Related
is related to SERVER-30238 High disk usage and query blocking on... Closed
is related to SERVER-31101 WT table not dropped after collection... Closed
is related to SERVER-27347 Only close idle cached cursors on the... Closed
Assigned Teams:
Storage Execution
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:
Case:

 Description   

Example:

  • dropDatabase with about 70 collections with 2 indexes each for a total of about 200 WT tables
  • about 325k open cursors
  • dropDatabase takes about 30 seconds and holds the global lock the entire time.

Collecting perf data shows mongod using 100% of a CPU for the duration, all in this stack:

mongo::WiredTigerSession::closeAllCursors(std::__cxx11::basic_string<...> const&)
mongo::WiredTigerSessionCache::closeAllCursors(std::__cxx11::basic_string<...> const&)
mongo::WiredTigerKVEngine::_drop(mongo::StringData)
mongo::WiredTigerKVEngine::dropIdent(mongo::OperationContext*, mongo::StringData)

For each of the 200 WT tables dropped we call closeAllCursors for that table. The perf stacks show all the time is spent in closeAllCursors itself, and the ftdc metrics show that we aren't actually closing any cursors, so it appears the time is spent scanning the list of 325k open cursors to find those with a matching table uri. That works out to roughly 200 x 325,000 ≈ 65 million uri comparisons, all performed while the global lock is held.
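
A minimal mongo shell sketch of the layout in this example is below (database, collection, and field names are hypothetical). It only recreates the ~200 WT tables; the 325k cached cursors come from normal workload across many connections and collections and are not produced by the script itself.

  // 70 collections x (1 data table + 2 index tables) ~= 210 WiredTiger tables
  var exampleDb = db.getSiblingDB("dropDbRepro");      // hypothetical database name
  for (var i = 0; i < 70; i++) {
      var coll = exampleDb.getCollection("coll" + i);
      coll.insert({a: 1});                              // creates the collection and its _id index
      coll.createIndex({a: 1});                         // second index table for this collection
  }
  // The slow path described above is hit here: one closeAllCursors() scan per
  // WT table dropped, all while the global lock is held.
  exampleDb.dropDatabase();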



 Comments   
Comment by Eric Milkie [ 23/May/18 ]

No, it takes effect immediately after the next cursor operation completes.
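
For reference, a runtime change would look roughly like this from the mongo shell. The value 0 (intended here to mean "do not cache cursors") is an assumption, since the value semantics differ between the 3.6 backport and 4.0; check the documentation for your build.

  // Change the parameter without restarting mongod; per the answer above it
  // takes effect once the next cursor operation completes.
  db.adminCommand({ setParameter: 1, wiredTigerCursorCacheSize: 0 });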

Comment by Michael [ 23/May/18 ]

Does the mongo server need to be restarted in order for the new parameter value to take effect?

Comment by Eric Milkie [ 21/May/18 ]

After SERVER-33378 and SERVER-32424, the number of cached cursors in each session is limited to the cursors that were used in the session since it was last checked out. Our testing shows that this has greatly improved the performance of calls to WiredTigerSession::closeAllCursors().

Comment by Eric Milkie [ 21/May/18 ]

In the linked server ticket SERVER-33122, you can find the parameter name: "wiredTigerCursorCacheSize".
In addition, we implemented a better cursor caching strategy in version 4.0, so I will resolve this ticket now.
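
To check what a running mongod is currently using, the standard getParameter command should work once the server is on a version that includes SERVER-33122 (sketch):

  // Read back the current cursor cache setting.
  db.adminCommand({ getParameter: 1, wiredTigerCursorCacheSize: 1 });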

Comment by Michael [ 21/May/18 ]

"We're actively investigating a temporary fix for this issue that would involve a server parameter to optionally turn off cursor caching."

@milkie, what is the parameter?

Comment by Eric Milkie [ 25/Oct/17 ]

I forgot to mention, the server parameter could be backported to 3.4, but the new drop behavior will not be.

Comment by Eric Milkie [ 25/Oct/17 ]

We're actively investigating a temporary fix for this issue that would involve a server parameter to optionally turn off cursor caching. This would result in a performance drop of up to 10%, but would completely avoid the drop issues and CPU consumption during collection and database drops.

In addition, in version 3.6, we have enhanced the way dropDatabase works so that it effectively drops all collections in the database first, without holding the global lock, before dropping the now-empty database. This won't affect the CPU consumption but it will improve server availability by significantly reducing the amount of time the global lock is held.
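
On versions without that enhancement, a rough manual equivalent from the mongo shell is sketched below (the database name is a placeholder). As noted above, this mainly shortens how long any single operation holds locks; the per-table cursor-cache scans, and therefore the CPU cost, still occur.

  var target = db.getSiblingDB("myDatabase");    // placeholder database name
  target.getCollectionNames().forEach(function (name) {
      target.getCollection(name).drop();         // one shorter lock hold per collection
  });
  target.dropDatabase();                         // little left to drop at this point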

Comment by Alexander Gorrod [ 23/Oct/17 ]

The discussion in SERVER-31101 is relevant to this ticket as well. It's likely that the fix for both issues will be related.

Comment by Aaron Queen [ 20/Oct/17 ]

I believe we're experiencing this on 3.4.9 as well. db.dropDatabase() on a roughly 2 MB database takes minutes and essentially brings all databases to a halt. The server has two processors, and we see close to 100% CPU on a single core, where it seems like something is spinning forever.

I'd like to know if this will be fixed soon and backported to 3.4 as it affects our operations when dropping databases. Is the proposed workaround for now to drop each collection individually, then the empty db afterward?

Comment by David Henderson [ 22/Sep/17 ]

Any progress on this issue - is it likely to hit a 3.4 release in the near future?

Comment by Bruce Lucas (Inactive) [ 30/Aug/17 ]

This problem was introduced by SERVER-27347. Prior to that we would close all cached WT cursors on a failed table drop. The new behavior is better for a single table drop because it only closes relevant cursors, but may perform worse for a large number of failed table drops in a short period of time because it will repeatedly scan the list of cursors looking for the relevant cursors to close.

Comment by Richard Yates [ 31/Jul/17 ]

Gotcha, thanks. I thought for a second we had a serious open cursor issue!

Let me/David know if we can be of further assistance with this one.

– Richard.

Comment by Bruce Lucas (Inactive) [ 31/Jul/17 ]

The 325k cursors here are internal: mongod maintains a cache of WT cursors to reduce the cost of creating and destroying WT cursors during normal operation. These are WT cursors and are not the same thing as mongod client cursors, which is the number you are looking at. The size that cache grows to depends on factors like the number of open connections and the number of collections concurrently in use in the past, so it is expected that this number can be quite a bit larger than the number of MongoDB client cursors currently open.
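
To see the two numbers side by side, both should be visible in serverStatus; the field paths below are my assumption for 3.4-era builds and may differ slightly between versions.

  // MongoDB client cursors (the number Cloud Manager reports):
  db.serverStatus().metrics.cursor.open.total;
  // WiredTiger-level cursors, which include mongod's cached WT cursors and can
  // be far larger than the client cursor count:
  db.serverStatus().wiredTiger.session["open cursor count"];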

Comment by Richard Yates [ 31/Jul/17 ]

Hi Bruce,

A quick question if I may: are the 325k open cursors internal cursors or actual cursors from connections? We use MongoDB Cloud Manager and it says this particular server only has 2 open cursors, so we're a little confused.

– Richard.
