[SERVER-39355] Collection drops can block the server for long periods Created: 01/Feb/19  Updated: 28/Feb/19  Resolved: 14/Feb/19

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 3.4.14
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Eric Milkie Assignee: Donald Anderson
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2019-02-06 at 12.01.40 PM.png     Text File mongo_drop_log.txt    
Issue Links:
Related
is related to SERVER-27700 WT secondary performance drops to nea... Closed
is related to SERVER-32890 Background index creation sometimes b... Closed
is related to WT-1598 Remove the schema, table locks Closed
is related to SERVER-32424 Use WiredTiger cursor caching Closed
is related to SERVER-38779 Build a mechanism to periodically cle... Closed
Operating System: ALL
Sprint: Storage Engines 2019-02-25
Participants:
Story Points: 1

 Description   

Hi, sorry but we've just had another occurrence today (still running 3.4.13), so there's still an issue here. We've modified our collection-dropping code to sleep 10 seconds between each deletion (to give mongo some time to recover after the "short" global lock and not kill the platform), but unfortunately this wasn't enough and it killed overall performance:

After investigation I found that this was caused by some collection deletions. I tried to upload the diagnostic.data but the portal specified earlier doesn't accept files any more. I can upload it if you provide another portal.

Here is the log from the drop queries: mongo_drop_log.txt. We can see here that they are spaced 10 seconds apart (plus the drop duration) and that the drops take A LOT of time (all these collections were empty or had 5 records at most). They had some indexes though, which are not shown here but probably had to be destroyed at the same time. I don't know if it's a checkpoint global lock issue again, but it's definitely still not possible to drop collections in a big 3.4.13 mongo without killing it. For the record we have ~40k namespaces, which has not changed much since the db.stats I reported above.

And before you say this is probably fixed in a more recent version, we'll need better proof than last time considering the high risk of upgrading...



 Comments   
Comment by Donald Anderson [ 14/Feb/19 ]

bigbourin@gmail.com, I understand. I'm going to close this ticket; please reopen it if you need any more help on this.

Comment by Adrien Jarthon [ 13/Feb/19 ]

I see, thanks for the details. That does look like a possible cause. We'll try to let you know after we update to 3.6, but we're kind of skeptical because of all the trouble we had with mongo upgrades in the past and all the regressions there have been in 3.6 so far.

Comment by Donald Anderson [ 12/Feb/19 ]

bigbourin@gmail.com, SERVER-32424 and SERVER-38779 are relevant to this discussion. The WiredTiger cursor caching available in MongoDB 3.6/4.0 and enabled by SERVER-32424 was designed to address the problem of collections that cannot be dropped. Before those changes, every MongoDB session could hold up to 10,000 cursors open, whether or not they were currently being used by that session. Those cursors are held open in the cursor cache even when the session is cached and not in use, and when the session is reused to service new requests. While this generally avoids expensive cursor opens, the downside is that any open cursor on a table prevents the collection associated with that table from being dropped. We believe that is what is happening here; the large number of open cursors is one piece of evidence.
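
To make that concrete, here is a minimal standalone WiredTiger sketch (an illustration only, not the MongoDB integration; the home directory and table name are placeholders) showing how an open cursor on a table causes WT_SESSION::drop to fail with EBUSY until the cursor is closed:

{code:c}
/* Sketch: an open cursor blocks a table drop in WiredTiger. */
#include <stdio.h>
#include <wiredtiger.h>

int
main(void)
{
    WT_CONNECTION *conn;
    WT_SESSION *session;
    WT_CURSOR *cursor;
    int ret;

    /* "WT_HOME" is a placeholder directory for this sketch. */
    wiredtiger_open("WT_HOME", NULL, "create", &conn);
    conn->open_session(conn, NULL, NULL, &session);
    session->create(session, "table:example", "key_format=S,value_format=S");

    /* An open (or cached-but-open) cursor keeps a reference on the table. */
    session->open_cursor(session, "table:example", NULL, NULL, &cursor);

    /* The drop is refused while that reference exists. */
    ret = session->drop(session, "table:example", NULL);
    printf("drop with open cursor: %s\n", wiredtiger_strerror(ret)); /* expected: EBUSY */

    cursor->close(cursor);
    ret = session->drop(session, "table:example", NULL);
    printf("drop after closing cursor: %s\n", wiredtiger_strerror(ret)); /* expected: success */

    conn->close(conn, NULL);
    return (0);
}
{code}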

The switch to WT cursor caching enabled by SERVER-32424 pushes the caching of cursors down to the WT session. We have reference counting on the WT cursors and keep "active" and "passive" (only cached) reference counts on the underlying tables. WiredTiger allows drops of tables that have no active references. When a MongoDB session finishes processing a request, the cursors used in that request are cached in WT, which changes their references from active to passive. The end result is that a collection can be dropped as soon as the requests that use that collection complete.
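
A toy sketch of that active/passive accounting (illustrative only; these structures and function names are not the actual WiredTiger code):

{code:c}
/*
 * Each table tracks "active" references (cursors in use by an in-flight
 * request) and "passive" references (cursors parked in the cursor cache).
 * A drop only needs the active count to reach zero, even if cached cursors
 * still hold passive references.
 */
#include <stdbool.h>
#include <stdio.h>

struct table_refs {
    int active;  /* cursors in use by an in-flight request */
    int passive; /* cursors cached in the WT session cursor cache */
};

/* Caching a cursor at the end of a request: active -> passive. */
static void cursor_cache(struct table_refs *t) { t->active--; t->passive++; }

/* Reusing a cached cursor for a new request: passive -> active. */
static void cursor_reopen(struct table_refs *t) { t->passive--; t->active++; }

/* Drop is allowed once no request holds an active reference. */
static bool can_drop(const struct table_refs *t) { return t->active == 0; }

int
main(void)
{
    struct table_refs coll = { .active = 1, .passive = 0 };

    printf("request in flight, can drop: %d\n", can_drop(&coll));   /* 0 */
    cursor_cache(&coll);  /* request finished, cursor cached */
    printf("request finished, can drop: %d\n", can_drop(&coll));    /* 1 */
    cursor_reopen(&coll); /* next request reuses the cached cursor */
    printf("next request running, can drop: %d\n", can_drop(&coll)); /* 0 */
    return (0);
}
{code}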

SERVER-38779 is also part of this set of fixes; it closes old MongoDB sessions that have been idle for a long time. While these sessions may not prevent collections from being dropped, they still hold cached cursors to the underlying tables as passive references, and that prevents the actual files with the dropped table data from being removed, as well as preventing some internal data structures ("dhandles") from being freed.
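
An illustrative sketch of that idle-session sweep (the names, pool layout, and timeout value here are assumptions for the example, not the actual server code):

{code:c}
/*
 * Periodically close pooled sessions that have been idle too long, so the
 * cached cursors (passive references) they hold are released and dropped
 * tables' files and dhandles can actually be freed.
 */
#include <stdio.h>
#include <time.h>

#define IDLE_TIMEOUT_SECS (5 * 60) /* assumed sweep threshold */
#define NSESSIONS 3

struct cached_session {
    time_t last_used;   /* last time this pooled session served a request */
    int cached_cursors; /* passive references held by its cursor cache */
};

static void
sweep_idle_sessions(struct cached_session *pool, int n, time_t now)
{
    for (int i = 0; i < n; i++)
        if (pool[i].cached_cursors > 0 &&
            now - pool[i].last_used > IDLE_TIMEOUT_SECS) {
            printf("closing idle session %d, releasing %d cached cursors\n",
                i, pool[i].cached_cursors);
            pool[i].cached_cursors = 0;
        }
}

int
main(void)
{
    time_t now = time(NULL);
    struct cached_session pool[NSESSIONS] = {
        { now - 10, 4 },   /* recently used: left alone */
        { now - 900, 12 }, /* idle 15 minutes: swept */
        { now - 3600, 7 }, /* idle an hour: swept */
    };

    sweep_idle_sessions(pool, NSESSIONS, now);
    return (0);
}
{code}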

Both SERVER-32424 and SERVER-38779 are part of the 3.6 and 4.0 releases. Because of this, and the connection you see between the server blocking and collection drops, we think that an upgrade should help.

Comment by Adrien Jarthon [ 04/Feb/19 ]

Thanks, the file is uploaded.

Comment by Kelsey Schubert [ 01/Feb/19 ]

Secure upload portal for this issue.

Comment by Eric Milkie [ 01/Feb/19 ]

kelsey.schubert, can you set up a new portal for Adrien to upload the diagnostic data?
