[SERVER-32890] Background index creation sometimes blocks the whole server for dozens of seconds (even with empty collections) Created: 25/Jan/18 Updated: 30/Oct/23 Resolved: 04/Dec/18
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.4.3 |
| Fix Version/s: | 3.4.6 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Adrien Jarthon | Assignee: | Eric Milkie |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Storage NYC 2018-12-17 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
Ok, this is an issue we've been having regularly for months without being able to pinpoint the cause, so now that we've managed to collect more data I'm opening a ticket. The problem is that some operation (we recently reproduced it with index creation, but we've also seen the same impact with a collection drop) simply blocks the whole server for 10-30 seconds for no apparent reason. CPU/RAM/IO are fine, and the query is not even reported as taking time in the slow query log. It seems like the server is doing something internally that takes time (maybe eviction, checkpoint, or something else), and when this coincides with a query that needs a global lock (even for a short time), that lock can be held much longer than expected. I captured everything I could; here is what I have.

Here is the impact on the main application (database1):

And on another application using a different database (database2):

Here is an example of the mongo log:
Here we can see:
Here is how it looked in mongostat:
Here we can see the number of active readers rising to 256 before some start to queue, and the number of queries dropping to 0 during the blocking time. I also dumped a periodic db.currentOp() every 5 seconds to a file to make sure I captured the exact state the requests were in during the lock. I can upload the complete version to your private portal if you need it, but here's the interesting part:
At about 23:16:36+0100 we can see the query is still holding an exclusive database write lock, for more than 9 seconds now, without a single yield. It seems the global lock is only an intent write lock, so I'm not sure whether that can run concurrently, but queries on other DBs seemed blocked too, as you can see in the mongo log. 5 seconds later (about 23:16:41) we can see the index has now really started being built in the background; it is no longer holding an exclusive write lock and is yielding this time:
Of course this 15-second delay (sometimes 30 seconds) can't be explained by slow IO, as we're running on a healthy RAID 10 of 4 × 600 GB SSDs. The IO usage on the machine is between 0 and 1%. We do have a lot of collections though, so it may be related to the work WT has to do when creating/removing a namespace, as we seem to encounter the issue on index/collection creation/deletion. I have extracted the diagnostic.data dir and can of course upload it to your portal if you want.
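For reference, the periodic db.currentOp() capture described above could look roughly like this in the legacy mongo shell (a minimal sketch, not the exact script used; output goes to wherever the shell's stdout is redirected):

```js
// Dump the full db.currentOp() output every 5 seconds so the lock state
// during a stall is captured. Run with the legacy mongo shell, e.g.:
//   mongo --quiet host:27017/admin capture.js > currentop.log
while (true) {
  print("=== " + new Date().toISOString() + " ===");
  printjson(db.currentOp(true));   // true = include idle and system operations
  sleep(5000);                     // mongo shell helper, milliseconds
}
```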
| Comments |
| Comment by Eric Milkie [ 01/Feb/19 ] |

Hi Adrien, I filed
| Comment by Adrien Jarthon [ 01/Feb/19 ] |

Hi again, any chance of getting this looked at? I still have the diagnostic.data ready to be uploaded.
| Comment by Adrien Jarthon [ 17/Dec/18 ] |

Hi, sorry, but we've just had another occurrence today (still running 3.4.13), so there's still an issue here. We've modified our collection-dropping code to sleep 10 seconds between each deletion (sketched below) to give mongo some time to recover after the "short" global lock and not kill the platform, but unfortunately this wasn't enough and it still killed overall performance:

After investigation I found that this was caused by some collection deletions. I tried to upload the diagnostic.data, but the portal specified earlier doesn't accept files any more; I can upload it if you provide another portal. Here is the log from the drop queries: mongo_drop_log.txt And before you say this is probably fixed in a more recent version, we'll need better proof than last time, considering the high risk of upgrading...
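The workaround mentioned above is roughly the following (a sketch only; the collection names are placeholders, not the real ones):

```js
// Drop collections one by one, pausing 10 seconds between drops to give the
// server time to recover from the short global lock each drop takes.
var toDrop = ["old_coll_1", "old_coll_2", "old_coll_3"];  // placeholder names
toDrop.forEach(function (name) {
  db.getCollection(name).drop();
  sleep(10000);   // 10s pause between deletions
});
```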
| Comment by Eric Milkie [ 04/Dec/18 ] |

bigbourin@gmail.com, after some investigation, I believe the tickets that improved your issue were

I agree with you that our release notes for minor versions have in the past mixed performance enhancements, bug fixes, and cosmetic changes together without regard to severity. We are actively trying to improve our release notes in this regard, and strive to do better in the future.
| Comment by Adrien Jarthon [ 06/Apr/18 ] |

Hello, for the record: due to the lack of response and the urgency of the issue for us, we tried upgrading to mongodb 3.4.13 last month, even though we couldn't find a clear changelog entry about this issue, and it seems to be much better now. We've created/deleted a couple of indexes without any long locks since then. The checkpoint time went down, but not by much, so the impact is not only on that part; it seems something did improve in the way the lock is acquired or held. Though I must say we both could have saved some time if the changelog had been clearer about the important bug fixes and performance improvements. Even you were not able to tell me that this was fixed (or greatly improved) in the latest 3.4.x version.
| Comment by Adrien Jarthon [ 21/Feb/18 ] |

Ah yes, thanks, I was able to compute the average checkpoint time using "transaction checkpoint total time (msecs)" / "transaction checkpoints". On our staging server (almost no traffic) I see an average checkpoint time of 500ms for ~5500 tables, which means it doesn't scale linearly with table count, or at least it scales with something else too; does it depend on ops? On another server for my personal project (running 3.4.10) I see an average checkpoint time of 1.5s for ~50 tables, which is an interesting number; although it's a smaller machine, it's still powered by SSD and neither CPU nor IO is throttling. I hope checkpoint performance did not degrade that much between 3.4.3 and 3.4.10 ☺
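For reference, that computation in the mongo shell looks roughly like this (a sketch; the metric names are the ones quoted above from the WiredTiger section of serverStatus on 3.4):

```js
// Average checkpoint duration = total checkpoint time / number of checkpoints.
var txn = db.serverStatus().wiredTiger.transaction;
var avgMs = txn["transaction checkpoint total time (msecs)"] /
            txn["transaction checkpoints"];
print("average checkpoint time: " + avgMs.toFixed(0) + " ms");
```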
| Comment by Bruce Lucas (Inactive) [ 21/Feb/18 ] |

Hi Adrien, You can monitor the following serverStatus metric to get an idea of how long checkpoints are taking:
Bruce
| Comment by Adrien Jarthon [ 21/Feb/18 ] |

Ok great, glad it helped. Is there any way to measure or estimate the checkpoint duration without risking blocking the production DB?
| Comment by Bruce Lucas (Inactive) [ 17/Feb/18 ] |

Hi Adrien, Yes, your hypothesis about the cause was good, and I should have mentioned that it helped in pinpointing the issue. We'll pass your suggestions on to the development team. Regarding the impact of the issue in your case: yes, I suspect it could have to do with the relatively large number of tables (collections and indexes), as one phase of the checkpoint reviews this list while holding the schema lock. If you are able to reasonably reduce this number you may see shorter stalls; please let us know if that is the case. Thanks,
| Comment by Adrien Jarthon [ 16/Feb/18 ] |

Thanks for this great explanation, Bruce. I guess it's more complicated than that, of course, but just in case, I'll give my suggestions: this is a classic case of two different locks where the bigger one is held while waiting for the smaller one. I can think of a couple of options:

It doesn't seem like a lot of people are complaining about this issue, even though it doesn't look very specific to our use case. Any idea what makes us more prone to hitting it? Maybe the checkpoint process is very long for us? (I don't know what a normal value is; it seems to be running ~40% of the time for us.) Is it because we have a lot of namespaces? Would reducing this number help? Any other tips to reduce the checkpoint time?
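One way to sanity-check the "~40% of the time" figure is to compare cumulative checkpoint time against server uptime (a sketch, using the serverStatus fields quoted elsewhere in this ticket; both values are cumulative since startup):

```js
// Rough fraction of uptime spent inside checkpoints since the server started.
var s = db.serverStatus();
var checkpointMs = s.wiredTiger.transaction["transaction checkpoint total time (msecs)"];
print("time spent checkpointing: " +
      (100 * checkpointMs / s.uptimeMillis).toFixed(1) + "%");
```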
| Comment by Bruce Lucas (Inactive) [ 16/Feb/18 ] |

Hi Adrien, Thanks for the report. I wanted to update you with some progress on analyzing this issue. During an index build we

It appears we are getting hung up creating the WT table while holding the db W lock during checkpoints, because the WT schema lock is held for an extended period:

We're still considering next steps on this issue and will let you know when we have an update. Bruce
| Comment by Adrien Jarthon [ 13/Feb/18 ] |

Hello Mark, hmm, that ticket seems to say it blocks the secondary when replicating the drop, but in our case it also happens on the primary, and not only when a collection/index is dropped. The first and most detailed report I put in this ticket only talks about index creation (no drop) and blocked the PRIMARY, so it's not the same thing IMO.
| Comment by Mark Agarunov [ 13/Feb/18 ] |

Hello bigbourin@gmail.com, Thank you for providing the additional data and information. After digging into this further, I believe this is an instance of

Thanks,
| Comment by Adrien Jarthon [ 12/Feb/18 ] |

And finally, to illustrate my point about this issue also being present in other cases: in my last diagnostic.data, a couple of hours before the collection drop (at 15:35:30 UTC+01), you can find a short lock of 8 seconds blocking the entire server again, for no apparent reason at all this time. There's no query holding a global lock visible in the slow query log, and the first queries to finish after the lock are benign single-record lookups with a perfect index; they don't even show a lock wait time:

We have these kinds of locks about 10 times a day. For the record, we tried restarting the secondaries last week to see if it would do any good with regard to this issue, and so far it looks like the secondaries have not experienced the issue again. Since the restart, the number of replication updates on the secondary is much more stable, whereas the number of updates on the primary didn't change: Not sure if that's of any help, but it can at least confirm it's not a machine capacity issue, and the mongo server seems to take time (or specific actions?) before entering this state. We do not have any precise point when this problem started, of course; the locks just grew longer and more frequent with time. We'll probably try stepping down the primary to restart it too, if it can at least buy a couple of days/months without slowdowns for our clients. Would you like us to do/check anything specific when we do?
| Comment by Adrien Jarthon [ 12/Feb/18 ] |

It's worth noting that we have a lot of namespaces (collections + indexes); I heard at MongoDB Europe 2017 that some improvements have been made / are in the works in this area ☺
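For scale, a rough way to count namespaces (collections plus their indexes, each a separate WiredTiger table) from the mongo shell is sketched below (an illustration; it also counts system collections):

```js
// Count every collection and every index; each is a separate WiredTiger table.
var total = 0;
db.adminCommand({ listDatabases: 1 }).databases.forEach(function (d) {
  var cur = db.getSiblingDB(d.name);
  cur.getCollectionNames().forEach(function (c) {
    total += 1 + cur.getCollection(c).getIndexes().length;  // collection + its indexes
  });
});
print("approximate table count (collections + indexes): " + total);
```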
| Comment by Adrien Jarthon [ 12/Feb/18 ] |

Sure, I've uploaded another diagnostic.data; in this one you can find multiple occurrences. For example, last Friday (09 Feb) I deleted 4 EMPTY collections: 2 of them went smoothly as expected, and 2 of them blocked not only the DB but the mongo server entirely, for 14 sec and 35 sec!! Here are the two slow query log lines:

I uploaded the slow query log for the whole hour to the portal. Here is how it looks in our web request queuing graph (time is UTC+1):

Here is the impact on another app in a different DB:

Here is how it looked in my mongo shell:
| Comment by Mark Agarunov [ 08/Feb/18 ] |

Hello bigbourin@gmail.com, Thank you for the additional information. We are still investigating this, but if you could provide the logs and/or diagnostic data from the other instances of this happening outside of index builds, it may help correlate what is common between them. Thanks,
| Comment by Adrien Jarthon [ 08/Feb/18 ] |

For the record, we're seeing locks like this not only during index creation but also during collection creation, deletion, and probably other operations. I guess that once the conditions are present, any operation requiring a global or db lock can trigger this issue.
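For illustration, the operations reported in this ticket as triggering the stalls are all DDL of this shape (placeholder names; per the title, even empty collections were affected):

```js
// Each of these takes a short exclusive lock; per the analysis in this ticket,
// colliding with a long checkpoint turned that into a multi-second stall.
db.events.createIndex({ created_at: 1 }, { background: true });  // background index build
db.events.drop();                                                // collection drop
db.createCollection("events_archive");                           // collection creation
```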
| Comment by Adrien Jarthon [ 29/Jan/18 ] |

Ok thanks, what's the D marker?
| Comment by Mark Agarunov [ 26/Jan/18 ] |

Hello bigbourin@gmail.com, Thank you for providing this data. While we are still investigating this, I've found some correlations in the diagnostic data. At C the createIndex command is issued; soon after, latency and queued reads spike, and we stop getting ftdc for a bit while it's blocked. E to F doesn't appear to be the same, as only lag increases while the other metrics stay flat. I will provide further updates as the investigation into this continues. Thanks,
| Comment by Adrien Jarthon [ 25/Jan/18 ] |

Done, I've uploaded the diagnostic.data dir, the mongo log for the day (24/01), and my file containing the raw output of db.currentOp() every 5 seconds.
| Comment by Kelsey Schubert [ 25/Jan/18 ] |

I've created a secure portal for you to attach files related to this ticket. Would you please upload the diagnostic.data as well as the complete logs covering the same period? Thank you,
| Comment by Adrien Jarthon [ 25/Jan/18 ] |

Here are a couple more charts of the machine resources yesterday evening: