[SERVER-43906] WiredTigerLAS.wt keeps growing daily Created: 09/Oct/19 Updated: 18/Dec/19 Resolved: 18/Dec/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.2.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Aleksandar Puskas | Assignee: | Dmitry Agranat |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Participants: | |
| Description |
|
I am running MongoDB 4.2.0 on Ubuntu 18.04.3 LTS, single server, no replicas. This is hosted in AWS on an m4.4xlarge instance (16 cores, 64GB RAM) with a 2.5TB volume (type io1, 8000 IOPS). The CloudWatch metrics look completely normal, and CPU usage stays between 20% and 25%. After upgrading to 4.2.0 I started observing abnormal growth of WiredTigerLAS.wt. The only way to reclaim the disk space is to restart the service, which needs to happen every day or two. In just under two days the file has grown to 270GB, which can hardly be considered normal. I am willing to share diagnostics data to get to the bottom of this. I have found a similar issue but it does not seem to apply, and I do not see a resolution in that existing issue. |
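One way the reported growth could be tracked between restarts is to periodically sample the size of WiredTigerLAS.wt in the dbpath. A minimal sketch, assuming the default Ubuntu dbpath of /var/lib/mongodb (adjust to your storage.dbPath):

```python
import os
import time

# Assumed dbPath (Ubuntu default); adjust to storage.dbPath in mongod.conf.
LAS_FILE = os.path.join("/var/lib/mongodb", "WiredTigerLAS.wt")

while True:
    size_gb = os.path.getsize(LAS_FILE) / 1024 ** 3
    print(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} WiredTigerLAS.wt: {size_gb:.1f} GiB")
    time.sleep(3600)  # sample once an hour
```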
| Comments |
| Comment by Dmitry Agranat [ 18/Dec/19 ] |
|
Hi aleksandarpuskas@gmail.com, thank you for reporting this issue. I'll go ahead and close this ticket. As a reminder, we recommend addressing the "hot pages" issue in your application to avoid the reported LAS file overhead. Regards, |
| Comment by Aleksandar Puskas [ 17/Dec/19 ] |
|
Thank you. I suppose we can close the issue now. I really hope we are done. |
| Comment by Dmitry Agranat [ 17/Dec/19 ] |
|
Hi aleksandarpuskas@gmail.com, Normally, we will try to reuse the free space in the LAS file even under heavier workload; as can be seen from your latest data, we did this successfully for the last 4 days. I believe the issue with the unbounded LAS file growth was addressed in 4.2.2. Do you have any further questions about this issue? Thanks, |
| Comment by Aleksandar Puskas [ 17/Dec/19 ] |
|
Hello Dima, Thank you for the explanation. Since the file will never shrink (just like before), does that mean it can still grow very large under an even heavier workload? For instance, it has now reached 27GB. Could it grow beyond this if the database happens to be under an even heavier write workload in, say, a few days? |
| Comment by Dmitry Agranat [ 17/Dec/19 ] |
|
Hi aleksandarpuskas@gmail.com, thank you for keeping us up to date on this issue. After reviewing the latest data from 4.2.2, I can confirm that we still see the same write-heavy (updates) workload putting a lot of pressure on the WT cache (as previously explained, due to hot pages). This, as expected, results in the cache overflow mechanism kicking in to store historical document versions for all updates more recent than the oldest “history window” pinned in the cache. As you can see, the cache overflow table entries metric grows until the end of the update workload. However, unlike before, on MongoDB 4.2.2 we are now removing entries from the WiredTigerLAS.wt file (see the cache overflow table remove calls metric). More importantly, the WiredTigerLAS.wt on-disk size no longer grows unbounded and stays flat (see the cache overflow table on-disk size metric), meaning that after removing some entries we are successfully reusing the LAS file's free space. This is the expected behavior: the file should not continue to grow; even under heavy write workload it should reach its steady state. It won't shrink, but it shouldn't grow either. Thanks, |
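The metrics referenced above (cache overflow table entries, cache overflow table remove calls, cache overflow table on-disk size) are reported under serverStatus().wiredTiger.cache on 4.2.x. A minimal sketch of how they could be polled with pymongo, assuming a local mongod on the default port; the stat names may be absent or renamed on other server versions:

```python
from pymongo import MongoClient

# Assumes a local mongod on the default port.
client = MongoClient("mongodb://localhost:27017")
cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

# Stat names as referenced in the comment above (MongoDB 4.2.x);
# .get() keeps the script from failing if a name differs on your version.
for stat in ("cache overflow table entries",
             "cache overflow table remove calls",
             "cache overflow table on-disk size"):
    print(f"{stat}: {cache.get(stat, 'n/a')}")
```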
| Comment by Aleksandar Puskas [ 17/Dec/19 ] |
|
Hi Dima, I have just uploaded the latest diagnostics data after the recent upgrade. Please let me know what you see. Thanks. |
| Comment by Dmitry Agranat [ 11/Dec/19 ] |
|
Hi aleksandarpuskas@gmail.com, yes, you can stop the create/drop script on MongoDB 4.2.2. Please upload the usual data after a couple of days for us to review. Thanks, |
| Comment by Aleksandar Puskas [ 11/Dec/19 ] |
|
Hi Dima, I have just upgraded. I will watch this over the next few days and let you know. I sincerely hope this is resolved now. Shall I stop the periodic collection create/drop now? Thank you. |
| Comment by Dmitry Agranat [ 10/Dec/19 ] |
|
Hi aleksandarpuskas@gmail.com, MongoDB 4.2.2 was just released, could you try it and let us know if it fixes the issue? |
| Comment by Dmitry Agranat [ 28/Nov/19 ] |
|
Hi aleksandarpuskas@gmail.com, both issues appear to be fixed in MongoDB 4.2.2. |
| Comment by Aleksandar Puskas [ 24/Nov/19 ] |
|
Yes, it seems the file size is back to normal now. Is there any ETA for the resolution of this issue? Thank you. |
| Comment by Dmitry Agranat [ 24/Nov/19 ] |
|
Hi aleksandarpuskas@gmail.com, based on the recent data, the WiredTigerLAS.wt file was removed after the process restart at 2019-11-21T07:40:05. Could you please check again? |
| Comment by Aleksandar Puskas [ 24/Nov/19 ] |
|
I have just uploaded new diagnostic data. |
| Comment by Dmitry Agranat [ 24/Nov/19 ] |
|
Hi aleksandarpuskas@gmail.com, looking at the diagnostic data you've uploaded, I do not see any reboot in the last 7 days (the data ends by Nov 21st, 07:12). Maybe it was done after Nov 21st, 07:12 UTC? If yes, could you upload the latest diagnostic.data which will cover this? |
| Comment by Aleksandar Puskas [ 21/Nov/19 ] |
|
Just so you know, I have done an urgent reboot to get rid of the huge file, but the file remained. It didn't shrink after the service was restarted, as I would have hoped. How can I recycle the file? I have also uploaded the latest diagnostics data.
|
| Comment by Dmitry Agranat [ 18/Nov/19 ] |
|
Thanks aleksandarpuskas@gmail.com for keeping us up to date with the latest information. We are aiming to address this issue by |
| Comment by Aleksandar Puskas [ 15/Nov/19 ] |
|
Hi Dima, I have just uploaded the latest diagnostics file. Please note that today I observed a significant drop in the WiredTigerLAS.wt file size. After being a couple of hundred gigabytes, it has now shrunk to just below 17 gigabytes. This seems to have happened somewhere in the last 24-48 hours. Please let me know what you make of it. The temp collection create/drop hack is still in place, but it didn't seem to matter when I first deployed it.
|
| Comment by Susan LoVerso [ 04/Nov/19 ] |
|
I will try to think of other suggestions. The tickets that are open to address this issue are |
| Comment by Dmitry Agranat [ 03/Nov/19 ] |
|
Thanks Aleksandar, we are looking at the latest data and will update you soon on our findings. |
| Comment by Aleksandar Puskas [ 03/Nov/19 ] |
|
After only half a day of running a script that creates/drops a temp collection on a "local" database, I am seeing an increase in WiredTigerLAS.wt size. Please see the before/after sizes below. Also, the matching diagnostics data was just uploaded.
|
| Comment by Aleksandar Puskas [ 02/Nov/19 ] |
|
Hi Dima, I have just uploaded the latest diagnostics data. The file WiredTigerLAS.wt has grown to ~253GB. I will now let the suggested workaround run for a few days and report back. Thanks! |
| Comment by Dmitry Agranat [ 24/Oct/19 ] |
|
Hi Aleksandar, Before the upgrade to 4.2.1, we observed the same behavior with the WiredTigerLAS.wt growth. Though the growth was more sustained, the file did grow to 5.5 GB. The biggest difference I can see, compared to the previous workload, is that during this time the issue with the "hot documents" was less severe. Specifically, the maximum page size at eviction metric was almost always around 8 MB (as opposed to the 73 MB we had seen before). After the upgrade, on the 23rd, and up until today, the workload has been almost idle, so we do not see any issues. If the workload goes back to executing heavy updates, I suggest implementing the above-mentioned create/drop sequence and letting it run for a few days. We can then review the data once again to determine the next steps. Thanks, |
| Comment by Aleksandar Puskas [ 24/Oct/19 ] |
|
I have just uploaded new data. Thanks Dima for all the hard work you are putting in to help me resolve this. |
| Comment by Dmitry Agranat [ 24/Oct/19 ] |
|
Thank you for the update, Aleksandar. Could you upload the latest diagnostic.data so that we can review it and provide up-to-date recommendations? |
| Comment by Aleksandar Puskas [ 23/Oct/19 ] |
|
Hi Dima, I can absolutely do this create/drop sequence, but since the database has been really stable for a week or more, is it really necessary to start immediately, or should we wait for the problems to begin first? |
| Comment by Dmitry Agranat [ 23/Oct/19 ] |
|
Hi Aleksandar, I completely understand the difficulties with the downgrade. I suggest we stay on 4.2 and continue the troubleshooting.
Yes, the workaround is to create and drop a temp collection every few minutes. This collection does not need to be in any specific database. This workaround would help us understand whether we are on the right track with identifying the root cause and fixing this. Thanks, |
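A minimal sketch of the suggested create/drop workaround, assuming pymongo and a throwaway namespace; the database and collection names ("las_workaround", "tmp") are placeholders, since per the comment above any database will do:

```python
import time
from pymongo import MongoClient

# Placeholder names; the workaround only needs some collection to be
# created and dropped on a regular cadence.
client = MongoClient("mongodb://localhost:27017")
db = client["las_workaround"]

while True:
    db.create_collection("tmp")   # create a throwaway collection ...
    db.drop_collection("tmp")     # ... and drop it straight away
    time.sleep(300)               # repeat every 5 minutes, per the suggestion above
```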
| Comment by Aleksandar Puskas [ 23/Oct/19 ] |
|
Hello Dima, Although there are backups of the data from before the upgrade, it is virtually impossible for me to get those backups up and running, because the server is too critical to play with in such a fashion. If I were to spin up a new server just to bring up the old database, I wouldn't be able to replicate the load or the access pattern that the live server is experiencing. I have been monitoring this issue closely, and even though I could still see the problem a few days after I initially reported it, for many days now the problem has more or less gone away. The file grew to a manageable size but has stayed there for a long time. Today I upgraded to 4.2.1 and will continue to monitor for issues. Are you saying that the workaround is to create and drop a temp collection every few minutes? Does the collection need to be in any specific database? Why does that help? Thank you!
|
| Comment by Dmitry Agranat [ 23/Oct/19 ] |
|
Hi Aleksandar, While we are waiting for data from the previous MongoDB version to determine if this is a regression, I've mentioned that you might consider temporarily downgrading. There might be another way to work around this issue on 4.2: create and drop some collection every 5 minutes. Thanks, |
| Comment by Keith Bostic (Inactive) [ 20/Oct/19 ] |
Yes, but I'd do it more often than that, maybe once every 5 minutes? |
| Comment by Keith Bostic (Inactive) [ 18/Oct/19 ] |
|
dmitry.agranat, I'm suspicious as well that something else is happening here other than
This isn't necessarily the case, but at the least the file should not continue to grow, that is, even under heavy write workload the file should reach steady state. It won't shrink, but it shouldn't grow either. |
| Comment by Dmitry Agranat [ 16/Oct/19 ] |
|
Hi Aleksandar, I am sorry to hear you are having these issues. In order for us to determine whether this is a regression, please provide the same data for the MongoDB version you used before moving to 4.2 (if it is still available). Just to summarize what we see based on the provided information.
Alternatively, while we are waiting for data from the previous MongoDB version to determine if this is a regression, you might consider temporarily downgrading. If the data from the previous MongoDB version is no longer available, downgrading to that previous version and providing us with the requested information will also help us progress this investigation. Once we are able to prove or refute whether this is a regression, we will also be able to answer the rest of the questions in your last comment. Thanks, |
| Comment by Aleksandar Puskas [ 10/Oct/19 ] |
|
Hi Dima, Was this introduced in 4.2.0? I do not recall having this issue before. Is there a way to turn this off or change the behavior? Whatever the apps are doing (it's not a single app) is difficult to change now and caused no issues in past versions. Also, I do not understand how 64GB RAM isn't enough. Any recommendations on how to improve the hardware to stop this from happening? Does the file ever automatically shrink? I have noticed it goes away only after the service is restarted. I can hardly restart the service daily just to get rid of this cache file. This feels like a huge bug. Growing the file by 150GB each day cannot be called a feature. The apps depend heavily on MongoDB, but this has me thinking I might want to move away and look for other solutions. Can you please tell me why concurrent updates to the same document would be bad? Is there a document that advises against this practice? Do you have a particular doc id that I can take a look at? Thanks! |
| Comment by Dmitry Agranat [ 10/Oct/19 ] |
|
Thanks Aleksandar, The growth of the WiredTigerLAS.wt file is related to the cache overflow feature. Cache overflow is somewhat like swap for the WiredTiger cache: it uses disk to store updates to documents by flushing cache contents from memory to disk. In cases of extreme cache pressure, we begin using the cache overflow table, which aims to prevent us from getting stuck. We store historical document versions for all updates more recent than the oldest “history window” pinned in the cache. From a quick review of the data provided, it looks like there are a lot of concurrent writes to the same document, which creates this cache pressure, which in turn results in the WiredTigerLAS.wt file growth. Could you review your application logic to figure out the reason behind these concurrent updates to the same document? Thanks, |
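For illustration only, the kind of hot-document access pattern described above could look like the following sketch; the namespace ("app"/"counters") and the "hot" _id are hypothetical, not taken from the reporter's workload:

```python
from concurrent.futures import ThreadPoolExecutor
from pymongo import MongoClient

# Hypothetical namespace; the point is many concurrent updates to one _id.
client = MongoClient("mongodb://localhost:27017")
coll = client["app"]["counters"]
coll.update_one({"_id": "hot"}, {"$setOnInsert": {"n": 0}}, upsert=True)

def bump(_):
    # Every worker updates the same document, so WiredTiger must retain many
    # historical versions of it, driving cache pressure and cache overflow.
    coll.update_one({"_id": "hot"}, {"$inc": {"n": 1}})

with ThreadPoolExecutor(max_workers=32) as pool:
    pool.map(bump, range(100_000))
```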
| Comment by Aleksandar Puskas [ 10/Oct/19 ] |
|
I have just uploaded the requested files. Please note that I remove the logs regularly as they grow very big too. If you need more logs for investigative purposes, I can collect them over the coming days. Thank you. |
| Comment by Dmitry Agranat [ 10/Oct/19 ] |
|
Hi aleksandarpuskas@gmail.com, Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location? Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Thanks, |
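One possible way to produce the requested archive, assuming the default Ubuntu paths /var/log/mongodb/mongod.log* and /var/lib/mongodb/diagnostic.data (adjust to your systemLog.path and storage.dbPath):

```python
import glob
import tarfile

# Assumed default Ubuntu paths; adjust to your mongod.conf settings.
with tarfile.open("server-43906-diagnostics.tar.gz", "w:gz") as tar:
    for log in glob.glob("/var/log/mongodb/mongod.log*"):
        tar.add(log)                               # rotated logs included
    tar.add("/var/lib/mongodb/diagnostic.data")    # full FTDC directory
```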