[SERVER-17542] Out of Memory crash with wiredTiger Created: 11/Mar/15  Updated: 09/Jun/15  Resolved: 21/Apr/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.0.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Chad Kreimendahl Assignee: Sam Kleinman (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-17324 NULL pointer exception during insert ... Closed
Operating System: ALL
Participants:

 Description   

We upgraded to 3.0.0 the day after release and did a mongorestore to rebuild all of our databases under the wiredTiger engine. It worked well for approximately four days of low-level usage before throwing an "out of memory" error during regular (non-heavy) usage.

It doesn't appear that this is related to SERVER-16902, as we're talking about fewer than 100,000 inserts, updates, and deletes over several days, and nothing in bulk.

We're configured with 4GB of memory allocated to WT. Snappy compression is on, as is prefixCompression. The machine is virtual and its memory is dynamically allocated; it appears the system was using only 8GB of the 12GB of potential memory at the time of the crash.

wiredTiger:
  engineConfig:
    cacheSizeGB: 4
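
For completeness, the Snappy and prefixCompression settings mentioned above correspond to these options in the 3.0 config file format (a sketch, not a verbatim copy of our config file):

wiredTiger:
  collectionConfig:
    blockCompressor: snappy
  indexConfig:
    prefixCompression: true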

2015-03-10T16:28:52.007-0500 I WRITE    [conn42060] remove Monitoring.RecordSearchIssue ndeleted:53714 keyUpdates:0 writeConflicts:0 numYields:2484 locks:{} 54255ms
2015-03-10T16:28:52.009-0500 I COMMAND  [conn42060] command Monitoring.$cmd command: delete { delete: "RecordSearchIssue", writeConcern: { fsync: false, j: true }, ordered: true, deletes: [ { q: {}, limit: 0 } ] } keyUpdates:0 writeConflicts:0 numYields:0 reslen:40 locks:{} 54257ms
2015-03-10T16:28:52.369-0500 I COMMAND  [conn42060] command OnspringC1.$cmd command: listCollections { $query: { listCollections: 1 } } keyUpdates:0 writeConflicts:0 numYields:0 reslen:8432 locks:{} 289ms
2015-03-10T16:28:52.648-0500 I CONTROL  [conn42060] Stack trace failed, SymInitialize failed with error 3765269347
2015-03-10T16:28:52.648-0500 F -        [conn42060] out of memory.



 Comments   
Comment by Michael Templeman [ 09/Jun/15 ]

I have been regularly encountering an OOM error that I reported here:

https://jira.mongodb.org/browse/SERVER-16902

This also seems related to SERVER-16997.

Comment by Ramon Fernandez Marina [ 21/Apr/15 ]

sallgeud, since you were not able to recreate the issue, we're going to resolve this ticket. However, you may want to watch SERVER-17324 for updates on out-of-memory conditions in WiredTiger. See also this comment in SERVER-10044 for more details on the effect of the initial pagefile size on memory allocations.

Regards,
Ramón.

Comment by Chad Kreimendahl [ 10/Apr/15 ]

I was not able to recreate it after rerunning the tool that performs this operation. We also just switched our dev system back to mmapv1 in preparation for regression testing of a release we have coming out. We'll switch back over to WT in a few weeks, once we get all of our testing done.

Comment by Ramon Fernandez Marina [ 10/Apr/15 ]

Hi sallgeud, have you had a chance to run with a pagefile that's allowed to grow to 12GB? Any progress testing things on your end?

Thanks,
Ramón.

Comment by Chad Kreimendahl [ 20/Mar/15 ]

We have our classes keep track of their indexes (for the most part), so it shouldn't be a problem for us to just drop and recreate. We still have potential OOM issues. We're going to attempt to run with updated code that uses drop by Monday and confirm memory usage.

Comment by Ramon Fernandez Marina [ 20/Mar/15 ]

Hi sallgeud,

On Windows, the fact that the pagefile is not being used doesn't mean it wasn't needed; it may mean that Windows could not find enough memory for a given request (see SERVER-10044). So I'd recommend trying Sam's suggestion of allowing the pagefile to grow to 12GB.

Note that every document removal also updates each index, so on a collection with many documents and several indexes a remove() operation may take a long time and consume resources unnecessarily. You can, however, save the indexes and restore them after a drop():

// Recreate one saved index from its key pattern, skipping the
// automatically created _id index.
function restoreIndexes(element, index, array) {
    if (element.name != "_id_") {
        print('Restoring: ' + tojson(element.key));
        db.foo.ensureIndex(element.key);
    }
}
// Capture the index specs before the drop, then rebuild each index.
var indexes = db.foo.getIndexes();
db.foo.drop();
indexes.forEach(restoreIndexes);

This simple approach does not preserve index options, though, so the recommended approach is for your application to keep track of which indexes it needs. Hope this helps.
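
If you do want to carry over the most common options in the shell, a variant along these lines might work (a sketch only; extend the option list to cover whatever options your indexes actually use):

// Save the full index specs, drop the collection, then rebuild each
// index with a few common options copied from the saved spec.
var indexes = db.foo.getIndexes();
db.foo.drop();
indexes.forEach(function (spec) {
    if (spec.name === "_id_") return; // the _id index is recreated automatically
    var opts = {};
    ["unique", "sparse", "expireAfterSeconds"].forEach(function (o) {
        if (spec.hasOwnProperty(o)) opts[o] = spec[o];
    });
    db.foo.createIndex(spec.key, opts);
});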

Comment by Chad Kreimendahl [ 20/Mar/15 ]

It doesn't appear that we ever used more than ~250MB of swap/pagefile... or at least the system suggests that's the maximum it's hit. I'm not terribly concerned about Windows-only issues, as our higher-level environments with Mongo are primarily not Windows. Our general concern is the hard crash versus a graceful recovery or mere degradation, which would be more understandable if we were paging.

I'd be happy to use Collection.Drop(). However, doesn't that also destroy any indexes that were created? We really need to keep all of the indexes while just nuking the data, more like a database-style TRUNCATE. So, by our estimation, drop is substantially less practical than removeAll.

From the docs:

Removes a collection from the database. The method also removes any indexes associated with the dropped collection. The method provides a wrapper around the drop command.

Yes, separate issue. Yes, version 3.0.1 still takes extraordinarily long to show dbs. Just mentioning it here because, in gathering info while debugging this, I find myself doing a ton of waiting.

Comment by Sam Kleinman (Inactive) [ 20/Mar/15 ]

Thanks for the response, a couple of things:

  1. Calling remove() with an empty query document is relatively expensive, because MongoDB must perform the query and page all documents into memory just to delete them. Consider using the collection's drop() method for these kinds of operations.
  2. We've encountered a number of issues with Windows deployments where the size of the pagefile becomes a limiting factor, because of the way that Windows manages memory. You may be able to resolve this issue by increasing the size of the pagefile. Can you see if you can reproduce the crash with a larger pagefile, perhaps trying sizes between 4GB and 12GB?
  3. I believe that the latency of the show dbs operation (i.e. the listDatabases command) may be a distinct issue from the out-of-memory issue. Can you confirm that you're running a 3.0.x-series shell for this operation? Additionally, just to be sure, can you confirm that there aren't operations on other VMs on this machine contending for storage access, and that the slowdown of the listDatabases/listCollections operations is stable after the first run (see the timing sketch below)? In general, we'll continue to address this issue as part of SERVER-17078.
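
To take the shell's own formatting out of the picture, you could also time the underlying command directly (a quick sketch):

// Time the listDatabases command itself, independent of how the shell
// formats the output of "show dbs".
var start = new Date();
db.adminCommand({ listDatabases: 1 });
print("listDatabases took " + (new Date() - start) + " ms");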

Thanks for your patience!

Regards,
sam

Comment by Chad Kreimendahl [ 18/Mar/15 ]

Uncompressed data information (forgot to include):
[note: "show dbs" still takes approximately 30-60 seconds on the first run and 20-30 seconds on subsequent runs when using wiredTiger with lots of collections]

  • 24 Databases
  • Most under 100MB in uncompressed data size
  • Largest:

    "collections" : 275,
    "objects" : 608596,
    "avgObjSize" : 1971.8336548383493,
    "dataSize" : 1200050075,
    "storageSize" : 345194496,
    "numExtents" : 0,
    "indexes" : 2951,
    "indexSize" : 154775552,
    "ok" : 1

  • Database with issue:

    "collections" : 7,
    "objects" : 88548,
    "avgObjSize" : 141.33360437276957,
    "dataSize" : 12514808,
    "storageSize" : 4640768,
    "numExtents" : 0,
    "indexes" : 7,
    "indexSize" : 2056192,
    "ok" : 1

Comment by Chad Kreimendahl [ 18/Mar/15 ]

How much storage is available to MongoDB in the dbPath directory?

Total of 128 GB. 80.1 GB Free

What is your data size, both in terms of the data files on disk and as reported by dbStats?

WiredTiger directory (since I left the mmapv1 dir in place and used a new one for wt):

  • 21,333 files
  • 76 folders
  • 2.88 GB Actual
  • 3.90 GB On Disk (because of page sizes)

What is the size of the page file?

2GB

Is this system running directly on hardware or in some sort of virtual environment?

Hyper-V.

  • Server 2012 R2 host and virtual system.
  • Gen 1 Hyper-V
  • Dynamic Memory allocation (8GB min - 12GB max)
  • Recently changed to fixed 12GB
  • Solid state backed drives.

Have you tested this with the latest stable production release (3.0.1)?

Installed today. We encountered the race condition between mongod shutdown and the Windows service shutdown, but it's otherwise OK. We'll have to wait a week or so to know for sure.

Do you have a sense of what the operation (i.e. the delete operation on the second line of your log) was? It looks like you deleted 50k records just prior to the error. Would it be possible to try to recreate this issue, to see whether some aspect of the delete operation triggers the error?

There were no other log entries for approximately 10 minutes prior to those. It looks like we performed a Collection.RemoveAll() (C#) prior to rebuilding the content: a relatively simple delete of everything in the collection that appears to have taken quite a long time. It's unknown whether it coincided with a checkpoint or some other back-end operation, as it happened when nobody was really using the system. This call runs once a day, automatically, as part of a cleanup process for system analysis and comparison of indexing results, so there were no live users on the system at the time.

This is our dev system, which is currently on version 1.10 of the C# driver; it appears to call collection.delete with a null query (as also seen in the log). I'm unsure what would have caused the delete to take so long. I can very easily replicate this portion of it at any time by manually restarting our monitoring system.
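
For reference, the shell equivalent of that driver call (in the Monitoring database, matching the { q: {}, limit: 0 } delete in the log above) is simply:

// Unfiltered remove: deletes every document but, unlike drop(), keeps
// the collection and all of its indexes.
db.RecordSearchIssue.remove({})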

Comment by Sam Kleinman (Inactive) [ 18/Mar/15 ]

Sorry that you've hit this error. Can you provide more information about the environment where this is running? The answers to the following questions may help us debug:

  • How much storage is available to MongoDB in the dbPath directory?
  • What is your data size, both in terms of the data files on disk and as reported by dbStats?
  • What is the size of the page file?
  • Is this system running directly on hardware or in some sort of virtual environment?
  • Have you tested this with the latest stable production release (3.0.1)?
  • Do you have a sense of what the operation (i.e. the delete operation on the second line of your log) was? It looks like you deleted 50k records just prior to the error. Would it be possible to try to recreate this issue, to see whether some aspect of the delete operation triggers the error?