[SERVER-38170] mongod stopped due to out of memory Created: 16/Nov/18 Updated: 03/Dec/18 Resolved: 29/Nov/18
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.0.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Laxman P | Assignee: | Danny Hatcher (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Operating System: | ALL |
| Participants: |
| Description |
Our application has been using the MongoDB database for several days without any memory-related issues. We use a sharded cluster with 2 shards; each shard is a replica set with two data-bearing members plus an arbiter. MongoDB version is 4.0.2.

On Nov 15th 4:26 am UTC, mongod crashed due to a memory issue on one of the primaries (VM *02b). We would like to understand the root cause of the issue. We are using WiredTiger with a cacheSize of 2 GB.

Here is the sequence of events:

1. The primary was 02b, and it was the first to fail due to out of memory (the server has 3.75 GB of RAM and mongod is configured with a WiredTiger cache of 2 GB, so some other operation must have been consuming a lot of memory). The last exception was thrown while deviceLog was being purged; the cause is that the mongod process stopped. Some snippets from the 02b mongod log file
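As a side note, a minimal monitoring sketch (not part of the original report) for reading the WiredTiger cache counters from serverStatus, assuming pymongo and a locally reachable mongod; the connection string is a placeholder:

```python
from pymongo import MongoClient

# Assumed connection string; point it at the 02b node in the actual deployment.
client = MongoClient("mongodb://localhost:27017")

status = client.admin.command("serverStatus")
cache = status["wiredTiger"]["cache"]

configured = cache["maximum bytes configured"]       # should reflect the 2 GB cacheSize
used = cache["bytes currently in the cache"]
dirty = cache["tracked dirty bytes in the cache"]

print(f"WT cache used:  {used / configured:.0%} of {configured / 2**30:.1f} GiB")
print(f"WT cache dirty: {dirty / configured:.0%}")
```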
| Comments |
| Comment by Danny Hatcher (Inactive) [ 29/Nov/18 ] |
Hello,

For a record to be deleted it has to be pulled into cache and cannot be evicted until it is persisted to disk. If you are deleting more records, more data has to be pulled into cache and evicted when the process ends, which can produce a waterfall effect if your server was already under the strain of regular load. If you continue to experience issues and are unable to improve your hardware, you may wish to restrict how many deletes happen at any one time.

From your responses, I do not see anything to indicate a bug in the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag. A question like this involving more discussion would be best posted on the mongodb-user group. See also our Technical Support page for additional support resources.

Thank you,
Danny
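A minimal sketch of the "restrict how many deletes happen at any one time" suggestion, assuming pymongo; the `deviceLog` collection name comes from the description, while the database name, `createdAt` field, retention window, and batch sizing are hypothetical placeholders to be tuned for the actual workload:

```python
import time
from datetime import datetime, timedelta

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["deviceLog"]                 # "mydb" is a hypothetical database name

cutoff = datetime.utcnow() - timedelta(days=30)    # hypothetical retention window
BATCH_SIZE = 1000                                  # tune to what the cache and disk can absorb
PAUSE_SECONDS = 1.0

while True:
    # Delete in small batches so only a bounded amount of data is pulled into
    # the WiredTiger cache at a time; pause between batches so eviction and
    # replication can catch up.
    ids = [d["_id"] for d in
           coll.find({"createdAt": {"$lt": cutoff}}, {"_id": 1}).limit(BATCH_SIZE)]
    if not ids:
        break
    coll.delete_many({"_id": {"$in": ids}})
    time.sleep(PAUSE_SECONDS)
```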
| Comment by Laxman P [ 29/Nov/18 ] |
Hi all - Is there any update?
| Comment by Laxman P [ 22/Nov/18 ] |
The purge process runs in a single thread, where we remove old entries from several collections. The purge process runs every day, and maybe on this occasion there were more records than usual (maybe on the order of < 100k entries).
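For reference, a rough sketch of what such a single-threaded daily purge could look like with pymongo; apart from `deviceLog` (mentioned in the description), the database name, collection list, `createdAt` field, and retention window are hypothetical placeholders, not taken from the actual application:

```python
from datetime import datetime, timedelta

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]                                  # hypothetical database name

cutoff = datetime.utcnow() - timedelta(days=30)      # hypothetical retention window

# Remove old entries from several collections, one after another, in a single thread.
for name in ["deviceLog", "collection_b", "collection_c"]:   # placeholder collection list
    result = db[name].delete_many({"createdAt": {"$lt": cutoff}})
    print(f"{name}: removed {result.deleted_count} old documents")
```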
| Comment by Danny Hatcher (Inactive) [ 19/Nov/18 ] |
Hello,

From looking through the available diagnostic data, your replica set is simply overwhelmed by the load the application is sending to it. You mentioned that you were running a "purge process" when your problems started happening. Could you go into some detail as to what that entails? If your application was touching data that is not normally used, it has to draw that data into cache. That is an expensive operation and can cause the cache to fill up. I can see that the cache was struggling to evict old data in a timely manner and was forced to start using application threads to help. Disk utilization and wait times also became very high. All of these are signs that the nodes are under load that they cannot support.

From the application side, you can attempt to increase your write concern. While higher write concerns provide many robustness benefits, they can also cause queries to take longer to return to the application. In your case, this may also be a benefit because it provides a way for the system to self-"throttle" the traffic and may help prevent your nodes from becoming overloaded.

If increasing the write concern is not sufficient to alleviate the problem, the best solution likely lies in allocating better hardware to the servers. I can see that CPU, disk, and RAM are all overloaded during the problem period; upgrading to a better machine will ensure your servers can handle the load.

Thank you,
Danny
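As an illustration of the write-concern suggestion, a minimal pymongo sketch (database and collection names, field names, and timeout are hypothetical): `w="majority"` makes each write wait for acknowledgement from a majority of replica-set members before returning, which naturally slows the producing application down.

```python
from datetime import datetime

from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]                                   # hypothetical database name

# Collection handle whose writes wait for a majority of replica-set members
# (and the on-disk journal) to acknowledge before returning to the caller.
logs = db.get_collection(
    "deviceLog",
    write_concern=WriteConcern(w="majority", j=True, wtimeout=5000),
)

logs.insert_one({"deviceId": "example", "ts": datetime.utcnow()})
```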
| Comment by Laxman P [ 16/Nov/18 ] |
Hi Bruce - Data for the secondary has also been uploaded now. CPU has remained high on the primary for several hours now. Can you provide some input on what "WT Reco.plog.rs" means and what the evict threads are meant to do (so we can try to figure out if there's a way to mitigate it from the application point of view)?
| Comment by Bruce Lucas (Inactive) [ 16/Nov/18 ] |
laxmanpv, can you please upload the diagnostic.data for the secondaries as well? The primary is being impacted by replication lag on the secondary, and we'd like to see the cause of this.
| Comment by Laxman P [ 16/Nov/18 ] |
Another data point: currently on 02b, which is the only member of the replica set still running, top -H shows high CPU on "WT Reco.plog.rs" for several hours, and the evict threads have also been active.
pstack snippet
| Comment by Laxman P [ 16/Nov/18 ] |
Hi Kelsey - I have uploaded the data as you requested. This is from mongo instance "02b".
| Comment by Kelsey Schubert [ 16/Nov/18 ] |
Hi laxmanpv,

Thanks for your report. Would you please upload an archive of the diagnostic.data directory and logs for the nodes in the affected shard's replica set? For your convenience, I've created a secure upload portal that you can use to provide these files. This information will allow us to continue to investigate.

Thanks,
Kelsey
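One possible way to produce the requested archive, as a sketch only: it assumes Python with the standard library and a dbPath of /var/lib/mongodb (adjust to the actual dbPath on each node, and archive the mongod log files separately).

```python
import shutil

# Assumed dbPath; the diagnostic.data directory lives under <dbPath>/diagnostic.data.
DBPATH = "/var/lib/mongodb"

# Creates diagnostic-data-02b.tar.gz containing the diagnostic.data directory.
shutil.make_archive(
    "diagnostic-data-02b", "gztar",
    root_dir=DBPATH, base_dir="diagnostic.data",
)
```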