[SERVER-34377] Possible memory leak
Created: 08/Apr/18  Updated: 24/Apr/18  Resolved: 24/Apr/18

| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.6.3 |
| Fix Version/s: | None |
| Type: | Bug |
| Priority: | Major - P3 |
| Reporter: | Tarvi Pillessaar |
| Assignee: | Donald Anderson |
| Resolution: | Duplicate |
| Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | HW: AWS c4.2xlarge |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Sprint: | Storage Non-NYC 2018-05-07 |
| Participants: | |
| Description |

After upgrading one of our mongodb clusters to 3.6.3, I noticed that mongod appears to be leaking memory over time (it may also have been leaking in 3.4). I haven't restarted this host yet; I can collect more information if needed.

wiredTiger
| Comments |
| Comment by Tarvi Pillessaar [ 24/Apr/18 ] |

Thank you. I will test it once it is included in a 3.6 release.
| Comment by Donald Anderson [ 24/Apr/18 ] |

We believe the cause has been identified as
| Comment by Donald Anderson [ 23/Apr/18 ] |

From a conversation with keith.bostic about this:
| Comment by Donald Anderson [ 23/Apr/18 ] |

alexander.gorrod, I put a conditional print in __txn_next_op, where the realloc happens, to see whether txn->mod_alloc ever gets big. With that, I ran test_truncate01.py with scenario 0, which forces only the 'simple' data set to be used, with no WT indices (I see it with other scenarios too). For these tests, the highest value of txn->mod_alloc I see is 122880. Here's the stack when that happens:

This is running the WT from the MongoDB 3.6.3 drop. txn.i has been modified in this tree, so ignore the line numbers in the first couple of frames. Notable in the stack is this:

So the presumption is that a truncate creates a lot of txn ops, growing the txn->mod array, which effectively won't be freed: 122880 * 100 bytes is roughly 12 MB, and that's for a test with only 1000 entries. Eventually, many sessions can build up allocations of this size.
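To make the retention pattern concrete, here is a minimal, self-contained sketch in plain C (not WiredTiger source; the struct and function names are invented for illustration) of a per-session op array that grows by doubling realloc during a large truncate and is only reset, never shrunk, when the transaction commits:

```c
/*
 * Illustration only -- not WiredTiger code. Sketches the pattern described
 * above: a per-session array of transaction operations grows via realloc as
 * a truncate touches many keys, but ending the transaction only resets the
 * in-use count, so the large allocation stays attached to the session.
 * Error handling is omitted for brevity.
 */
#include <stdio.h>
#include <stdlib.h>

struct txn_op { char payload[100]; };     /* ~100 bytes per op, per the estimate above */

struct session_txn {
    struct txn_op *mod;                   /* per-session op array (like txn->mod) */
    size_t mod_alloc;                     /* allocated slots */
    size_t mod_count;                     /* slots in use */
};

/* Grow-by-doubling, as a generic stand-in for the realloc in __txn_next_op. */
static struct txn_op *next_op(struct session_txn *txn)
{
    if (txn->mod_count == txn->mod_alloc) {
        size_t n = txn->mod_alloc == 0 ? 16 : txn->mod_alloc * 2;
        txn->mod = realloc(txn->mod, n * sizeof(*txn->mod));
        txn->mod_alloc = n;
    }
    return &txn->mod[txn->mod_count++];
}

/* Commit resets the count but keeps the allocation -- the point of the report. */
static void txn_commit(struct session_txn *txn) { txn->mod_count = 0; }

int main(void)
{
    struct session_txn txn = {0};

    /* A truncate over many keys creates one op per key in a single txn. */
    for (int i = 0; i < 100000; i++)
        (void)next_op(&txn);
    txn_commit(&txn);

    /* The slots remain allocated to the session after commit. */
    printf("retained: %zu slots, ~%zu bytes\n",
        txn.mod_alloc, txn.mod_alloc * sizeof(struct txn_op));
    free(txn.mod);
    return 0;
}
```

Under these assumptions a single large truncate leaves the session holding well over 100k slots, and at roughly 100 bytes per op that retained allocation is on the order of the ~12 MB per session estimated above.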
| Comment by Keith Bostic (Inactive) [ 23/Apr/18 ] |

donald.anderson, the path you point out is interesting because it might explain the significance of this being a truncate operation: we do slightly different work underneath the update when it's a truncate. I don't see anything that's not being freed, but it might be worth a second look.
| Comment by Keith Bostic (Inactive) [ 23/Apr/18 ] |

alexander.gorrod, I think donald.anderson raises an interesting point: should we be calling __wt_txn_destroy from __wt_session_release_resources, that is, freeing the snapshot and the list of modifications when resetting a session? The snapshot is relatively large: 8 bytes for each possible session.
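A rough sketch of that suggestion, using the same invented types as the earlier illustration (this is not WiredTiger's actual __wt_session_release_resources or __wt_txn_destroy): when a session is reset and returned to the pool, free the transaction's retained buffers instead of carrying them forward.

```c
/*
 * Illustration only: freeing per-transaction memory when a session is reset.
 * The types and the hook below are invented for the sketch.
 */
#include <stdint.h>
#include <stdlib.h>

struct session_txn {
    void     *mod;             /* retained op array (like txn->mod) */
    size_t    mod_alloc;
    uint64_t *snapshot;        /* ~8 bytes per possible session, as noted above */
    size_t    snapshot_alloc;
};

/* Hypothetical hook run when the session goes back to the idle pool. */
static void session_release_txn_resources(struct session_txn *txn)
{
    free(txn->mod);
    txn->mod = NULL;
    txn->mod_alloc = 0;

    free(txn->snapshot);
    txn->snapshot = NULL;
    txn->snapshot_alloc = 0;
}
```

The obvious trade-off is re-allocating those buffers the next time the session runs a transaction, which is presumably why they are retained across transactions in the first place.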
| Comment by Keith Bostic (Inactive) [ 20/Apr/18 ] |

Looking at the stack, once we get into __wt_row_modify() we're in the same update code that everything else uses; that is, an update done by a truncate is no different from an update done for any other reason. It's hard to imagine that truncate is leaking memory while other updates aren't.
| Comment by Keith Bostic (Inactive) [ 20/Apr/18 ] |

The memory allocated as part of WT_PAGE_ALLOC_AND_SWAP won't be freed until the page is discarded, here. I don't think there's any adjustment when it's freed. It's a relatively large allocation (a slot per key on the page), and so it's not adjusted based on WT_UPDATE_MEMSIZE.
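For illustration only (plain C, not the actual WT_PAGE_ALLOC_AND_SWAP macro or WiredTiger's page structures), the accounting point reads roughly like this: the per-key slot array is charged to the page's footprint once, when it is first allocated, and the array is only released when the page itself is discarded, so freeing individual updates does not shrink it.

```c
/*
 * Illustration only -- invented types, not WiredTiger internals. An array
 * with one update slot per key on the page is allocated lazily, its full
 * size is charged to the page's memory footprint up front, and the array is
 * released only when the page is discarded.
 */
#include <stdlib.h>

struct update;                           /* opaque per-key update chain */

struct page {
    unsigned int    entries;             /* number of keys on the page */
    struct update **mod_updates;         /* one slot per key, lazily allocated */
    size_t          memory_footprint;    /* bytes charged against the cache */
};

/* The first modification of any row allocates the whole per-key slot array. */
static struct update **page_update_slot(struct page *page, unsigned int slot)
{
    if (page->mod_updates == NULL) {
        page->mod_updates = calloc(page->entries, sizeof(struct update *));
        /* Charged once, up front; individual updates don't adjust it later. */
        page->memory_footprint += page->entries * sizeof(struct update *);
    }
    return &page->mod_updates[slot];
}

/* The slot array is freed only when the whole page is discarded. */
static void page_discard(struct page *page)
{
    free(page->mod_updates);
    page->mod_updates = NULL;
}
```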
| Comment by Tarvi Pillessaar [ 13/Apr/18 ] |

Hi, I have uploaded the log and diagnostic.data. At the moment WT reports:

This means more than 3 GB is allocated outside the WT cache. Thank you,
| Comment by Bruce Lucas (Inactive) [ 10/Apr/18 ] |

Hi Tarvi,

It looks like mongod has allocated about 7 GB of data outside the WT cache, which should not happen. To help us investigate, can you please restart mongod with --setParameter heapProfilingEnabled=true. Then, after a couple of days of runtime, please upload

This will record information that should enable us to track down the source of the memory increase. For this purpose it is important that we have complete logs and diagnostic.data covering the time since the restart. Since the required files may be too large to attach to this ticket, I've generated a secure upload portal for you.

This may be a slow, steady leak, so after uploading the initial couple of days of data, please leave the instance running in that state in case we need a longer period to get a clear signal.

Thanks,
| Comment by Tarvi Pillessaar [ 09/Apr/18 ] |

I analyzed one week of the mongodb log; this is what I found:

The first number shows the connection count. Our main application uses nodejs. pymongo 3.4.0 is used by some check scripts and stats-collection tools; these basically connect, run a few queries, and then disconnect.
| Comment by Daniel Pasette (Inactive) [ 08/Apr/18 ] |

Can you say which client driver and version you are using?