[SERVER-9674] Replication related memory leak, exposed by server side js (was: Authenticated Connections leaking memory / aggregation leak). Created: 13/May/13 Updated: 10/Dec/14 Resolved: 06/Mar/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Security |
| Affects Version/s: | 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Michael Grundy | Assignee: | Michael Grundy |
| Resolution: | Cannot Reproduce | Votes: | 2 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Linux x86_64, replica set, auth turned on |
| Issue Links: | |
| Operating System: | ALL |
| Steps To Reproduce: | Drive a couple thousand connections to a replica set primary. Authenticate. Let the connections close. Once all of the connections are closed you will notice the resident set size and non-mapped memory creeping up. The following python and shell scripts were used to build a connection, authenticate, then repeat:
These were run on 8 AWS guests to drive connections against the Primary:
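(The python and shell scripts referenced above are not preserved in this export. Below is a minimal sketch of the kind of connection-churn driver described, assuming a 2013-era pymongo 2.x where db.authenticate() is available; the host, port, and credentials are placeholders.)

```python
# repro_driver.py -- hypothetical reconstruction of the connection-churn driver
# described above (the original scripts are not preserved in this export).
# Assumes 2013-era pymongo 2.x and a replica set primary with auth enabled.
import time
from pymongo import MongoClient

PRIMARY_HOST = "primary.example.com"   # placeholder
PRIMARY_PORT = 27017
USER, PASSWORD = "admin", "secret"     # placeholders


def churn_once(n_connections=2000):
    """Open n connections, authenticate each, then close them all."""
    clients = []
    for _ in range(n_connections):
        c = MongoClient(PRIMARY_HOST, PRIMARY_PORT)
        c.admin.authenticate(USER, PASSWORD)   # the authenticated connection
        c.admin.command("ping")                # touch the connection
        clients.append(c)
    for c in clients:
        c.close()                              # let the server tear the sockets down


if __name__ == "__main__":
    # A wrapper shell script on each of the 8 AWS guests just re-ran this loop;
    # resident and non-mapped memory on the primary were watched with mongostat
    # between iterations.
    while True:
        churn_once()
        time.sleep(5)
```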
|
| Participants: | |
| Description |
|
Over time something related to replication is causing the resident memory set (apparently tied to non-mapped memory) to grow large enough to cause swapping. After correlating different instances of this issue, we found that the easiest way to reproduce it is to run a workload with server side javascript (map reduce, $where, group(), and eval()). When v8 instances are instantiated, the non-mapped memory will balloon; that in itself isn't the problem, though it is frequently the first thing people notice. After running for a while you can see that the non-mapped memory has increased over time (checking at times when no js was running). I don't think it is an allocation issue around v8, as it doesn't happen without replication. You can see in the mongostat output how the non-mapped memory creeps up after each test run (connections and memory burst up during a test run, then connections drop to 1 [mongostat], but the non-mapped memory increases a little each time):
I pulled out the highlights from a longer run so it's easier to spot. |
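(The mongostat output referenced in the description is not included in this export. For context, here is a minimal sketch of the kind of server-side javascript workload described, a $where query, which is evaluated by the server's js engine; this assumes 2013-era pymongo 2.x, and the host, credentials, and namespace are placeholders.)

```python
# js_workload.py -- hypothetical example of a server-side javascript workload
# of the kind described in this ticket ($where runs inside mongod's embedded
# v8 engine). Host, credentials, and namespace are placeholders.
from pymongo import MongoClient

client = MongoClient("primary.example.com", 27017)
client.admin.authenticate("admin", "secret")
coll = client["test"]["data"]

# Seed a few documents so the $where clause has something to evaluate
# (pymongo 2.x insert()).
for i in range(100):
    coll.insert({"x": i})

# Each $where predicate is compiled and run by the server's js engine, which
# is what makes non-mapped memory balloon while the query is running.
matched = coll.find({"$where": "this.x % 2 == 0"}).count()
print("matched", matched)

client.close()
```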
| Comments |
| Comment by Michael Grundy [ 06/Mar/14 ] |
|
This was opened to track a set of symptoms that were later found to be separate issues: |
| Comment by Andy Schwerin [ 06/Mar/14 ] |
|
michael.grundy@10gen.com, does the resource leak still repro? |
| Comment by Michael Grundy [ 02/Aug/13 ] |
|
I've updated the description and reproduction steps with the most reliable way we've been able to reproduce this. Most of the reporters are using some kind of server side javascript, though at least one isn't. |
| Comment by Matt Vella [ 05/Jun/13 ] |
|
One thing noted in CS-7099 was that the growth in non-mapped memory mostly correlated with the growth in "accesses not in memory" and "page fault exceptions". Are you testing in an environment where data is sometimes read from disk? |
| Comment by Michael Grundy [ 04/Jun/13 ] |
|
The chances of this being related to auth alone are slim. I've run an extensive series of tests, and what I initially thought was a reproduction was actually the case Spencer mentions above: the non-mapped memory set grows as the connection count increases, but then stabilizes. Additionally, I ran a series of multi-day tests that consisted of high connection counts, each running multiple aggregations. I ran variations of the test using multiple shells started from bash, parallel shells launched from js, and multithreaded python programs. After the initial warmup, the memory footprint was consistently stable. |
| Comment by Matt Vella [ 03/Jun/13 ] |
|
Any updates here? This bug is keeping us from using mongo 2.4.x in production. Thanks. |
| Comment by Spencer Brody (Inactive) [ 17/May/13 ] |
|
Here's the python program I used:
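(The program itself is not preserved in this export. Below is a hypothetical sketch based on the behavior Spencer describes in his other 17/May comment: open 5000 connections, authenticate each, run a simple aggregation on each, close them all, and repeat. It assumes 2013-era pymongo 2.x; the host, credentials, and namespace are placeholders.)

```python
# Hypothetical sketch of the test program referenced here; the original is
# not preserved in this export. It follows the description in Spencer's
# other 17/May comment. Assumes 2013-era pymongo 2.x; host, credentials,
# and namespace are placeholders.
from pymongo import MongoClient

HOST, PORT = "primary.example.com", 27017
USER, PASSWORD = "admin", "secret"

while True:
    clients = []
    for _ in range(5000):
        c = MongoClient(HOST, PORT)
        c.admin.authenticate(USER, PASSWORD)
        # A trivial aggregation pipeline, just enough to exercise the agg path.
        c["test"]["data"].aggregate([{"$group": {"_id": None, "n": {"$sum": 1}}}])
        clients.append(c)
    for c in clients:
        c.close()
```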
|
| Comment by Spencer Brody (Inactive) [ 17/May/13 ] |
|
I just tested with a simple python program that creates 5000 connections, authenticates all of them, does a really simple aggregation on each one, then closes them and repeats. I saw some odd behavior in that after the first round of 5000 connections was closed, the res memory jumped from ~200MB to ~1GB. But after that it stabilized at 1.14g and has been stable for about 15 minutes now, so still no luck reproducing the real memory leak. |
| Comment by Michael Grundy [ 17/May/13 ] |
|
Spencer had trouble reproducing this, so we thought that maybe it is the aggregation causing it. I tried reproducing it with a simpler test and didn't see the same leaky behavior. I then tried with an aggregation case I was using when I originally saw the behavior and still couldn't reproduce it. Reflecting on this, it occurs to me that after the initial reproduction of the issue, we were trying to reproduce it with lower connection counts. The original reproduction was done with four machines driving connections and workload to the primary. |
| Comment by Spencer Brody (Inactive) [ 14/May/13 ] |
|
Assigning back to Mike to confirm this is definitely an auth issue and get a consistent repro. |
| Comment by Daniel Pasette (Inactive) [ 13/May/13 ] |
|
Confirmed on both counts. |
| Comment by Andy Schwerin [ 13/May/13 ] |
|
michael.grundy@10gen.com, have you run tests to confirm a replica set is required? Authentication? That is, if this were the test, would you see the memory leak without replication? With the .auth() line removed?
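(The test snippet referenced by "if this were the test" is not preserved in this export. Below is a hypothetical minimal version of the kind of test being asked about, assuming 2013-era pymongo 2.x; the marked line corresponds to the .auth() call the question proposes removing, and the host and credentials are placeholders.)

```python
# Hypothetical reconstruction of the minimal test referenced above
# (the original snippet is not preserved in this export).
from pymongo import MongoClient


def run_once():
    # Point this at a standalone mongod instead of a replica set primary
    # to answer the "without replication" half of the question.
    client = MongoClient("primary.example.com", 27017)
    # This is the .auth() line the comment asks about removing.
    client.admin.authenticate("admin", "secret")
    client.admin.command("ping")
    client.close()


if __name__ == "__main__":
    for _ in range(2000):
        run_once()
```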
|