[SERVER-9674] Replication-related memory leak exposed by server-side JS (was: Authenticated connections leaking memory or aggregation leak). Created: 13/May/13  Updated: 10/Dec/14  Resolved: 06/Mar/14

Status: Closed
Project: Core Server
Component/s: Security
Affects Version/s: 2.4.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Michael Grundy Assignee: Michael Grundy
Resolution: Cannot Reproduce Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux x86_64, replica set, auth turned on


Issue Links:
Related
Operating System: ALL
Steps To Reproduce:

Drive a couple thousand connections to a replica set primary. Authenticate. Let the connections close. Once all the connections are closed, you will notice the resident size and non-mapped memory creeping up.

The following Python and shell scripts were used to build a connection, authenticate, run a $where query, and repeat:

# reader.py
import pymongo

HOST = 'localhost'
PORT = 27011

conn = pymongo.Connection(host=HOST, port=PORT)
db = conn.cranky
for i in range(1):
    # $where forces server-side JavaScript evaluation on the primary
    doc = db.foo.find().where("this.field1 == " + str(int(i * i)) + ";")
    getl = list(doc)
conn.disconnect()

The shell script below was run on 8 AWS guests to drive connections to the primary:

#!/bin/bash
# Launch 200 concurrent reader.py processes, wait for them all to exit,
# pause, then repeat indefinitely.
while true; do
    for i in $(seq 200); do
        python reader.py &
    done
    echo Waiting....
    wait
    echo Collected!
    sleep 20
done

Description

Over time, something related to replication is causing the resident set (apparently tied to non-mapped memory) to grow large enough to cause swapping.

After correlating different instances of this issue, we found that the easiest way to reproduce it is to run a workload that uses server-side JavaScript (map/reduce, $where, group(), and eval()). When v8 instances are instantiated, non-mapped memory balloons; that in itself is not the problem, though it is frequently the first thing people notice. After running for a while, you can see that non-mapped memory has increased over time (checked at times when no JS was running).

I don't think it is an allocation issue around v8, as it doesn't happen without replication.

You can see in the mongostat output below how non-mapped memory creeps up after each test run (connections and memory burst up during a run, then connections drop to 1, i.e. just mongostat, but non-mapped memory increases a little each time):

insert  query update delete getmore command flushes mapped  vsize    res non-mapped faults  locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn repl       time
    *0     *0     *0     *0       0     0|0       0  1.09g  2.72g    33m      1.63g      0 local:0.0%          0       0|0     0|0    12b   576b     1  REC   20:43:13
    *0     *0     *0     *0       0     0|0       0  1.09g  2.72g    33m      1.63g      0 local:0.0%          0       0|0     0|0    12b   576b     1  REC   20:43:18
    *0     *0     *0     *0       0  3828|0       0  1.09g  4.87g   119m      3.77g      0 admin:0.0%          0       0|0     0|0   341k   319k  2124  REC   20:43:38
    *0     *0     *0     *0       0  1136|0       0  1.09g  3.49g   107m       2.4g      0 admin:0.0%          0       0|0     0|0   102k    95k   724  REC   20:43:43
    *0     *0     *0     *0       0  3351|0       0  1.09g  3.32g   110m      2.23g      0 admin:0.0%          0       0|0     1|0   299k   279k   539  REC   20:43:48
    *0     *0     *0     *0       0   772|0       0  1.09g  2.82g   102m      1.73g      0 admin:0.0%          0       0|0     0|0    69k    64k     1  REC   20:44:23
    *0     *0     *0     *0       0     0|0       0  1.09g  2.82g   102m      1.73g      0 local:0.0%          0       0|0     0|0    12b   576b     1  REC   20:44:28
    *0     *0     *0     *0       0     0|0       0  1.09g  2.82g   102m      1.73g      0 local:0.0%          0       0|0     0|0    12b   576b     1  REC   20:44:33
    *0     *0     *0     *0       0  3905|0       0  1.09g  5.22g   144m      4.13g      0 admin:0.0%          0       0|0     0|0   348k   325k  2467  REC   20:44:58
    *0     *0     *0     *0       0   959|0       0  1.09g   3.3g   113m      2.21g      0 admin:0.0%          0       0|0     0|0    86k    80k   520  REC   20:45:03
    *0     *0     *0     *0       0  3903|0       0  1.09g  3.22g   116m      2.13g      0 admin:0.1%          0       0|0     0|0   348k   325k   439  REC   20:45:08
    *0     *0     *0     *0       0  1634|0       0  1.09g  4.34g   136m      3.25g      0 admin:0.0%          0       0|0     2|0   145k   136k  1575  REC   20:45:13
    *0     *0     *0     *0       0  1328|0       0  1.09g   4.3g   124m      3.21g      0 admin:0.0%          0       0|0     1|0   118k   111k  1540  REC   20:45:18
    *0     *0     *0     *0       0     0|0       0  1.09g  2.83g   110m      1.74g      0 local:0.0%          0       0|0     0|0    12b   576b     1  REC   20:45:48
    *0     *0     *0     *0       0     0|0       0  1.09g  2.83g   110m      1.74g      0 local:0.0%          0       0|0     0|0    12b   576b     1  REC   20:45:53
    *0     *0     *0     *0       0  3527|0       0  1.09g  9.36g   286m      8.27g      0 admin:0.0%          0       2|0     0|0   315k   294k  6540  REC   20:46:54
    *0     *0     *0     *0       0  1458|0       0  1.09g  2.92g   207m      1.83g      0 admin:0.0%          0       0|0     0|0   130k   122k     1  REC   20:46:59
    *0     *0     *0     *0       0  4003|0       0  1.09g  9.42g   282m      8.33g      0 admin:0.0%          0       0|0     4|0   357k   333k  6645  REC   20:48:19
    *0     *0     *0     *0       0  3425|0       0  1.09g  9.78g   315m      8.68g      0 admin:0.0%          0       0|0     0|0   306k   286k  7003  REC   20:48:24
    *0     *0     *0     *0       0     0|0       0  1.09g  2.95g   229m      1.85g      0 local:0.0%          0       0|0     0|0    12b   576b     1  REC   20:49:04
    *0     *0     *0     *0       0  3825|0       0  1.09g  9.47g   296m      8.38g      0 admin:0.0%          0       0|0     0|0   341k   319k  6652  REC   20:49:59
    *0     *0     *0     *0       0     0|0       0  1.09g  2.95g   227m      1.85g      0 local:0.0%          0       0|0     0|0    12b   576b     1  REC   20:50:30
    *0     *0     *0     *0       0  3412|0       0  1.09g    13g   358m      11.9g      0 admin:0.0%          0       0|0    10|0   304k   284k 10281  REC   20:53:31
    *0     *0     *0     *0       0   305|0       0  1.09g  2.97g   252m      1.88g      0 admin:0.0%          0       0|0     0|0    27k    26k     1  REC   20:56:01

I pulled out the highlights from a longer run so the trend is easier to spot.
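
For anyone who wants to watch the same figure between runs without mongostat, here is a minimal polling sketch. It assumes pymongo 2.x and a mongod on localhost:27011 (the same settings as reader.py), and it derives non-mapped memory as virtual minus mapped-with-journal memory, which is roughly what the mongostat column reports:

import time
import pymongo

# Poll serverStatus and print resident and (approximate) non-mapped memory.
conn = pymongo.Connection(host='localhost', port=27011)
while True:
    mem = conn.admin.command('serverStatus')['mem']
    mapped = mem.get('mappedWithJournal', mem.get('mapped', 0))
    non_mapped = mem.get('virtual', 0) - mapped
    print "res=%sMB virtual=%sMB non-mapped=%sMB" % (
        mem.get('resident'), mem.get('virtual'), non_mapped)
    time.sleep(5)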



Comments
Comment by Michael Grundy [ 06/Mar/14 ]

This was opened to track a set of symptoms that were later found to be separate issues: SERVER-10596 and SERVER-10554. I think we can safely close this one out now.

Comment by Andy Schwerin [ 06/Mar/14 ]

michael.grundy@10gen.com, does the resource leak still repro?

Comment by Michael Grundy [ 02/Aug/13 ]

I've updated the description and reproduction steps with the most reliable way we've been able to reproduce this. Most of the reporters are using some kind of server side javascript, though at least one isn't.

Comment by Matt Vella [ 05/Jun/13 ]

One thing noted in CS-7099 was that the growth in non-mapped memory mostly correlated with growth in "accesses not in memory" and "page fault exceptions thrown".

Are you testing in an environment where data is read from disk sometimes?
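
A quick way to check those counters is to read them from serverStatus; the sketch below assumes pymongo 2.x, a mongod on localhost:27011, and the 2.4-era field names under recordStats and extra_info:

import pymongo

# Print the page-fault-related counters referenced above.
conn = pymongo.Connection(host='localhost', port=27011)
status = conn.admin.command('serverStatus')
rec = status.get('recordStats', {})
print "accessesNotInMemory:       %s" % rec.get('accessesNotInMemory')
print "pageFaultExceptionsThrown: %s" % rec.get('pageFaultExceptionsThrown')
print "extra_info.page_faults:    %s" % status.get('extra_info', {}).get('page_faults')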

Comment by Michael Grundy [ 04/Jun/13 ]

The chances of this being related to auth alone are slim. I've run an extensive series of tests, and what I initially thought was a reproduction was actually the case Spencer mentions above: non-mapped memory grows as the connection count increases, but then stabilizes.

Additionally, I ran a series of multi day tests that consisted of high connection counts each running multiple aggregations. I ran variations of the test using multiple shells started from bash, parallel shells launched from js, and multithreaded python programs. After the initial warmup, the memory footprint was consistently stable.
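
For illustration, here is a minimal sketch of the multithreaded Python variant. It assumes pymongo 2.x, a primary on localhost:27011 with the admin/password credentials used elsewhere in this ticket, and the cranky.foo collection from reader.py; the thread and iteration counts are placeholders, not the values used in the actual tests:

import threading
import pymongo

def worker():
    # Each thread opens its own authenticated connection and runs a few
    # aggregations before disconnecting.
    conn = pymongo.Connection(host='localhost', port=27011)
    conn.admin.authenticate('admin', 'password')
    for _ in range(10):
        conn.cranky.foo.aggregate([
            {'$group': {'_id': '$field1', 'n': {'$sum': 1}}}])
    conn.disconnect()

threads = [threading.Thread(target=worker) for _ in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()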

Comment by Matt Vella [ 03/Jun/13 ]

Any updates here? This bug is keeping us from using mongo 2.4.x in production. Thanks.

Comment by Spencer Brody (Inactive) [ 17/May/13 ]

Here's the python program I used:

#!/usr/bin/python
# Open 5000 authenticated connections, run a simple aggregation on each,
# then close them all and repeat.
import pymongo
import time

if __name__ == "__main__":
    while True:
        print "starting next round"
        conns = []
        for i in range(5000):
            conn = pymongo.Connection(port=11111)
            conn.admin.authenticate('admin', 'password')
            conn.test.cities.aggregate([
                {'$group': {'_id': '$stateId', 'totalPop': {'$sum': '$pop'}}},
                {'$match': {'totalPop': {'$gte': 50000}}}])
            conns.append(conn)

        print "created 5000 connections, closing them now"
        for conn in conns:
            conn.close()
        conns = []
        time.sleep(5)

Comment by Spencer Brody (Inactive) [ 17/May/13 ]

I just tested with a simple Python program that creates 5000 connections, authenticates all of them, does a very simple aggregation on each one, then closes them and repeats.

I saw slightly odd behavior: after the first round of 5000 connections was closed, res memory jumped from ~200MB to ~1GB. But after that it stabilized at 1.14g and has been stable for about 15 minutes now, so still no luck reproducing the real memory leak.

Comment by Michael Grundy [ 17/May/13 ]

Spencer had trouble reproducing this, so we thought that aggregation might be causing it. I tried reproducing it with a simpler test and didn't see the same leaky behavior. I then tried the aggregation case I was using when I originally saw the behavior and still couldn't reproduce it. On reflection, it occurs to me that after the initial reproduction of the issue, we were trying to reproduce it with much lower connection counts.

The original reproduction was done with four machines driving connections and workload to the primary.

Comment by Spencer Brody (Inactive) [ 14/May/13 ]

Assigning back to Mike to confirm this is definitely an auth issue and get a consistent repro.

Comment by Daniel Pasette (Inactive) [ 13/May/13 ]

Confirmed on both counts.

Comment by Andy Schwerin [ 13/May/13 ]

michael.grundy@10gen.com, have you run tests to confirm that a replica set is required? That authentication is?

That is, if this were the test, would you see the memory leak without replication? With the .auth() line removed?

for (var i = 1; i < 40; i++) {
    var repl2Conn = new Mongo("mongod1.realm99.10gen.me");
    var dba = repl2Conn.getDB('admin');
    dba.auth("dbAdmin", "password");
    var testDB = repl2Conn.getDB('aggtest');
    testDB.foo.find({_id: 1}).itcount();  // iterate so the query actually executes
}
