[SERVER-6785] Possible mongos memory leak Created: 16/Aug/12  Updated: 11/Jul/16  Resolved: 21/Aug/12

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.2.0-rc1
Fix Version/s: 2.2.0-rc2, 2.3.0

Type: Bug Priority: Major - P3
Reporter: Jonathan Schneider Assignee: Ben Becker
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS 6.2


Attachments: investigation-mongos-with-google-profiler.tgz, momngos-t1.pdf, momngos-t2.pdf, momngos-t3.pdf, mongos.tgz
Issue Links:
Duplicate
duplicates SERVER-6354 mongos all memory Closed
Operating System: Linux
Participants:

 Description   

I set up a 2-shard cluster with auth. Under a continuous query load of around 1000/sec, the mongos process keeps chewing up memory until the OS kills it:

Out of memory: Kill process 2567 (mongos) score 950 or sacrifice child
Killed process 2567, UID 502, (mongos) total-vm:8987492kB, anon-rss:7738860kB, file-rss:16kB

In this case I had upgraded the test server to 8GB of RAM to confirm the problem would still occur; with 8GB it took right around an hour to exhaust all memory and swap space before the OS killed the process.
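
For reference, a rough sketch of the kind of continuous query load described, driven from the command line through the mongos router; the host, database, collection, and query shape are hypothetical:

mongo mongos-host:27017/test --eval 'while (true) { db.testdata.findOne({ _id: Math.floor(Math.random() * 100000) }); }'

A few such shell loops running in parallel approximate the ~1000 queries/sec reported above.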



 Comments   
Comment by Azat Khuzhin [ 22/Aug/12 ]

I have about the same results.
Thanks.

Comment by Jonathan Schneider [ 22/Aug/12 ]

I compiled the nightly with the changes and initial testing looks good. I cranked my test up to 2k queries/sec and am unable to reproduce the memory leak; mongos remained at about 1% memory usage on an 8GB test server. Thanks, guys.

Comment by Azat Khuzhin [ 21/Aug/12 ]

I've compiled new binaries with Eliot's changes and upgraded mongos on the server.
We'll see...

Comment by auto [ 21/Aug/12 ]

Author: Eliot Horowitz <eliot@10gen.com>, 2012-08-21T15:28:40-07:00

Message: SERVER-6785 fix leaking parallel cursor on unsharded reads
Branch: v2.2
https://github.com/mongodb/mongo/commit/089f96e956634cf922fd3dc99681d7f56318502a

Comment by auto [ 21/Aug/12 ]

Author: Eliot Horowitz <eliot@10gen.com>, 2012-08-21T15:28:40-07:00

Message: SERVER-6785 fix leaking parallel cursor on unsharded reads
Branch: master
https://github.com/mongodb/mongo/commit/987f61202a6672867f96f87dffc560d19efef0de

Comment by Greg Studer [ 21/Aug/12 ]

We've tracked one issue down: it's related to querying non-sharded collections via mongos. There's no build with the fix yet (coming soon), but if this is the only problem you're experiencing, you should see stable mongos memory usage if you shard all collections (for example, on _id) before repeatedly querying them in the tests, as sketched below.
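
A minimal sketch of that workaround, run from the command line against the mongos; the database and collection names are hypothetical, and the command would be repeated for each collection the tests query:

mongo mongos-host:27017 --eval 'sh.enableSharding("test"); sh.shardCollection("test.testdata", { _id: 1 })'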

Comment by Jonathan Schneider [ 21/Aug/12 ]

Glad the problem has been observed. Let us know how it goes; I'd love to test a new build that might correct this issue.

Comment by Ben Becker [ 21/Aug/12 ]

Thanks Azat,

I was able to successfully generate heap profiles based on this binary. The issue indeed appears to be related to memory ownership between the ShardCursor (DBClientCursor) and the ParallelSortClusteredCursor. I've attached the heap profiler results in PDF format.

Comment by Azat Khuzhin [ 21/Aug/12 ]

Hi Ben,

Here it is

Comment by Ben Becker [ 21/Aug/12 ]

Hi Azat,

Could you also supply the mongos binary used to generate the heap profiler output?

Thanks,
Ben

Comment by Azat Khuzhin [ 21/Aug/12 ]

You are welcome.
Last commit from "github.com/mongodb/mongo" is "9e1a79dc4a2cc632d45e6af67f574c41aeff54ac".

Comment by Ben Becker [ 20/Aug/12 ]

Hi Azat,

Thank you for posting the results. I'm analyzing now. Could you post the git hash of the mongos instance you ran with?

Thanks,
-Ben

Comment by Azat Khuzhin [ 20/Aug/12 ]

I've attached my results.

Comment by Azat Khuzhin [ 20/Aug/12 ]

I've already handled it myself.

Comment by Azat Khuzhin [ 20/Aug/12 ]

Ben, I could try the gperftools heap profiler if you tell me how.
I compiled mongos with '--heapcheck', but the file I set via the HEAPPROFILE environment variable is not created after mongos is shut down.
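
For reference, a sketch of how the gperftools heap profiler is typically driven, assuming the mongos binary is linked against libtcmalloc (HEAPPROFILE has no effect otherwise); the paths and --configdb hosts are hypothetical:

# start mongos with heap profiling enabled; gperftools writes /tmp/mongos.hprof.NNNN.heap files as allocations accumulate
HEAPPROFILE=/tmp/mongos.hprof ./mongos --configdb cfg1,cfg2,cfg3 --port 27017 --logpath /var/log/mongos.log

# render a dumped profile to PDF with the gperftools pprof tool
pprof --pdf ./mongos /tmp/mongos.hprof.0001.heap > mongos-heap.pdf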

Comment by Jonathan Schneider [ 17/Aug/12 ]

Yes, I wiped the logfile before starting the test. I will see how much I can compress it...

If it's not done before I have to leave today, I will get it over on Monday. I doubt it will get under 150MB, so if you could email me the SCP credentials, that would be great. The email on my Jira account here will work fine.

Comment by Ben Becker [ 17/Aug/12 ]

Hi Jonathan,

Is that 8GB from just the last run? Could you try compressing it and attaching it to this ticket (the limit is 150MB)? Otherwise I will send you an SCP address for uploading the files.

Thanks!
-Ben

Comment by Jonathan Schneider [ 17/Aug/12 ]

Reproduced.
top output right before it was killed:

top - 09:02:42 up  1:10,  2 users,  load average: 0.12, 0.15, 0.07
Tasks: 132 total,   2 running, 130 sleeping,   0 stopped,   0 zombie
Cpu(s): 17.3%us, 10.6%sy,  0.0%ni, 62.0%id,  0.5%wa,  0.0%hi,  9.6%si,  0.0%st
Mem:   8062108k total,  7937760k used,   124348k free,      124k buffers
Swap:  1048568k total,  1048564k used,        4k free,    13712k cached
 
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2815 mongodb   20   0 8685m 7.3g 2384 S 89.4 95.0  43:26.36 mongos

mongostat output right before it was killed:

insert  query update delete getmore command  vsize    res faults  netIn netOut  conn repl       time
     0    944    217      0       0     159  8.46g  7.29g      0   168k   199k   287  RTR   09:02:36
     0    900    208      0       1     162  8.47g  7.29g      5   163k     2m   287  RTR   09:02:37
     0   1016    206      1       1     157  8.47g  7.29g      0   172k   269k   287  RTR   09:02:38
     0    454    117      0       2     106  8.47g  7.29g      0    99k   236k   287  RTR   09:02:39
     0   1028    223      0       2     161  8.48g   7.3g      0   177k   349k   287  RTR   09:02:40
     0   1143    194      0       0     147  8.48g   7.3g      0   175k   228k   287  RTR   09:02:41
     0    815    169      0       0     119  8.48g   7.3g      1   134k   164k   287  RTR   09:02:42
     0    990    197      0       1     152  8.48g  7.31g      0   156k   257k   287  RTR   09:02:43
     0   1077    206      0       2     165  8.49g  7.31g     30   193k   355k   287  RTR   09:02:45
     0    988    190      1       1     139  8.49g  7.31g     24   163k   278k   287  RTR   09:02:46
insert  query update delete getmore command  vsize    res faults  netIn netOut  conn repl       time
     0    894    170      0       0     138  8.49g  7.32g     48   147k   189k   287  RTR   09:02:47
     0    967    180      0       1     127   8.5g  7.32g     58   148k   267k   287  RTR   09:02:48
     0    922    195      0       2     154   8.5g  7.32g     37   156k   321k   287  RTR   09:02:49
     0    924    192      0       1     158   8.5g  7.32g     46   155k   280k   287  RTR   09:02:50
     0    658    115      0       2     105   8.5g  7.33g     12   113k   360k   287  RTR   09:02:51
Fri Aug 17 09:02:52 Socket recv() errno:104 Connection reset by peer 127.0.0.1:27017
Fri Aug 17 09:02:52 SocketException: remote: 127.0.0.1:27017 error: 9001 socket exception [1] server [127.0.0.1:27017]
Fri Aug 17 09:02:52 DBClientCursor::init call() failed

The log is 8GB; how do you want it?

Comment by Jonathan Schneider [ 17/Aug/12 ]

I have restarted the service with the --vvvvv option and unleashed the load; it will take an hour or so to kill it.

Since the logfile is extremely verbose, is there anything in particular I should be looking for, or do you want a copy of the whole thing after it gets killed?

I am only here a half day today but could try the memory profiling on Monday.

Comment by Ben Becker [ 17/Aug/12 ]

Hi Jonathan,

If this is repeatable, would you be able to supply a verbose log from mongos by running with '--vvvvv'?

Also, would it be possible to try a debug build of mongos and provide results from the google-perftools heap profiler? If so, I will provide the binary and instructions for collecting heap profile data.
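
For reference, the verbose run might look like this (the config server hosts and log path are hypothetical; five v's is the maximum verbosity level):

mongos --configdb cfg1.example.net,cfg2.example.net,cfg3.example.net --logpath /var/log/mongos.log --vvvvv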

Many Thanks!
-Ben
