[SERVER-14970] Memory leak or strange allocation problem Created: 20/Aug/14  Updated: 20/May/15  Resolved: 20/May/15

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 2.6.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Mircea Danila Dumitrescu Assignee: J Rassi
Resolution: Incomplete Votes: 6
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
Related
Operating System: ALL
Participants:

 Description   

We have a pretty sweet setup that has been running nicely for 2+ years:
3 shards, each a replica set with 2 data-bearing members + 1 arbiter.

All 6 data-bearing replica set members, but not the 3 arbiters, are running on 240 GB machines in AWS (r3.8xlarge).

Each replica set member has 10 disks in RAID 0 on standard EBS, 50 GB each. We are at the limit of the 500 GB (10x50 GB), and we are looking to increase this by using the new disks Amazon is providing - SSD EBS volumes with guaranteed IOPS - still RAID 0, but with 5x200 GB disks.

On shard 1 I have added another replica set member (exact same server spec - 240 GB RAM / 32 CPUs), to increase disk size a bit (as explained above).

Here is the db.stats() output from the main database on shard1:

{
	"db" : "xxxxxxxxx",
	"collections" : 152,
	"objects" : 487562429,
	"avgObjSize" : 272.9380673915709,
	"dataSize" : 133074347104,
	"storageSize" : 149098192336,
	"numExtents" : 1041,
	"indexes" : 288,
	"indexSize" : 63420994896,
	"fileSize" : 238188232704,
	"nsSizeMB" : 16,
	"dataFileVersion" : {
		"major" : 4,
		"minor" : 5
	},
	"extentFreeList" : {
		"num" : 90,
		"totalSize" : 11087396848
	},
	"ok" : 1
}
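
For reference, a quick sanity check on these numbers from the mongo shell (it just re-reads db.stats() and does the arithmetic; no collection names assumed):

var s = db.stats()
s.dataSize / s.objects                            // ≈ 272.94 bytes, matches the reported avgObjSize
s.indexSize / Math.pow(1024, 3)                   // ≈ 59 GiB of indexes that ideally stay resident in RAM
(s.storageSize - s.dataSize) / Math.pow(1024, 3)  // ≈ 15 GiB of allocated-but-unused record space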

What we noticed after the sync:

If we make the new server primary, the website is still semi-responsive, but the heavy pages die. Looking at the database, it starts using 3200% CPU (all 32 CPUs at 100%), and memory usage starts increasing by roughly 1% every 5 seconds or so. I am talking about memory assigned to mongod, not the cached memory in the kernel, which sits nicely around 69 GB.

I noticed very slow 90-second queries scanning through millions of records, although we have indexes for those complex queries, and they work fine on the other replica set member (the previous primary). It seems the indexes are not used. I assumed the indexes were not in memory, so I ran touch on those collections (both index and data) - http://docs.mongodb.org/manual/reference/command/touch/ - to no avail.
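
This is roughly what was run per collection (a sketch; "someCollection" stands in for the real collection names):

db.runCommand({ touch: "someCollection", data: true, index: true })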

I also assumed it might be using so much memory due to complex sorts, but that memory should be freed fairly quickly, so I tried changing the allocator to jemalloc. I am under the impression that things are a bit better, but I am not 100% sure.
In the end, if I do not switch the primary back, the database ends up using all the memory and the kernel OOM killer kills it, as seen here: http://pastebin.com/teMT5eqv
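
Switching the primary back just means stepping down the new member from the mongo shell, roughly like this (a sketch; the 300-second value is arbitrary):

// Run on the new (misbehaving) primary: step down and stay ineligible for 300 seconds
// so the previous primary can win the election again.
rs.stepDown(300)
// Verify who is primary afterwards.
rs.status().members.forEach(function (m) { print(m.name, m.stateStr) })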

Please let me know if any other information might throw some light on things.

Also, I have posted this on the Google group, and someone pointed out that my version (2.6.4) only came out recently. I have since reverted to 2.6.3 and everything is going nicely. I did not try to upgrade all servers to 2.6.4, so this might still be related to different versions running in the same replica set, but I think 2.6.3 and 2.6.4 should be compatible - it was a minor update, not a major one.



 Comments   
Comment by Ramon Fernandez Marina [ 20/May/15 ]

I'm resolving this ticket as Incomplete since we haven't heard back from the original reporter. I realize others were affected as well, so if anyone is still experiencing this or a similar issue after upgrading to 2.6.10, please open a new ticket to avoid confusion.

Thanks,
Ramón.

Comment by Daniel Pasette (Inactive) [ 26/Dec/14 ]

Sorry for the delay in responding on this ticket.
Regarding the problems after upgrading to 2.6.4: it's a little hard to tell from the evidence given by venatir, but it seems like issues with the query planner were causing the wrong index to be chosen. It would probably be possible to pin down the exact issue if you can tell us which query was slow and which indexes exist on that collection.
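
Something along these lines would capture that information (a sketch; "someCollection" and the 100 ms threshold are placeholders):

db.someCollection.getIndexes()                       // indexes defined on the affected collection
db.setProfilingLevel(1, 100)                         // profile operations slower than 100 ms
db.system.profile.find().sort({ ts: -1 }).limit(5)   // most recent slow operations, including the query shape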

Here is a list of all the query issues resolved in 2.6.5 and 2.6.4.

Comment by Dmitry [ 25/Dec/14 ]

I have the same issue!

I tried to add a new (5th) member to one of the MongoDB replica sets. The initial data sync completed, but during the oplog sync stage the master started using 100% CPU and many requests on the backend timed out.
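
For context, the member was added in the usual way from the mongo shell (a sketch; the hostname is a placeholder):

rs.add("mongodb5.example.net:27017")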

Other indications (on the master):

  • flush average is higher than normal
  • lock % is higher
  • read queue > 4K
  • non-mapped virtual memory increased from 15 GB to 18 GB (see the snippet at the end of this comment)
  • connection count increased from 11K to 15K

Configured oplog size on rs4: 51200 MB (36h)
MongoDB 2.6.5
The cluster has 6 replica sets with 4 members each (i2.2xlarge).
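
A non-mapped virtual memory figure like the one above is usually derived from serverStatus (a sketch, assuming MMAPv1 with journaling, where non-mapped is roughly virtual minus mappedWithJournal):

var mem = db.serverStatus().mem
mem.virtual - mem.mappedWithJournal   // non-mapped virtual memory, in MB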

Comment by Ofer Cohen [ 29/Sep/14 ]

Any news about this issue?
This issue is critical in 2.6.4 and is making a lot of environments avoid upgrading to 2.6.4 and stay on 2.6.3.

Comment by Mircea Danila Dumitrescu [ 16/Sep/14 ]

Hi, sorry for the delay ... I will give a more detailed explanation.

We initially ran into the problem described above by noticing that a lot of slow queries were not using indexes as they should.

After we downgraded to 2.6.3 things were fine, but we realised that we had never looked at the config server logs.
It turned out that one of the config servers had been dead for a while. I am not sure if this caused the memory leak, but I had to mention it.

Anyway, in 2.6.3 things are fine now.

We do not want to upgrade the main cluster again, but we built a similar environment - 3 shards, just with primaries this time - all on the same test machine. This test environment is running 2.6.4 and had been running OK. We were almost at the point of blaming the OOM issue on the bad config server when, yesterday, we noticed slow queries again.
It seems that one of the shards is taking a long time to perform a simple query. It appears to be choosing the wrong index, even though explain() returns instantly using the right index, and hinting the correct index also makes the query instant (see the sketch below).
Rebuilding all indexes on that database also solved the problem yesterday, but it came back.
We were using the old 1.4 PHP driver, so we immediately upgraded to 1.5.5, but the problem is still there.
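
For reference, this is roughly how we compared the plans (a sketch; "someCollection", the query, and the index key are placeholders for the real ones):

db.someCollection.find({ status: "active" }).explain()            // reports the right index and returns instantly
db.someCollection.find({ status: "active" }).hint({ status: 1 })  // forcing the index is also instant
db.someCollection.reIndex()                                       // rebuilding indexes helped, but only temporarily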

I am not sure if what I am describing is the same problem as before, but the initial symptoms are the same.

Comment by J Rassi [ 04/Sep/14 ]

alexandrebini, would you be willing to provide the additional information I requested from Mircea above, for your case of the issue? Quoting my requests below:

Could you upload the full mongod log for the host that suffered the OOM condition?

Supposing the root cause of the issue indeed is a memory leak that affects 2.6.4, a verbose mongod log would be helpful for narrowing down the set of operations that could be affected. Would you be willing to re-upgrade this host to 2.6.4 and capture a verbose log (requires adding "-v" to the mongod command-line arguments) while reproducing the issue, to assist in the diagnosis?

~ Jason Rassi

Comment by Alexandre Bini [ 04/Sep/14 ]

Hi,

We had exactly the same problem here. Downgrading to 2.6.3 solved the memory leak problem, but we then had a new problem with high CPU usage. Downgrading to 2.4 solved that.

Thanks

Comment by Ramon Fernandez Marina [ 03/Sep/14 ]

venatir, is this still an issue for you? If the answer is yes, can you please provide the information rassi@10gen.com requested above?

Thanks,
Ramón.

Comment by J Rassi [ 25/Aug/14 ]

Hi Mircea,

We still need additional information from you to diagnose this problem. Do you still have a copy of the mongod log for the host that crashed? And, would you be willing to re-upgrade to 2.6.4 in order to help reproduce this problem?

Thanks.
~ Jason Rassi

Comment by J Rassi [ 20/Aug/14 ]

Hi Mircea,

Could you upload the full mongod log for the host that suffered the OOM condition?

Supposing the root cause of the issue indeed is a memory leak that affects 2.6.4, a verbose mongod log would be helpful for narrowing down the set of operations that could be affected. Would you be willing to re-upgrade this host to 2.6.4 and capture a verbose log (requires adding "-v" to the mongod command-line arguments) while reproducing the issue, to assist in the diagnosis?
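
If restarting with -v is inconvenient, verbosity can also be raised at runtime from the mongo shell, roughly like this (a sketch; level 1 corresponds to a single -v, higher levels are noisier):

// Increase log verbosity without a restart.
db.adminCommand({ setParameter: 1, logLevel: 1 })
// Restore the default afterwards.
db.adminCommand({ setParameter: 1, logLevel: 0 })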

Thanks.

~ Jason Rassi
