Core Server
SERVER-14970

Memory leak or strange allocation problem

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Major - P3
    • Affects Version/s: 2.6.4
    • Component/s: Stability
    • Labels: None

      We have a pretty sweet setup that has been running nicely for 2+ years:
      3 shards, each a replica set with 2 data-bearing members + 1 arbiter.

      All 6 data-bearing members, but not the 3 arbiters, are running on 240 GB machines in AWS (r3.8xlarge).

      Each replica set member has 10 disks in RAID 0 on standard EBS, 50 GB each. We are at the limit of the 500 GB (10x50 GB), and we are looking to increase this by using the new disks Amazon is providing: EBS SSD volumes with provisioned IOPS. Still RAID 0, with 5x200 GB disks.
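      For reference, the two stripe layouts compared (simple arithmetic; RAID 0 capacity is just the sum of member disks, with no redundancy):

```python
# RAID 0 capacity is the sum of the member disks (no redundancy).
current_gb = 10 * 50   # standard EBS layout described above
planned_gb = 5 * 200   # EBS SSD with provisioned IOPS

print(current_gb, planned_gb)  # 500 1000
```

      So the planned layout doubles capacity while halving the number of EBS volumes in the stripe.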

      On shard 1 I have added another replica set member (exact same server: 240 GB RAM / 32 CPUs), to increase disk size a bit (as explained above).

      Here is db.stats() from the main database on shard 1:

      {
      	"db" : "xxxxxxxxx",
      	"collections" : 152,
      	"objects" : 487562429,
      	"avgObjSize" : 272.9380673915709,
      	"dataSize" : 133074347104,
      	"storageSize" : 149098192336,
      	"numExtents" : 1041,
      	"indexes" : 288,
      	"indexSize" : 63420994896,
      	"fileSize" : 238188232704,
      	"nsSizeMB" : 16,
      	"dataFileVersion" : {
      		"major" : 4,
      		"minor" : 5
      	},
      	"extentFreeList" : {
      		"num" : 90,
      		"totalSize" : 11087396848
      	},
      	"ok" : 1
      }
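      As a cross-check on those numbers (values copied from the db.stats() output above; on MMAPv1, dataSize <= storageSize <= fileSize, and avgObjSize is dataSize / objects):

```python
# Values copied from the db.stats() output above.
stats = {
    "objects": 487562429,
    "avgObjSize": 272.9380673915709,
    "dataSize": 133074347104,
    "storageSize": 149098192336,
    "fileSize": 238188232704,
    "indexSize": 63420994896,
}

GB = 1024 ** 3

# avgObjSize is simply dataSize / objects.
assert abs(stats["dataSize"] / stats["objects"] - stats["avgObjSize"]) < 1e-6

# If all data and indexes were resident, they would need about:
hot_gb = (stats["dataSize"] + stats["indexSize"]) / GB
print(f"data+indexes: {hot_gb:.1f} GB")  # ~183 GB, under the 240 GB of RAM
```

      So the full working set (data plus indexes, ~183 GB) fits in the 240 GB of RAM, which is why the runaway growth below looks like a leak rather than normal caching.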
      

      What we noticed after the sync:

      If we make the new server primary, the website is still semi-responsive, but the heavy pages die. Looking at the database, it starts using 3200% CPU (all 32 CPUs at 100%), and memory usage starts increasing by roughly 1% every 5 seconds or so. I am talking about resident memory assigned to mongod, not the kernel's page cache, which sits steadily around 69 GB.
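      To put that growth rate in perspective (a rough estimate, assuming the reported ~1% per 5 seconds stays linear):

```python
# Rough estimate of the reported leak rate: ~1% of a 240 GB box every ~5 s.
ram_gb = 240
pct_per_tick = 0.01   # 1% of RAM per tick
tick_seconds = 5

gb_per_minute = ram_gb * pct_per_tick * (60 / tick_seconds)   # ~28.8 GB/min
minutes_to_exhaust = (1.0 / pct_per_tick) * tick_seconds / 60  # ~8.3 min from zero

print(f"{gb_per_minute:.1f} GB/min, full RAM in ~{minutes_to_exhaust:.1f} min")
```

      At that rate the box runs out of memory in well under ten minutes, which matches the OOM kill described below.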

      I noticed very slow queries (90 seconds) scanning through millions of records, although we have indexes for those complex queries, and they work fine on the other replica set member (the previous primary). It seems the indexes are not used. I assumed the indexes were not in memory, so I ran touch on those collections (both index and data; see http://docs.mongodb.org/manual/reference/command/touch/), to no avail.
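      One way to confirm this is to compare explain() on both members: in the 2.6-era format, "cursor" is "BasicCursor" for a collection scan and "BtreeCursor <indexname>" when an index is used. A minimal sketch of that check (the explain documents here are illustrative, not real output from this cluster):

```python
def used_index(explain_doc):
    """True if a 2.6-style explain() document shows a BtreeCursor (index) was used."""
    return explain_doc.get("cursor", "").startswith("BtreeCursor")

# Illustrative 2.6-style explain() fragments, not real output from this cluster.
old_primary = {"cursor": "BtreeCursor status_1", "n": 50, "nscanned": 50}
new_member = {"cursor": "BasicCursor", "n": 50, "nscanned": 4875624}

print(used_index(old_primary))  # True  -> index used
print(used_index(new_member))   # False -> full collection scan
```

      A large gap between nscanned and n on the new member would also be consistent with the "scanning millions of records" symptom.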

      I also assumed it might be using so much memory due to complex sorts, but that memory should be freed as soon as possible, so I tried changing the allocator to jemalloc. I am under the impression that things are a bit better, but I am not 100% sure.
      In the end, if I do not switch the primary back, the database ends up using all the memory and the kernel kills it, as seen here: http://pastebin.com/teMT5eqv

      Please let me know if any other information might shed some light on this.

      Also, I posted this on the Google group, and someone pointed out that my version (2.6.4) came out only recently. I have since reverted to 2.6.3 and everything is running nicely. I did not try upgrading all servers to 2.6.4, so this might still be related to different versions running in the same replica set, but 2.6.3 and 2.6.4 should be compatible: it was a minor update, not a major one.

            Assignee: J Rassi (rassi)
            Reporter: Mircea Danila Dumitrescu (venatir)
            Votes: 6
            Watchers: 19
