[SERVER-6141] can't successfully replicate our shards anymore. replication isn't using memory efficiently and linux is invoking oom_killer to kill mongod. servers replicated earlier on same config (with smaller data sets) are still working fine... Created: 20/Jun/12 Updated: 09/Jul/16 Resolved: 15/Dec/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.0.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker - P1 |
| Reporter: | John Albietz | Assignee: | Andy Schwerin |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | centos 6 |
| Issue Links: | |
| Operating System: | Linux |
| Participants: | |
| Description |
|
We have an issue with replication that is preventing us from successfully adding any new nodes to the replica sets in our main mongo sharded (4 shards) cluster... the team here is evaluating moving to a different db platform, but I'm hoping there's a patch or workaround that will allow us to continue growing the cluster without needing to boot up new boxes with double the RAM just to replicate successfully.

From a new box (dual-core, 8 GB VM instance on Joyentcloud running centos 6), we installed the latest mongo 2.0.6 and started replication for the shard... the external sort finished, but then mongo eventually crashed. Here's the end of the mongo log from /var/log/mongo/mongod.log. (Note how fast it was going initially and then how slow it got near the end: it took 23 minutes to get to 80% and then roughly 3 hours to get from 80% to 89% before it crashed.)

    Wed Jun 20 02:04:51 [rsSync] external sort used : 113 files in 728 secs
    ntoreturn:1 reslen:44 300000ms
    ntoreturn:1 reslen:44 300000ms
    ntoreturn:1 reslen:44 300000ms
    ntoreturn:1 reslen:44 300001ms
    ntoreturn:1 reslen:44 300000ms
    ntoreturn:1 reslen:44 300000ms
    ntoreturn:1 reslen:44 300000ms
    ntoreturn:1 reslen:44 300000ms
    ntoreturn:1 reslen:44 300000ms
    ntoreturn:1 reslen:44 300024ms
    ntoreturn:1 reslen:44 300001ms
    ntoreturn:1 reslen:44 300001ms
    ntoreturn:1 reslen:44 300001ms
    ntoreturn:1 reslen:44 300001ms
    ntoreturn:1 reslen:44 300001ms
    ntoreturn:1 reslen:44 300001ms
    ntoreturn:1 reslen:44 300001ms
    ntoreturn:1 reslen:44 300001ms
    ntoreturn:1 reslen:44 300001ms
    ntoreturn:1 reslen:44 300001ms
    ntoreturn:1 reslen:44 300001ms
    ntoreturn:1 reslen:44 300000ms
    ntoreturn:1 reslen:44 300001ms
    ntoreturn:1 reslen:44 300001ms
    ntoreturn:1 reslen:44 300000ms
    ntoreturn:1 reslen:44 300001ms
    ntoreturn:1 reslen:44 300001ms
    etc etc...

and 4 hours later it's now up to... 89%:

    Wed Jun 20 05:39:03 [conn23] command admin.$cmd command: { writebacklisten: ObjectId('4fbc4fe47fc06f85f192440a') } ntoreturn:1 reslen:44 300001ms

Checking the /var/log/messages file, we saw that the linux oom_killer was being invoked and killing mongod.

When mongo adds a new member to a replica set, the data seems to transfer over just fine, but then it runs out of memory while it is building its indexes, sort tables, etc. We have other members of the replica set running with 8 GB RAM / 1 GB swap just fine.

I've only experienced oom_killer being invoked once or twice ever, and only when stuff is REALLY bad on the server: always something configured wrong and running out of disk space and RAM. The mongo docs state that mongo isn't supposed to do this, and we aren't booting up small boxes here; this failure is happening on a dual-core machine with 8 GB RAM and a 1 GB swap disk.

Mongo Docs

http://www.mongodb.org/display/DOCS/Production+Notes#ProductionNotes-Swap

"Swap: It is useful for the linux kernel to have swap space to use in emergencies. Because of the way MongoDB memory maps the database files, none of this data will ever end up in swap; this means that on a healthy system the swap space will rarely be used on a system only running MongoDB. Having swap can keep the kernel from killing MongoDB when physical memory limits are reached."

http://www.mongodb.org/display/DOCS/The+Linux+Out+of+Memory+OOM+Killer

"You may also want to look at using something which compresses swap/memory like compcache. MongoDB uses memory mapped files. The entire data is mapped. Over time, if there is no memory pressure, the mongod resident bytes may approach total memory, as the resident bytes includes file system cache bytes for the file pages open and touched by mongod."

And the oom info in the /var/log/messages file:

    Jun 20 08:05:36 c4fabbdf-f958-4e1f-b159-c6ff18df2020 kernel: [286991.177205] mongod invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0 |
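For anyone triaging a similar crash, here is a minimal sketch (not from this report; it assumes root access on a stock CentOS box with default log paths) of confirming that the kernel's oom-killer is what terminated mongod and of checking how much memory and swap the box actually has:

    # Look for oom-killer activity and the record of mongod being killed
    grep -iE "oom-killer|out of memory" /var/log/messages | tail

    # Physical memory and swap currently available on the box
    free -m

    # Resident and virtual size of mongod, if it is still running
    ps -o pid,rss,vsz,cmd -C mongod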
| Comments |
| Comment by Eliot Horowitz (Inactive) [ 15/Dec/12 ] |
|
If this wasn't fixed by the previous change, please let us know. |
| Comment by John Albietz [ 11/Jul/12 ] |
|
Wow that sounds great. Thank you for the update. |
| Comment by Eliot Horowitz (Inactive) [ 11/Jul/12 ] |
|
We committed what we hoped fixes this issue for |
| Comment by Daniel Pasette (Inactive) [ 02/Jul/12 ] |
|
@ashish, could you provide a representative document from the data set which exhibits the problem? Can you provide collections stats, hardware details and index definitions as well? |
| Comment by Ashish Pandey [ 01/Jul/12 ] |
|
We see the same problem: we regularly rebuild our replica nodes to free up disk space (compact does not free up disk space), and we have been able to reproduce the issue consistently. It happens when the total size of an index exceeds available memory (physical + swap) during phase 2 of index building. It appears as if phase 2 of the index build keeps everything in memory until the commit at the end (at which point we start to see the memory usage tail off). We have worked around this by adding more swap (which is only ever used during index build on initial sync), but this is severely limiting and frustrating on a large data set. Also, this is not data-size dependent: we have reproduced it on machines with less memory and a correspondingly smaller data set. |
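As a rough way to check this condition up front, here is a minimal sketch (the host existing-member:27017 and database name mydb are placeholders, not values from this ticket) that compares the total index size reported by db.stats() against physical RAM plus swap on the box about to be resynced:

    # Total RAM + swap on this box, in bytes
    ram_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
    swap_kb=$(awk '/SwapTotal/ {print $2}' /proc/meminfo)
    echo "RAM + swap bytes: $(( (ram_kb + swap_kb) * 1024 ))"

    # Total index size of the database, in bytes, read from an existing member
    mongo --quiet existing-member:27017/mydb --eval 'print("index bytes: " + db.stats().indexSize)'

If the index bytes exceed the RAM-plus-swap figure, the phase-2 behaviour described above would be expected to push the syncing node toward the oom-killer.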
| Comment by John Albietz [ 28/Jun/12 ] |
|
Our _id keys don't get long. They're all BinData. These sound like related issues: https://jira.mongodb.org/browse/SERVER-2731 It sounds like updating the code that builds the indexes will likely eliminate this issue. Any way to go ahead with integrating the "new compact command" into the initial sync logic? |
| Comment by Andy Schwerin [ 25/Jun/12 ] |
|
Likely same problem. |
| Comment by Andy Schwerin [ 25/Jun/12 ] |
|
I don't think it's relevant, but how long do your _id keys get? Also, are they all BinData? |
| Comment by John Albietz [ 25/Jun/12 ] |
|
Andy, Thanks for the update... please let me know if you come up with anything. best, |
| Comment by Andy Schwerin [ 25/Jun/12 ] |
|
FYI, haven't forgotten this. I've needed to do a little research on our btree implementation, to check a hypothesis. |
| Comment by John Albietz [ 21/Jun/12 ] |
|
The index that is stalling is:

    Tue Jun 19 14:43:33 [rsSync] 100311571 objects cloned so far from
    2000000/111540437 1%

Here's the output from one of the other members of the same shard:

    db.identifiers.stats()
    {

Here's an example of the type of data we have in the db. Nothing fancy:

    PRIMARY> db.identifiers.find().limit(1)
    { "date_created" : 1336641792, "_id" : BinData(2,"HgAAADM1MzE2MjM5MzIzMzYyMzkzMTYzMzc2NTM2MzQzMw=="), "t" : "nid", "d" : "1c7e6431c7e643", "date_modified" : 1336641792 }

And here are the indexes:

    ]

Hope this helps... |
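Since the stats output above got truncated, here is a hedged sketch of how the per-index sizes for the identifiers collection could be pulled from a healthy member (the host healthy-member:27017 and database name mydb are placeholders):

    # Print collection and per-index sizes, scaled to megabytes
    mongo --quiet healthy-member:27017/mydb --eval '
      var s = db.identifiers.stats(1024 * 1024);      // scale = MB
      print("documents:      " + s.count);
      print("data size MB:   " + s.size);
      print("total index MB: " + s.totalIndexSize);
      for (var name in s.indexSizes) {
        print("  " + name + ": " + s.indexSizes[name] + " MB");
      }
    '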
| Comment by Andy Schwerin [ 21/Jun/12 ] |
|
I see you've got around 112 million documents in the collection you're indexing. Could you tell me a little more about (1) the schema of the documents in the collection, (2) which index is stalling out (it should be just above the mongo log lines you pasted in). For example, is it an index over a field that sometimes contains arrays? A geo index? I'm hoping to construct a repro in house. |
| Comment by John Albietz [ 21/Jun/12 ] |
|
Andy, thanks for the update. Do you need any more information to help track |
| Comment by Andy Schwerin [ 21/Jun/12 ] |
|
Still tracking down. There's an inappropriate use of virtual address space in the BSONObjExternalSorter ( |
| Comment by John Albietz [ 20/Jun/12 ] |
|
1. On the box a few minutes after starting the replication:

    ~]# iostat -xm
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
    Device:   rrqm/s   wrqm/s   r/s   w/s   rMB/s   wMB/s   avgrq-sz   avgqu-sz   await   svctm   %util

3. Reproducing it now with mongod at log level 2... it starts replication all over again after each failure, so it'll take a little time...

4.  mongo]# du -h

Disabling the oom-killer and increasing the swap disk isn't a tenable solution at this point. Replication will take a few days to finish the way it is set up now... let's fix the root issue. |
| Comment by siddharth.singh@10gen.com [ 20/Jun/12 ] |
|
1. I want to know how big your database is and what the size of your indexes is. db.stats() will give stats on dataSize and index size for a particular db. I don't need the stats from every single database; I just need a rough number that will tell me how big your total database is and what your total index size is. db.collection.stats() gives information about collections. If you could run that on some of your bigger collections, it will give me an idea of the largest index size as well. Again, a ballpark number will do.

2. Output from free -m, iostat -xm 2 and 'cat /proc/sys/vm/overcommit_memory'.

3. Can you please try reproducing the issue with log level 2 on the node that you are trying to resync and attach the logs to this ticket. Log level 2 will print more while extsort runs and will be helpful in seeing exactly what's going on. To run a mongo instance at a higher verbosity level, just pass an extra argument -vv on the command line when you start mongo.

4. Finally, please run du -h on the mongo data directory and post the details as well.

I was suggesting a bigger swap file to get you guys past the blocker state while we continue to work on debugging this issue. You can also try setting the value |
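Put together, a minimal sketch of gathering the four items above on the resync target (the host, database name, config-file path and dbpath are assumptions, not values from this ticket):

    # 1. Rough database and index sizes, read from an existing healthy member
    mongo --quiet healthy-member:27017/mydb --eval 'printjson(db.stats(1024 * 1024))'

    # 2. Memory, I/O and overcommit settings on the box being resynced
    free -m
    iostat -xm 2
    cat /proc/sys/vm/overcommit_memory

    # 3. Restart the resyncing mongod with log level 2 (-vv)
    mongod -vv --config /etc/mongod.conf

    # 4. Disk usage of the mongo data directory
    du -h /var/lib/mongo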
| Comment by John Albietz [ 20/Jun/12 ] |
|
Thanks for the response: 1. OK... please tell me specifically which mongo commands you want to see the output from. A bigger swap file isn't the solution here... even if it works, it'll make replication way too slow. During replication maybe there is a memory leak, or you just aren't freeing up memory? It seems from the logs that there are different stages of replication; it would make sense to free up all allocated memory before moving on to the next stage of replication. |
| Comment by siddharth.singh@10gen.com [ 20/Jun/12 ] |
|
Hi, can you please post the following:

Finally, can you please try with a bigger swap file and report back? It might be helpful while I go over the code and logs in the meantime. Thanks. |
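For completeness, a minimal sketch of the bigger-swap workaround on a CentOS box (the 8 GB size and the /extra-swap path are arbitrary examples, not a recommendation from this ticket):

    # Create and enable a temporary 8 GB swap file
    dd if=/dev/zero of=/extra-swap bs=1M count=8192
    chmod 600 /extra-swap      # swap files must not be readable by other users
    mkswap /extra-swap         # format the file as swap
    swapon /extra-swap         # enable it immediately
    swapon -s                  # confirm it is active

    # Once the initial sync finishes, remove it again:
    # swapoff /extra-swap && rm -f /extra-swap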