[SERVER-1593] Mongos segmentation faults Created: 09/Aug/10 Updated: 29/May/12 Resolved: 26/Aug/10 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 1.6.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Chris Wewerka | Assignee: | Alberto Lerner |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
CentOS 5.3, Java Driver (Version from master 4.8.2010), dynamically linked mongos |
||
| Attachments: |
|
| Operating System: | Linux |
| Participants: |
| Description |
|
The mongos process crashes often in a sharded environment with 2 shard servers, 1 config server, and 80 mongos processes (one for each app server). The segfault on one of the machines: Aug 7 23:40:54 lo24-sv-13 kernel: mongos[30474]: segfault at See http://groups.google.com/group/mongodb-user/browse_thread/thread/d959d31338205398 for further details. |
| Comments |
| Comment by Chris Wewerka [ 06/Sep/10 ] |
|
Sorry Alberto, it was decided to move away from MongoDB, as also described in SERVER-1633: http://jira.mongodb.org/browse/SERVER-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17664#action_17664. |
| Comment by Alberto Lerner [ 26/Aug/10 ] |
|
Chris, I still can't reproduce it here. In the past few days, there were several fixes that may well have addressed what you're seeing. By all means, if you see a crash again, please send us a report and we'd be glad to look at it. I'm putting this on hold in the meantime. |
| Comment by Alberto Lerner [ 23/Aug/10 ] |
|
Hi Chris, how is the situation with the new nightly? If the crashes are still happening, could you send us the logs? We added code to dump stack traces to them if something bad happens. |
| Comment by Alberto Lerner [ 19/Aug/10 ] |
|
Chris, any progress on this? |
| Comment by Alberto Lerner [ 17/Aug/10 ] |
|
The past entries in the JIRA were about a new way to report crashes in mongos. We tested them here and we got detailed stack traces even under very high memory pressure. These changes will make it to tonight's nightly. Could you try it and report? We're hoping the log of a mongos crash will tell us more about the root causes. It would be great if you sent us those. |
| Comment by auto [ 17/Aug/10 ] |
|
Author: {'login': 'alerner', 'name': 'Alberto Lerner', 'email': 'alerner@10gen.com'}Message: |
| Comment by auto [ 16/Aug/10 ] |
|
Author: {'login': 'alerner', 'name': 'Alberto Lerner', 'email': 'alerner@10gen.com'}Message: |
| Comment by Alberto Lerner [ 16/Aug/10 ] |
|
Anything in particular I should look at? |
| Comment by Chris Wewerka [ 13/Aug/10 ] |
|
Nagios checks for mongos crash at about Friday 16:00 |
| Comment by Chris Wewerka [ 13/Aug/10 ] |
|
The problem only occurs under really heavy load. We first tried writes only, going to both MongoDB and MySQL so that we don't lose any data (reads still come from MySQL). I configured our central write "DAO" to use a Java thread executor pool, and when we reduced the pool size from 10 to 1 per app server, the crashes became much less frequent. But this won't work under heavy read load when we move from MySQL reads to Mongo reads, since we can't delay reads with threads: we have to present something to the user fast. I attached our mongos Nagios checks for the last crash, and I see an impossible connection count, which can't be correct because we allow only 150 connections per app server to the mongos process in the MongoOptions object. You can also see that the heap size grows very large. But I'm also not sure whether the value for, e.g., the inserts (taken from the mongo shell stats and parsed by Nagios) is correct, since it's so high. |
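For readers following along, the write throttling described in this comment can be sketched roughly as below. This is a Python analogue of the Java executor pool, not the reporter's code: ThrottledWriter, fake_write and max_concurrent_writes are hypothetical names, and the write itself is a stand-in that only sleeps.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ThrottledWriter:
    """Funnels writes through a fixed-size worker pool (hypothetical helper)."""

    def __init__(self, write_fn, pool_size=1):
        self._write_fn = write_fn
        # pool_size caps how many writes can be in flight at the same time.
        self._pool = ThreadPoolExecutor(max_workers=pool_size)

    def submit(self, document):
        # Returns immediately; the write runs later on a pool thread.
        return self._pool.submit(self._write_fn, document)

    def close(self):
        # Block until all queued writes have finished.
        self._pool.shutdown(wait=True)


def max_concurrent_writes(pool_size, n_writes=20):
    """Run dummy writes and report the peak number executing at once."""
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def fake_write(document):
        # Stand-in for a real database write (assumption for illustration).
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.005)
        with lock:
            state["active"] -= 1

    writer = ThrottledWriter(fake_write, pool_size)
    for i in range(n_writes):
        writer.submit({"_id": i})
    writer.close()
    return state["peak"]
```

With a pool size of 1 the writes are serialized, which lowers the load on mongos at the cost of queueing. That trade-off is why the same approach does not carry over to reads that a user is waiting on.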
| Comment by Alberto Lerner [ 13/Aug/10 ] |
|
Unfortunately, there is not much more information than in the previous cores. Can you tell how long it takes for a mongos to crash? Is it random, or does it consistently crash after some time, a certain number of events, etc.? Are you by any chance monitoring the memory footprint of these processes? When the process crashes, do you see anything unusual in the logs beforehand? Is there anything in particular about the workload that you're using? If you run the workload in a controlled setting, e.g., your laptop/dev machine, do you get a crash too? |
| Comment by Chris Wewerka [ 13/Aug/10 ] |
|
Now we have a crash with the nightly version, which has more debug output. core.19827.gz |
| Comment by Alberto Lerner [ 12/Aug/10 ] |
|
Would it be feasible to try the mongos from the nightly (unstable branch) on one of the app servers instead of 1.6.0? We made two changes that could help us gather more info. One is that we now report the exception whose stack you posted yesterday on the JIRA in more detail. The other is that we changed the stack trace code a bit to increase the chances of getting clean stacks in extreme memory pressure situations. |
| Comment by Eliot Horowitz (Inactive) [ 11/Aug/10 ] |
|
It is printed in verbose mode, and you can also get it from "db.serverStatus()": look at the connections sub-object. |
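As a rough illustration of reading that sub-object, the sketch below works on a plain dict shaped like the serverStatus reply. It assumes the connections sub-object carries "current" and "available" counters; the helper names and the sample numbers are made up for illustration, and the 150 limit is the per-app-server setting mentioned elsewhere in this ticket.

```python
def connection_usage(server_status):
    """Summarize the 'connections' sub-object of a serverStatus reply.

    Assumes the sub-object has 'current' and 'available' counters.
    """
    conns = server_status["connections"]
    current = conns["current"]
    available = conns["available"]
    total = current + available
    return {
        "current": current,
        "available": available,
        # Fraction of the connection budget currently in use.
        "used_fraction": current / total if total else 0.0,
    }


def exceeds_client_limit(server_status, client_limit=150):
    """True if mongos reports more connections than the client pool allows."""
    return server_status["connections"]["current"] > client_limit


# Made-up sample reply; with a driver such as pymongo the dict could come
# from client.admin.command("serverStatus") against the mongos.
sample = {"connections": {"current": 120, "available": 19880}}
usage = connection_usage(sample)
```

Comparing the reported count against the client-side pool limit is one way to check whether a monitoring value, such as the Nagios reading discussed in this ticket, is plausible.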
| Comment by Chris Wewerka [ 11/Aug/10 ] |
|
We also still have 7 app servers with mongos running on them that report "too many open files", although the ulimit is 30000. Would it be possible to add something like Squid's startup output to mongos as well: 2010/08/11 10:32:25| Starting Squid Cache version 2.7.STABLE7 for x86_64-redhat-linux-gnu... |
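A minimal sketch of what such a startup line could report, assuming the request is for the effective file-descriptor limit of the running process. This is a Python illustration using the standard resource module, not mongos code, and describe_fd_limit is a hypothetical name.

```python
import resource


def describe_fd_limit():
    """Return a one-line summary of this process's open-file limits."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    def fmt(value):
        # RLIM_INFINITY means no limit is enforced.
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    return "file descriptor limit: soft=%s hard=%s" % (fmt(soft), fmt(hard))


line = describe_fd_limit()
print(line)
```

The limit a process actually runs with can differ from what ulimit shows in an interactive shell, for example when the process is started from an init script, so logging it from inside the process removes that ambiguity.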
| Comment by Chris Wewerka [ 11/Aug/10 ] |
|
more core dumps from other servers |
| Comment by Chris Wewerka [ 11/Aug/10 ] |
|
more core dumps from servers |
| Comment by Chris Wewerka [ 11/Aug/10 ] |
|
more core dumps from other servers |
| Comment by Chris Wewerka [ 11/Aug/10 ] |
|
More core dumps from other servers |
| Comment by Chris Wewerka [ 11/Aug/10 ] |
|
Core dump file for the above mentioned crash of mongos. Dump 13146 |
| Comment by Chris Wewerka [ 11/Aug/10 ] |
|
Alberto, yes, we have a lot of core dump files. Some of them are really big. One mongos also printed output on the console when crashing:
|
| Comment by Alberto Lerner [ 10/Aug/10 ] |
|
Chris, Any new occurrence of this problem? Alberto. |
| Comment by auto [ 09/Aug/10 ] |
|
Author: {'login': 'alerner', 'name': 'Alberto Lerner', 'email': 'alerner@10gen.com'}Message: |
| Comment by Alberto Lerner [ 09/Aug/10 ] |
|
Chris, from the thread you sent, I'm assuming you're running on 1.6.0. Is there any chance we can get a stack trace or a core dump of the problem in that version? In the meantime, I can look at the stacks you sent for 1.5.7 and check whether that's a problem we have corrected along the way. Thanks, |