[SERVER-1593] Mongos segmentation faults Created: 09/Aug/10  Updated: 29/May/12  Resolved: 26/Aug/10

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 1.6.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Chris Wewerka Assignee: Alberto Lerner
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS 5.3, Java Driver (Version from master 4.8.2010), dynamically linked mongos


Attachments: File core.13146.gz     File core.13527.gz     File core.1596.gz     File core.16675.gz     File core.17578.gz     File core.19827.gz     File core.27186.gz     File core.27454.gz     File core.28929.gz     File core.31157.gz     File core.31770.gz     File core.8804.gz     Zip Archive graph_mongos_crash.zip    
Operating System: Linux
Participants:

 Description   

The mongos process crashes frequently in a sharded environment with two shard servers, one config server, and 80 mongos processes (one per appserver).

The segfault on one of the machines:

Aug 7 23:40:54 lo24-sv-13 kernel: mongos[30474]: segfault at
0000000000368d58 rip 00000034a52711d8 rsp 0000000067049d00 error 4

See http://groups.google.com/group/mongodb-user/browse_thread/thread/d959d31338205398 for further details



 Comments   
Comment by Chris Wewerka [ 06/Sep/10 ]

Sorry Alberto, it was decided to move away from MongoDB, as also described in SERVER-1633: http://jira.mongodb.org/browse/SERVER-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17664#action_17664.

Comment by Alberto Lerner [ 26/Aug/10 ]

Chris, I still can't reproduce it here. In the past few days there were several fixes that may well have addressed what you're seeing. By all means, if you see a crash again, please send us a report and we'd be glad to look at it. I'm putting this on hold in the meantime.

Comment by Alberto Lerner [ 23/Aug/10 ]

Hi Chris, how is the situation with the new nightly? If the crashes are still happening, could you send us the logs? We added code to dump stack traces to them if something bad happens.

Comment by Alberto Lerner [ 19/Aug/10 ]

Chris, any progress on this?

Comment by Alberto Lerner [ 17/Aug/10 ]

The past entries in the JIRA were about a new way to report crashes in mongos. We tested it here and got detailed stack traces even under very high memory pressure.

These changes will make it into tonight's nightly. Could you try it and report back? We're hoping the log of a mongos crash will tell us more about the root cause, so it would be great if you sent us those logs.

Comment by auto [ 17/Aug/10 ]

Author:

{'login': 'alerner', 'name': 'Alberto Lerner', 'email': 'alerner@10gen.com'}

Message: SERVER-1593 hook new signal handler to mongos
http://github.com/mongodb/mongo/commit/573e9682e3982d0ae54c3b635bb0b00077cc3cd2

Comment by auto [ 17/Aug/10 ]

Author:

{'login': 'alerner', 'name': 'Alberto Lerner', 'email': 'alerner@10gen.com'}

Message: SERVER-1593 log debugging symbols in the stack trace too
http://github.com/mongodb/mongo/commit/53ba584a157d87970390a927806d1e7e496a3224

Comment by auto [ 17/Aug/10 ]

Author:

{'login': 'alerner', 'name': 'Alberto Lerner', 'email': 'alerner@10gen.com'}

Message: SERVER-1593 SERVER-1593 signal handler should compile under windows too (take 2)
http://github.com/mongodb/mongo/commit/4b8ca5f9bd2ed54cc76aa27293a2a825fad0a8ad

Comment by auto [ 17/Aug/10 ]

Author:

{'login': 'alerner', 'name': 'Alberto Lerner', 'email': 'alerner@10gen.com'}

Message: SERVER-1593 signal handler should compile under windows too
http://github.com/mongodb/mongo/commit/153ae78ee1b920534b148f04bb0a3d97d01a5221

Comment by auto [ 17/Aug/10 ]

Author:

{'login': 'alerner', 'name': 'Alberto Lerner', 'email': 'alerner@10gen.com'}

Message: SERVER-1593 a new signal handler for abrupt exit cases
http://github.com/mongodb/mongo/commit/b07f1dacedd7498216e16a9339764e66c0b1770d

Comment by auto [ 16/Aug/10 ]

Author:

{'login': 'alerner', 'name': 'Alberto Lerner', 'email': 'alerner@10gen.com'}

Message: SERVER-1593 Expose log file descriptor (needed in new segv handler).
http://github.com/mongodb/mongo/commit/b1e95d9bee1b14993d56564648c5d54ed1507d58

Comment by Alberto Lerner [ 16/Aug/10 ]

Anything in particular I should look at?

Comment by Chris Wewerka [ 13/Aug/10 ]

Nagios checks for mongos crash at about Friday 16:00

Comment by Chris Wewerka [ 13/Aug/10 ]

The problem only occurs under really heavy load. We first tried writing to both MongoDB and MySQL so we don't lose any data (reads still come from MySQL).

I configured our central write "DAO" to use a Java thread executor pool, and as we reduced the pool size from 10 to 1 per appserver, the crashes became much rarer. But this won't work under heavy read load once we move reads from MySQL to MongoDB, since we can't defer reads to background threads; we have to present something to the user quickly.
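The bounded-pool workaround described above can be sketched as follows. This is a minimal illustration, not the reporter's actual DAO: the class and method names are hypothetical, and a counter increment stands in for the real MongoDB insert.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedWritePool {
    // Run `writes` dummy write tasks through a fixed-size pool and return
    // how many completed. The pool size caps how many writes can hit the
    // local mongos concurrently from this appserver.
    static int runWrites(int poolSize, int writes) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < writes; i++) {
            pool.execute(completed::incrementAndGet); // stand-in for one MongoDB insert
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Per the comment above: dropping the pool size from 10 to 1
        // per appserver made the crashes much rarer.
        System.out.println("writes completed: " + runWrites(1, 100));
    }
}
```

Because the pool queues excess tasks, writes are delayed rather than dropped, which is why the same trick cannot be applied to latency-sensitive reads.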

I attached our mongos Nagios checks for the last crash, and I see an impossible connection count which can't be correct, because we allow only 150 connections to the mongos process per appserver in the MongoOptions object. You can also see that the heap grows very large. I'm also not sure whether the value for, e.g., inserts (taken from mongo shell stats and parsed by Nagios) is correct, since it's so high.

Comment by Alberto Lerner [ 13/Aug/10 ]

Unfortunately not much more information than the previous cores.

Can you tell how long it takes for a mongos to crash? Is it random, or does it consistently crash after some amount of time, a certain number of events, etc.? Are you by any chance monitoring the memory footprint of these processes?

When the process crashes, do you see anything unusual in the logs beforehand?

Is there anything in particular about the workload you're using? If you run the workload in a controlled setting, e.g. your laptop/dev machine, do you get a crash too?

Comment by Chris Wewerka [ 13/Aug/10 ]

Now we have a crash with the nightly version with more debug output: core.19827.gz

Comment by Alberto Lerner [ 12/Aug/10 ]

Would it be feasible to try the mongos from the nightly (unstable branch) on one of the app servers instead of 1.6.0? We made two changes that could help us gather more info. One is that we report the exception whose stack you posted yesterday on the JIRA in more detail. The other is that we changed the stack trace code slightly to increase the chances of getting clean stacks under extreme memory pressure.

Comment by Eliot Horowitz (Inactive) [ 11/Aug/10 ]

It's printed in verbose mode, and you can also get it from "db.serverStatus()"; look at the connections sub-object.

Comment by Chris Wewerka [ 11/Aug/10 ]

We also still have 7 appservers with mongos running on them which say "too many open files", even though the ulimit is 30000.

Would it be possible to add something like Squid's startup output to mongos as well:

2010/08/11 10:32:25| Starting Squid Cache version 2.7.STABLE7 for x86_64-redhat-linux-gnu...
2010/08/11 10:32:25| Process ID 8386
2010/08/11 10:32:25| With 32768 file descriptors available
2010/08/11 10:32:25| Using epoll for the IO loop
...
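A Squid-style startup line like the one above could be produced by reading the soft "Max open files" limit from /proc/self/limits. This is a minimal, Linux-only sketch with illustrative names, not mongos code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FdLimitLogger {
    // Return the soft "Max open files" limit of the current process,
    // parsed from /proc/self/limits; -1 if the file is unavailable
    // (i.e. not a Linux /proc environment).
    static long maxOpenFiles() throws IOException {
        Path limits = Paths.get("/proc/self/limits");
        if (!Files.exists(limits)) return -1;
        for (String line : Files.readAllLines(limits)) {
            if (line.startsWith("Max open files")) {
                // Columns: "Max open files  <soft>  <hard>  files"
                String[] parts = line.trim().split("\\s+");
                String soft = parts[3];
                return soft.equals("unlimited") ? Long.MAX_VALUE : Long.parseLong(soft);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Mirrors the Squid startup line quoted above.
        System.out.println("With " + maxOpenFiles() + " file descriptors available");
    }
}
```

Logging this once at startup would have made the mismatch between the configured ulimit of 30000 and the limit mongos actually inherited immediately visible.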

Comment by Chris Wewerka [ 11/Aug/10 ]

more core dumps from other servers

Comment by Chris Wewerka [ 11/Aug/10 ]

More core dumps from other servers

Comment by Chris Wewerka [ 11/Aug/10 ]

more core dumps from other servers

Comment by Chris Wewerka [ 11/Aug/10 ]

More core dumps from other servers

Comment by Chris Wewerka [ 11/Aug/10 ]

Core dump file for the above-mentioned mongos crash. Dump 13146

Comment by Chris Wewerka [ 11/Aug/10 ]

Alberto,

yes, we have a lot of core dump files. Some of them are really big.

One mongos had also output on the console when crashing:

    *** glibc detected *** /opt/mongo/bin/mongos: free(): invalid pointer: 0x0000000016b62de8 ***
      ======= Backtrace: =========
      /lib64/libc.so.6[0x351a46f4f4]
      /lib64/libc.so.6(cfree+0x8c)[0x351a472b1c]
      /usr/lib64/libstdc++.so.6(_ZNSs9_M_mutateEmmm+0x1a8)[0x365969d1e8]
      /usr/lib64/libstdc++.so.6(_ZNSs15_M_replace_safeEmmPKcm+0x2c)[0x365969d22c]
      /opt/mongo/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE+0x37c)[0x6507fc]
      /opt/mongo/bin/mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x240)[0x55ca70]
      /opt/mongo/bin/mongos(thread_proxy+0x80)[0x672880]
      /lib64/libpthread.so.0[0x351b8062f7]
      /lib64/libc.so.6(clone+0x6d)[0x351a4ce85d]
      ======= Memory map: ========
      00400000-00871000 r-xp 00000000 68:05 288007 /opt/mongodb-linux-x86_64-debugsymbols-1.6.0/bin/mongos
      00a70000-00a85000 rw-p 00470000 68:05 288007 /opt/mongodb-linux-x86_64-debugsymbols-1.6.0/bin/mongos
        ***** a lot more memory dump *****
          7de64000-7e864000 rw-p 7de64000 00:00 0
          7e864000-7e865000 ---p 7e864000 00:00 0
          7e865000-7f265000 rw-p 7e865000 00:00 0
          7f265000-7f266000 ---p 7f265000 00:00 0
          7f266000-7fc66000 rw-p 7f266000 00:00 0
          351a000000-351a01a000 r-xp 00000000 68:05 1056002 /lib64/ld-2.5.so
          351a219000-351a21a000 r--p 00019000 68:05 1056002 /lib64/ld-2.5.so
          351a21a000-351a21b000 rw-p 0001a000 68:05 1056002 /lib64/ld-2.5.so
          351a400000-351a546000 r-xp 00000000 68:05 1056005 /lib64/libc-2.5.so
          351a546000-351a746000 ---p 00146000 68:05 1056005 /lib64/libc-2.5.so
          351a746000-351a74a000 r--p 00146000 68:05 1056005 /lib64/libc-2.5.so
          351a74a000-351a74b000 rw-p 0014a000 68:05 1056005 /lib64/libc-2.5.so
          351a74b000-351a750000 rw-p 351a74b000 00:00 0
          351b400000-351b482000 r-xp 00000000 68:05 1056022 /lib64/libm-2.5.so
          351b482000-351b681000 ---p 00082000 68:05 1056022 /lib64/libm-2.5.so
          351b681000-351b682000 r--p 00081000 68:05 1056022 /lib64/libm-2.5.so
          351b682000-351b683000 rw-p 00082000 68:05 1056022 /lib64/libm-2.5.so
          351b800000-351b815000 r-xp 00000000 68:05 1056016 /lib64/libpthread-2.5.so
          351b815000-351ba14000 ---p 00015000 68:05 1056016 /lib64/libpthread-2.5.so
          351ba14000-351ba15000 r--p 00014000 68:05 1056016 /lib64/libpthread-2.5.so
          351ba15000-351ba16000 rw-p 00015000 68:05 1056016 /lib64/libpthread-2.5.so
          351ba16000-351ba1a000 rw-p 351ba16000 00:00 0
          3659200000-365920d000 r-xp 00000000 68:05 1056017 /lib64/libgcc_s-4.1.2-20070626.so.1
          365920d000-365940d000 ---p 0000d000 68:05 1056017 /lib64/libgcc_s-4.1.2-20070626.so.1
          365940d000-365940e000 rw-p 0000d000 68:05 1056017 /lib64/libgcc_s-4.1.2-20070626.so.1
          3659600000-36596e6000 r-xp 00000000 68:03 457437 /usr/lib64/libstdc++.so.6.0.8
          36596e6000-36598e5000 ---p 000e6000 68:03 457437 /usr/lib64/libstdc++.so.6.0.8
          36598e5000-36598eb000 r--p 000e5000 68:03 457437 /usr/lib64/libstdc++.so.6.0.8
          36598eb000-36598ee000 rw-p 000eb000 68:03 457437 /usr/lib64/libstdc++.so.6.0.8
          36598ee000-3659900000 rw-p 36598ee000 00:00 0
          2aaaaaaab000-2aaaaaaae000 rw-p 2aaaaaaab000 00:00 0
          2aaaaaab3000-2aaaaaab7000 rw-p 2aaaaaab3000 00:00 0
          2aaaaaab7000-2aaaaaab8000 ---p 2aaaaaab7000 00:00 0
          2aaaaaab8000-2aaaab4b8000 rwxp 2aaaaaab8000 00:00 0
          2aaaab4b8000-2aaaab4b9000 ---p 2aaaab4b8000 00:00 0
          2aaaab4b9000-2aaaabeb9000 rwxp 2Aborted (core dumped)

Comment by Alberto Lerner [ 10/Aug/10 ]

Chris,

Any new occurrence of this problem?

Alberto.

Comment by auto [ 09/Aug/10 ]

Author:

{'login': 'alerner', 'name': 'Alberto Lerner', 'email': 'alerner@10gen.com'}

Message: SERVER-1593 Increase chances of getting a trace under memory shortage
http://github.com/mongodb/mongo/commit/4a0c8053f2ec64b28b48e91d334386582a362b59

Comment by Alberto Lerner [ 09/Aug/10 ]

Chris,

From the thread you sent, I'm assuming you're running 1.6.0. Is there any chance we can get a stack trace or a core dump of the problem in that version?

In the meantime, I can look at the stacks you sent for 1.5.7 and check if that's a problem we have corrected along the way.

Thanks,
Alberto.

Generated at Thu Feb 08 02:57:28 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.