[SERVER-8105] SIGSEGV in db/clientcursor.cpp Created: 08/Jan/13 Updated: 08/Mar/13 Resolved: 12/Feb/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability |
| Affects Version/s: | 2.0.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Braam van Heerden | Assignee: | James Wahlin |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Solaris 10u9 amd64 |
||
| Attachments: |
|
| Operating System: | Solaris |
| Steps To Reproduce: | Cannot reproduce in the lab |
| Participants: |
| Description |
|
I have an issue with a Mongo 2.0.7 installation. Currently this DB is used only to cache data for communication between unrelated processes. The data is not sharded or replicated, and there is one instance running on each computer. There is a separate replicated sharded set on the same set of computers, but we cannot replicate the issue in the sharded set, only in the standalone instance. From time to time (usually about once every 2 months), we detect that the local Mongo instance has gone down, interrupting IPC for our application suite and losing some client notifications. We have not yet been able to identify a set of environmental or data conditions that triggers this. We also cannot get a SIGSEGV to generate a stack trace we can look at on Solaris. Instead, we have recompiled the 2.0.7 source and added our own stack trace routines to the code to try to determine the exact point of the failure. I have attached all the stack traces we have to this ticket. The stack traces are along these lines:
From what we can determine, the issue is on line 443 in db/clientcursor.cpp, where an invalid pointer (not NULL) is used:
This then subsequently causes a SIGSEGV in db/clientcursor.cpp line 512:
Unfortunately we cannot determine why the pointer becomes invalid, and what conditions are needed for this. |
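For readers unfamiliar with the approach described above (adding a stack trace routine that runs on SIGSEGV), the following is a minimal sketch, not the routine actually added to the 2.0.7 build. It assumes a platform that provides `backtrace` and `backtrace_symbols_fd` in `<execinfo.h>` (glibc does; on Solaris 10 a different facility such as `printstack` would probably be needed, which is an assumption here). The names `captureFrames`, `onSegv`, and `installSegvHandler` are hypothetical.

```cpp
#include <execinfo.h>   // backtrace, backtrace_symbols_fd (assumed available)
#include <signal.h>
#include <unistd.h>
#include <cstring>

// Hypothetical helper: record up to maxFrames return addresses of the caller.
int captureFrames(void** frames, int maxFrames) {
    return backtrace(frames, maxFrames);
}

// Hypothetical handler: write the raw stack to stderr, then return.
// Note: backtrace() is not guaranteed to be async-signal-safe, so this is
// a best-effort diagnostic, not a robust crash reporter.
extern "C" void onSegv(int /*sig*/) {
    void* frames[64];
    int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    // SA_RESETHAND (set below) has already restored the default action, so
    // returning re-executes the faulting instruction and the process dies
    // with the normal SIGSEGV behaviour (including a core file if enabled).
}

// Install the handler; returns true on success.
bool installSegvHandler() {
    struct sigaction sa;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_handler = onSegv;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESETHAND;  // one-shot: default action after first fault
    return sigaction(SIGSEGV, &sa, nullptr) == 0;
}
```

The addresses printed this way are what would later be fed to a symbolizer; function names only appear if the binary exports the relevant symbols.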
| Comments |
| Comment by James Wahlin [ 12/Feb/13 ] |
|
Hi Braam, At this point it looks like we are stuck without a way to reproduce. I am closing this ticket but please feel free to reopen if you can reproduce or have additional information. Thanks, |
| Comment by Tad Marshall [ 22/Jan/13 ] |
|
The nice thing that addr2line can do is to show you inlined code that usually isn't visible in the stack trace. I don't know if Solaris is different from Linux in this regard, but on Linux addr2line will show me 20 routines in the call stack from a list of 12 addresses that I pass to addr2line. The 8 "extra" routines are inlined functions that are inferred from the debug information by addr2line. The most common case we have seen of segfaults is holding pointers to objects that can be freed by other threads. Your segfault seems to be specifically in code that is deliberately touching a record in a memory-mapped database file in order to page-fault it into RAM if it is not already there. The best explanation I can think of is that somehow the database was closed while being accessed, which is supposed to be prevented by locks. It is possible that there is information in a log file that would show if a file was closed immediately before the segfault, but this is just guessing at a cause. We have not seen reports of crashes like this, so unless we can find a way to reproduce the crash, this may be hard to debug. |
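To make the "touching a record to page-fault it into RAM" idea concrete, here is a minimal sketch of the general technique using POSIX `mmap`. It is not MongoDB's code; `touchPages` and `mapFile` are hypothetical names, and the one-byte-per-page stride is an assumption made for illustration.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Read one byte from every page in [base, base + length) so the kernel
// faults each page into memory. Returns the number of pages touched.
// If the mapping had been removed (for example by munmap) before this runs,
// the very first read would raise SIGSEGV.
std::size_t touchPages(const volatile char* base, std::size_t length) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    std::size_t touched = 0;
    for (std::size_t off = 0; off < length; off += page) {
        char c = base[off];  // volatile read, cannot be optimized away
        (void)c;
        ++touched;
    }
    return touched;
}

// Map the first `length` bytes of a file read-only.
// Returns MAP_FAILED if the file cannot be opened or mapped.
void* mapFile(const char* path, std::size_t length) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return MAP_FAILED;
    void* p = mmap(nullptr, length, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    return p;
}
```

In this model, a pointer into the mapping is only as valid as the mapping itself, which is why unmapping a file while another thread still holds such a pointer would produce a fault on the first access.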
| Comment by Iwan Aucamp [ 22/Jan/13 ] |
We do not have debug symbols, but what we normally do is compare the instructions at the instruction offset in the stack trace to the ones in a binary compiled with debugging. Here, however, it's not really needed. The stack trace shows that the first instruction in _ZN5mongo6Record5touchEb was being executed, as there is no offset. The disassembled content of _ZN5mongo6Record5touchEb is:
Source of Record::touch
Class definition for Record
What it is trying to do at 82dec0 is load the value of lengthWithHeaders into eax so it can be compared with HeaderSize. Given that this generates a segmentation fault, it's fair to assume that rdi contained an invalid address, which originated from the caller (ClientCursor::staticYield line 512), as the caller sets rdi up with the class base pointer (i.e. this in C++). Since the only way to get from _ZN5mongo12ClientCursor5yieldEiPNS_6RecordE to _ZN5mongo6Record5touchEb is through ClientCursor::staticYield, specifically on line 512 (as Braam indicated), we concluded that the address of rec obtained on line 443 in db/clientcursor.cpp was either invalid when issued or became invalid before _ZN5mongo6Record5touchEb was called. Also note that this problem occurred at two of our installations, and more than once at each installation. All occurrences presented the same stack trace. We are in the process of scheduling and planning an upgrade, which is quite an onerous task to be honest. If we have not received any feedback from you regarding this by the time we have completed planning, we will upgrade to 2.2 and continue to monitor. If we do install 2.2 and face a similar issue, we will update this issue with the details. |
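The reasoning above relies on a member read being nothing more than a load from the `this` pointer plus a fixed offset. The sketch below illustrates that with a stand-in struct; `RecordModel` is hypothetical and does not reproduce the layout of MongoDB's Record class, and the helper names are invented for this example.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for illustration only; NOT MongoDB's Record layout.
struct RecordModel {
    int lengthWithHeaders;
    int extentOfs;
    char data[4];
};

// Ordinary member access through the pointer.
int loadLengthViaMember(const RecordModel* rec) {
    return rec->lengthWithHeaders;
}

// The same read spelled out: copy an int from (base address + field offset).
// On x86-64 System V the base address arrives in rdi, so an invalid pointer
// faults on this first load, before any other work in the function.
int loadLengthViaOffset(const RecordModel* rec) {
    const char* base = reinterpret_cast<const char*>(rec);
    int value;
    std::memcpy(&value, base + offsetof(RecordModel, lengthWithHeaders),
                sizeof(value));
    return value;
}
```

Both functions return the same value for a valid pointer. With a non-NULL but unmapped pointer, either one would fault on its first memory access, which matches a stack trace that shows no offset into the function.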
| Comment by James Wahlin [ 18/Jan/13 ] |
|
Hi Braam, We do recommend running with client libraries whose version is >= the database version. We can't rule out a bug in an older client version as the cause of this, but there is nothing in the data we have here that would tell us one way or the other. My suggestion, if feasible, is for you to upgrade both the client and the database to the latest stable 2.2 version of MongoDB to see whether that addresses the issue. If you are unable to do this, then given that you built mongodb using scons (which will contain debug info), you could try running addr2line, which may give a more complete stack trace including inlined functions. Thanks, |
| Comment by Braam van Heerden [ 14/Jan/13 ] |
|
I am in the process of assessing what impact an upgrade will have on the client's operations. However, something else that came to light just now during the investigation is that we are using an older version of the client libraries (1.8.3), and not 2.0.3/2.0.7 to match the two server versions. Could this be the cause of the issue, or shouldn't the client version play much of a role in this case? |
| Comment by Eliot Horowitz (Inactive) [ 10/Jan/13 ] |
|
This doesn't look like anything else we've seen. Can you try 2.2.2? There have been various bug fixes, so at least we can rule some things out. |