- Type: Bug
- Resolution: Incomplete
- Priority: Critical - P2
- Affects Version/s: 2.0.7
- Component/s: Stability
- Environment: Solaris 10u9 amd64
- Solaris
I have an issue with a Mongo 2.0.7 installation. Currently this DB is used only to cache data for communication between unrelated processes. The data is not sharded or replicated, and there is one instance running on each computer. There is a separate replicated, sharded set on the same set of computers, but we can reproduce the issue only in the standalone instance, not in the sharded set.
From time to time (roughly once every two months), we detect that the local Mongo instance has gone down, interrupting IPC for our application suite and losing some client notifications. Unfortunately we have not yet been able to identify a set of environmental or data conditions that triggers this, and on Solaris we cannot get a SIGSEGV to generate a stack trace we can look at. Instead, we have recompiled the 2.0.7 source and added our own stack-trace routines to the code to try to determine the exact point of failure. I have attached all the stack traces we have to this ticket.
The stack traces are along these lines:
----------------- lwp# 542 / thread# 542 --------------------
 fffffd7ffe56173a read     (39, 27b153a, 1e)
 0000000000c43bd4 _ZN4redi16basic_pstreambufIcSt11char_traitsIcEE11fill_bufferEb () + 164
 0000000000c449e7 _ZN4redi16basic_pstreambufIcSt11char_traitsIcEE9underflowEv () + 27
 0000000000c3d219 _ZNSs12_S_constructISt19istreambuf_iteratorIcSt11char_traitsIcEEEEPcT_S5_RKSaIcESt18input_iterator_tag () + 559
 0000000000e5f570 _ZN5mongo11sunosPstackERSob () + 480
 00000000008cb880 _ZN5mongo10abruptQuitEi () + 3e0
 00000000008cbdf0 _ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv () + 240
 fffffd7ffe55c2e6 __sighndlr () + 6
 fffffd7ffe550bc2 call_user_handler () + 252
 fffffd7ffe550dee sigacthandler (b, fffffd7ff9a07d78, fffffd7ff9a07a10) + ee
 --- called from signal handler with signal 11 (SIGSEGV) ---
 000000000082dec0 _ZN5mongo6Record5touchEb ()
 000000000079233b _ZN5mongo12ClientCursor5yieldEiPNS_6RecordE () + 6b
 000000000079241c _ZN5mongo12ClientCursor14yieldSometimesENS0_11RecordNeedsEPb () + 9c
 000000000086639f _ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEPNS_11RemoveSaverEb () + 5df
 00000000008695a6 _ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEb () + 116
 0000000000810af9 _ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE () + 359
 0000000000811d6b _ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE () + ecb
 0000000000e5cbec _ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE () + ec
 0000000000707b49 _ZN5mongo3pms9threadRunEPNS_13MessagingPortE () + 269
 fffffd7ffe851506 thread_proxy () + 66
 fffffd7ffe55bfbb _thr_setup () + 5b
 fffffd7ffe55c1e0 _lwp_start ()
From what we can determine, the issue originates at line 443 in db/clientcursor.cpp, where an invalid (but non-NULL) pointer is produced:
db/clientcursor.cpp-433-        }
db/clientcursor.cpp-434-        else {
db/clientcursor.cpp-435-            warning() << "don't understand RecordNeeds: " << (int)need << endl;
db/clientcursor.cpp-436-            return 0;
db/clientcursor.cpp-437-        }
db/clientcursor.cpp-438-
db/clientcursor.cpp-439-        DiskLoc l = currLoc();
db/clientcursor.cpp-440-        if ( l.isNull() )
db/clientcursor.cpp-441-            return 0;
db/clientcursor.cpp-442-
db/clientcursor.cpp:443:        Record * rec = l.rec();
db/clientcursor.cpp-444-        if ( rec->likelyInPhysicalMemory() )
db/clientcursor.cpp-445-            return 0;
db/clientcursor.cpp-446-
db/clientcursor.cpp-447-        return rec;
db/clientcursor.cpp-448-    }
db/clientcursor.cpp-449-
db/clientcursor.cpp-450-    bool ClientCursor::yieldSometimes( RecordNeeds need, bool *yielded ) {
db/clientcursor.cpp-451-        if ( yielded ) {
db/clientcursor.cpp-452-            *yielded = false;
db/clientcursor.cpp-453-        }
That pointer is then dereferenced at db/clientcursor.cpp line 512, causing the SIGSEGV:
db/clientcursor.cpp-502-            CurOp * c = cc().curop();
db/clientcursor.cpp-503-            while ( c->parent() )
db/clientcursor.cpp-504-                c = c->parent();
db/clientcursor.cpp-505-            warning() << "ClientCursor::yield can't unlock b/c of recursive lock"
db/clientcursor.cpp-506-                      << " ns: " << ns
db/clientcursor.cpp-507-                      << " top: " << c->info()
db/clientcursor.cpp-508-                      << endl;
db/clientcursor.cpp-509-        }
db/clientcursor.cpp-510-
db/clientcursor.cpp-511-        if ( rec )
db/clientcursor.cpp:512:            rec->touch();
db/clientcursor.cpp-513-
db/clientcursor.cpp-514-        lk.reset(0); // need to release this before dbtempreleasecond
db/clientcursor.cpp-515-    }
db/clientcursor.cpp-516-    }
db/clientcursor.cpp-517-
db/clientcursor.cpp-518-    bool ClientCursor::prepareToYield( YieldData &data ) {
db/clientcursor.cpp-519-        if ( ! _c->supportYields() )
db/clientcursor.cpp-520-            return false;
db/clientcursor.cpp-521-        if ( ! _c->prepareToYield() ) {
db/clientcursor.cpp-522-            return false;
Unfortunately we cannot determine why the pointer becomes invalid, or what conditions are required to trigger this.