- Type: Bug
- Resolution: Incomplete
- Priority: Critical - P2
- Affects Version/s: 2.0.7
- Component/s: Stability
- Environment: Solaris 10u9 amd64
- Solaris
I have an issue with a Mongo 2.0.7 installation. Currently this DB is used only to cache data for communication between unrelated processes. The data is not sharded or replicated, and there is one instance running on each computer. There is a separate replicated, sharded set on the same set of computers, but we can reproduce the issue only in the standalone instance, not in the sharded set.
From time to time (roughly once every two months), we detect that the local Mongo instance has gone down, interrupting IPC for our application suite and losing some client notifications. Unfortunately we have not yet been able to identify a set of environmental or data conditions that triggers this, and on Solaris we cannot get a SIGSEGV to generate a stack trace we can look at. Instead, we have recompiled the 2.0.7 source and added our own stack-trace routines to the code to try to determine the exact point of failure. I have attached all the stack traces we have to this ticket.
The stack traces are along these lines:
----------------- lwp# 542 / thread# 542 --------------------
 fffffd7ffe56173a read     (39, 27b153a, 1e)
 0000000000c43bd4 _ZN4redi16basic_pstreambufIcSt11char_traitsIcEE11fill_bufferEb () + 164
 0000000000c449e7 _ZN4redi16basic_pstreambufIcSt11char_traitsIcEE9underflowEv () + 27
 0000000000c3d219 _ZNSs12_S_constructISt19istreambuf_iteratorIcSt11char_traitsIcEEEEPcT_S5_RKSaIcESt18input_iterator_tag () + 559
 0000000000e5f570 _ZN5mongo11sunosPstackERSob () + 480
 00000000008cb880 _ZN5mongo10abruptQuitEi () + 3e0
 00000000008cbdf0 _ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv () + 240
 fffffd7ffe55c2e6 __sighndlr () + 6
 fffffd7ffe550bc2 call_user_handler () + 252
 fffffd7ffe550dee sigacthandler (b, fffffd7ff9a07d78, fffffd7ff9a07a10) + ee
 --- called from signal handler with signal 11 (SIGSEGV) ---
 000000000082dec0 _ZN5mongo6Record5touchEb ()
 000000000079233b _ZN5mongo12ClientCursor5yieldEiPNS_6RecordE () + 6b
 000000000079241c _ZN5mongo12ClientCursor14yieldSometimesENS0_11RecordNeedsEPb () + 9c
 000000000086639f _ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEPNS_11RemoveSaverEb () + 5df
 00000000008695a6 _ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEb () + 116
 0000000000810af9 _ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE () + 359
 0000000000811d6b _ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE () + ecb
 0000000000e5cbec _ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE () + ec
 0000000000707b49 _ZN5mongo3pms9threadRunEPNS_13MessagingPortE () + 269
 fffffd7ffe851506 thread_proxy () + 66
 fffffd7ffe55bfbb _thr_setup () + 5b
 fffffd7ffe55c1e0 _lwp_start ()
From what we can determine, the issue originates at line 443 in db/clientcursor.cpp, where an invalid (but non-NULL) pointer is produced:
db/clientcursor.cpp-433-        }
db/clientcursor.cpp-434-        else {
db/clientcursor.cpp-435-            warning() << "don't understand RecordNeeds: " << (int)need << endl;
db/clientcursor.cpp-436-            return 0;
db/clientcursor.cpp-437-        }
db/clientcursor.cpp-438-
db/clientcursor.cpp-439-        DiskLoc l = currLoc();
db/clientcursor.cpp-440-        if ( l.isNull() )
db/clientcursor.cpp-441-            return 0;
db/clientcursor.cpp-442-
db/clientcursor.cpp:443:        Record * rec = l.rec();
db/clientcursor.cpp-444-        if ( rec->likelyInPhysicalMemory() )
db/clientcursor.cpp-445-            return 0;
db/clientcursor.cpp-446-
db/clientcursor.cpp-447-        return rec;
db/clientcursor.cpp-448-    }
db/clientcursor.cpp-449-
db/clientcursor.cpp-450-    bool ClientCursor::yieldSometimes( RecordNeeds need, bool *yielded ) {
db/clientcursor.cpp-451-        if ( yielded ) {
db/clientcursor.cpp-452-            *yielded = false;
db/clientcursor.cpp-453-        }
That pointer is then dereferenced at db/clientcursor.cpp line 512, causing the SIGSEGV:
db/clientcursor.cpp-502-            CurOp * c = cc().curop();
db/clientcursor.cpp-503-            while ( c->parent() )
db/clientcursor.cpp-504-                c = c->parent();
db/clientcursor.cpp-505-            warning() << "ClientCursor::yield can't unlock b/c of recursive lock"
db/clientcursor.cpp-506-                      << " ns: " << ns
db/clientcursor.cpp-507-                      << " top: " << c->info()
db/clientcursor.cpp-508-                      << endl;
db/clientcursor.cpp-509-        }
db/clientcursor.cpp-510-
db/clientcursor.cpp-511-        if ( rec )
db/clientcursor.cpp:512:            rec->touch();
db/clientcursor.cpp-513-
db/clientcursor.cpp-514-        lk.reset(0); // need to release this before dbtempreleasecond
db/clientcursor.cpp-515-    }
db/clientcursor.cpp-516-    }
db/clientcursor.cpp-517-
db/clientcursor.cpp-518-    bool ClientCursor::prepareToYield( YieldData &data ) {
db/clientcursor.cpp-519-        if ( ! _c->supportYields() )
db/clientcursor.cpp-520-            return false;
db/clientcursor.cpp-521-        if ( ! _c->prepareToYield() ) {
db/clientcursor.cpp-522-            return false;
Unfortunately we cannot determine why the pointer becomes invalid, or what conditions are required to trigger this.