[SERVER-3339] mongod process hangs after USR1 signal Created: 26/Jun/11  Updated: 24/Jul/12  Resolved: 24/Jul/12

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 1.8.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Alexander Azarov Assignee: Mathias Stearn
Resolution: Duplicate Votes: 0
Labels: crash, logging
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Debian "squeeze" with "mongodb" package version 1.8.1-1

  $ uname -a
  Linux r4 2.6.38-2-amd64 #1 SMP Sun May 8 13:51:57 UTC 2011 x86_64 GNU/Linux
  $ dpkg -l | grep mongod
  ii mongodb 1:1.8.1-1 An object/document-oriented database (metapackage)

Issue Links:
Duplicate
duplicates SERVER-4739 Race condition in log rotation (was: ... Closed
Operating System: Linux
Participants:

 Description   

I do log rotation once per day at night. The cron job is as follows:

  $ cat /etc/cron.d/mongodb
  05 0 * * * root [ -f /var/run/mongodb.pid ] && /bin/kill -USR1 `cat /var/run/mongodb.pid`

Approximately once per week the mongod process hangs after the log rotation:

  • it renames the log files,
  • the new mongodb.log file is empty,
  • it is impossible to connect to the server (although it still prints the version header),
  • the only way to stop the process is to send SIGKILL.


 Comments   
Comment by Mathias Stearn [ 21/Jan/12 ]

Please follow SERVER-4739 for the solution to this issue.

Comment by Leonid Evdokimov [ 28/Dec/11 ]

Yes, I've done some stress testing and confirmed the bug (I got a deadlock).

I've exhausted all the connection threads because of a deadlock in the non-reentrant libc (every thread eventually enters malloc() and blocks there):

Thread 809 (Thread 34009080 (LWP 100433/mongod)):
#0 0x0000000033b76c4c in nanosleep () from /lib/libc.so.7
#1 0x000000003399122a in pthread_mutex_getprioceiling () from /lib/libthr.so.3
#2 0x0000000033b150ad in _malloc_prefork () from /lib/libc.so.7
#3 0x0000000033b1859e in malloc () from /lib/libc.so.7
#4 0x0000000033064ffd in operator new () from /usr/lib/libstdc++.so.6
#5 0x0000000033016821 in std::string::_Rep::_S_create () from /usr/lib/libstdc++.so.6
#6 0x00000000330175a5 in std::string::_S_copy_chars () from /usr/lib/libstdc++.so.6
#7 0x00000000330176c2 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string () from /usr/lib/libstdc++.so.6
#8 0x000000000056d1ff in mongo::LoggingManager::rotate ()
#9 <signal handler called>
#10 0x0000000033b1030d in _pthread_mutex_init_calloc_cb () from /lib/libc.so.7
#11 0x0000000033b155d5 in _malloc_prefork () from /lib/libc.so.7
#12 0x0000000033b16b83 in _malloc_thread_cleanup () from /lib/libc.so.7
#13 0x00000000339956aa in pthread_exit () from /lib/libthr.so.3
#14 0x000000003398c4d9 in pthread_getprio () from /lib/libthr.so.3
#15 0x00007ffffa3d4000 in ?? ()

So, the problem is really still there in 2.0.2.
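
To make the hazard concrete, here is a minimal sketch (illustrative only, not mongod's code) of the deadlock class shown in the trace above: a handler that allocates can block forever when the signal interrupts a thread that already holds the allocator lock.

    // Sketch of the malloc-in-signal-handler deadlock; not MongoDB code.
    #include <csignal>
    #include <cstdlib>
    #include <string>
    #include <unistd.h>

    static void onUsr1(int) {
        // UNSAFE: std::string construction calls operator new -> malloc,
        // mirroring frames #3-#8 of the trace above. If the interrupted
        // thread already holds the allocator lock, this blocks forever.
        std::string banner("rotating log\n");
        write(STDERR_FILENO, banner.data(), banner.size());
    }

    int main() {
        std::signal(SIGUSR1, onUsr1);
        for (;;) {
            // Constant allocation makes it likely that a SIGUSR1 will
            // eventually land while the allocator lock is held.
            std::free(std::malloc(64));
        }
    }

Repeatedly sending this process SIGUSR1 (e.g. kill -USR1 in a shell loop) can eventually wedge it, though it may take many iterations — consistent with the roughly once-per-week frequency reported above.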

Comment by Leonid Evdokimov [ 26/Dec/11 ]

It's probably not fixed yet.
I've experienced almost the same issue (a crash instead of a deadlock) in 1.8.1, and I would like to attach a backtrace and some comments. I know that 1.8.1 is not supported anymore, but the `master' branch seems to use the same log-rotation code at the moment.

Here is a backtrace for one of the mongod threads:

Thread 3 (Thread 802008980 (LWP 104574/mongod)):
#0  0x00000008019983bc in __error () from /lib/libthr.so.3
#1  0x00007ffff87c47b3 in ?? ()
#2  0x000000080200d698 in ?? ()
#3  0x00007ffff87c4670 in ?? ()
#4  0x00007ffff87c47b0 in ?? ()
#5  0x00007ffff87c48dc in ?? ()
#6  0x000000080200f178 in ?? ()
#7  0x0000000807c8ed80 in ?? ()
#8  0x00007ffff87c4db8 in ?? ()
#9  0x0000000800d34958 in VTT for std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> > () from /usr/lib/libstdc++.so.6
#10 0x00007ffff87c4d60 in ?? ()
#11 <signal handler called>
#12 0x0000000801af410c in thr_kill () from /lib/libc.so.7
#13 0x0000000801b9083b in abort () from /lib/libc.so.7
#14 0x0000000801b79c15 in __assert () from /lib/libc.so.7
#15 0x00000000007523e4 in mongo::dbexit (rc=mongo::EXIT_ABRUPT, why=0x8dba9d "", tryToGetLock=Variable "tryToGetLock" is not available.
) at mutex.hpp:137
#16 0x00000000008adac4 in mongo::abruptQuit (x=Variable "x" is not available.
) at db/db.cpp:1116
#17 <signal handler called>
#18 0x0000000801af410c in thr_kill () from /lib/libc.so.7
#19 0x0000000801b9083b in abort () from /lib/libc.so.7
#20 0x0000000801b79c15 in __assert () from /lib/libc.so.7
#21 0x00000000007523e4 in mongo::dbexit (rc=mongo::EXIT_ABRUPT, why=0x8dba9d "", tryToGetLock=Variable "tryToGetLock" is not available.
) at mutex.hpp:137
#22 0x00000000008adac4 in mongo::abruptQuit (x=Variable "x" is not available.
) at db/db.cpp:1116
#23 <signal handler called>
....
#31859 <signal handler called>
#31860 0x0000000801af410c in thr_kill () from /lib/libc.so.7
#31861 0x0000000801b9083b in abort () from /lib/libc.so.7
#31862 0x0000000801b79c15 in __assert () from /lib/libc.so.7
#31863 0x00000000004cc6e1 in mongo::Logstream::flush (this=0x807c7c680, t=0x0) at mutex.hpp:137
#31864 0x00000000004cafee in mongo::Logstream::operator<< (this=0x807c7c680, _endl=Variable "_endl" is not available.
) at log.h:230
#31865 0x000000000075e609 in ~Client (this=0x807c8ed80) at db/client.cpp:71
#31866 0x0000000000762881 in boost::thread_specific_ptr<mongo::Client>::delete_data::operator() (this=Variable "this" is not available.
) at tss.hpp:42
#31867 0x0000000800e77b89 in boost::detail::set_tss_data () from /usr/local/lib/libboost_thread.so
#31868 0x000000000075b152 in __tcf_8 () at tss.hpp:79
#31869 0x0000000801b72906 in __cxa_finalize () from /lib/libc.so.7
#31870 0x0000000801b1d037 in exit () from /lib/libc.so.7
#31871 0x0000000000751f3f in mongo::dbexit (rc=mongo::EXIT_ABRUPT, why=0x8dba9d "", tryToGetLock=Variable "tryToGetLock" is not available.
) at db/instance.cpp:834
#31872 0x00000000008adac4 in mongo::abruptQuit (x=Variable "x" is not available.
) at db/db.cpp:1116
#31873 <signal handler called>
#31874 0x0000000801af410c in thr_kill () from /lib/libc.so.7
#31875 0x0000000801b9083b in abort () from /lib/libc.so.7
#31876 0x0000000801b79c15 in __assert () from /lib/libc.so.7
#31877 0x00000000004cc6e1 in mongo::Logstream::flush (this=0x807c7c680, t=0x0) at mutex.hpp:137
#31878 0x00000000004cafee in mongo::Logstream::operator<< (this=0x807c7c680, _endl=Variable "_endl" is not available.
) at log.h:230
#31879 0x000000000075129c in mongo::shutdownServer () at db/instance.cpp:721
#31880 0x0000000000751faf in mongo::dbexit (rc=mongo::EXIT_ABRUPT, why=0x8dba9d "", tryToGetLock=Variable "tryToGetLock" is not available.
) at db/instance.cpp:821
#31881 0x00000000008adac4 in mongo::abruptQuit (x=Variable "x" is not available.
) at db/db.cpp:1116
#31882 <signal handler called>
#31883 0x0000000801af410c in thr_kill () from /lib/libc.so.7
#31884 0x0000000801b9083b in abort () from /lib/libc.so.7
#31885 0x0000000801b79c15 in __assert () from /lib/libc.so.7
#31886 0x00000000005496e4 in mongo::LoggingManager::rotate (this=0xada320) at mutex.hpp:137
#31887 <signal handler called>
#31888 0x0000000801b93a3c in write () from /lib/libc.so.7
#31889 0x0000000801b934d0 in memcpy () from /lib/libc.so.7
#31890 0x0000000801b9341b in memcpy () from /lib/libc.so.7
#31891 0x0000000801b9200d in f_prealloc () from /lib/libc.so.7
#31892 0x0000000801b9210a in fflush () from /lib/libc.so.7
#31893 0x00000000004cc3de in mongo::Logstream::flush (this=0x807c7c680, t=0x0) at log.h:459
#31894 0x00000000004cafee in mongo::Logstream::operator<< (this=0x807c7c680, _endl=Variable "_endl" is not available.
) at log.h:230
#31895 0x0000000000769a6d in mongo::DataFileMgr::findAll (ns=0x807fc4c14 "juggler.mondata", startLoc=Variable "startLoc" is not available.
) at db/pdfile.cpp:676
#31896 0x000000000076a14d in mongo::findTableScan (ns=0x807fc4c14 "juggler.mondata", order=Variable "order" is not available.
) at db/pdfile.cpp:691
#31897 0x0000000000790901 in mongo::QueryPlan::newCursor (this=0x802992dc0, startLoc=@0x7ffff95ca6d0, numWanted=Variable "numWanted" is not available.
) at db/queryoptimizer.cpp:201
#31898 0x00000000006628b0 in mongo::UpdateOp::_init (this=0x807c35180) at db/update.cpp:897
#31899 0x000000000078dee9 in mongo::QueryPlanSet::Runner::initOp (op=@0x807c35180) at queryoptimizer.h:117
#31900 0x0000000000796863 in mongo::QueryPlanSet::Runner::run (this=0x7ffff95cab40) at db/queryoptimizer.cpp:586
#31901 0x000000000079aa51 in mongo::QueryPlanSet::runOp (this=0x807e47800, op=@0x8029bbf40) at db/queryoptimizer.cpp:493
#31902 0x000000000079b934 in mongo::MultiPlanScanner::runOpOnce (this=0x807e47900, op=@0x8029bbf40) at db/queryoptimizer.cpp:715
#31903 0x0000000000644938 in mongo::MultiCursor::nextClause (this=0x807cf05b0) at queryoptimizer.h:295
#31904 0x000000000064de74 in MultiCursor (this=0x807cf05b0, ns=0x807fc4c14 "juggler.mondata", pattern=@0x7ffff95cb570, order=@0x7ffff95cb300, op=@0x7ffff95cb430, mayYield=Variable "mayYield" is not available.
) at queryoptimizer.h:344
#31905 0x000000000065d7f8 in mongo::_updateObjects (god=false, ns=0x807fc4c14 "juggler.mondata", updateobj=@0x7ffff95cb820, patternOrig=@0x7ffff95cb570, upsert=false, multi=false, logop=true, debug=@0x807fc5398, rs=0x0) at db/update.cpp:1070
#31906 0x0000000000661260 in mongo::updateObjects (ns=0x807fc4c14 "juggler.mondata", updateobj=@0x7ffff95cb820, patternOrig=@0x7ffff95cb800, upsert=Variable "upsert" is not available.
) at db/update.cpp:1284
#31907 0x0000000000752b78 in mongo::receivedUpdate (m=Variable "m" is not available.
) at db/instance.cpp:468
#31908 0x00000000007561cb in mongo::assembleResponse (m=@0x7ffff95cbd90, dbresponse=@0x7ffff95cbec0, client=@0x8020b2638) at db/instance.cpp:319
#31909 0x00000000008ae739 in mongo::connThread (inPort=0x8020b2600) at db/db.cpp:237
#31910 0x0000000800e78187 in thread_proxy () from /usr/local/lib/libboost_thread.so
#31911 0x000000080198e4d1 in pthread_getprio () from /lib/libthr.so.3
#31912 0x00007ffff93cc000 in ?? ()
Cannot access memory at address 0x7ffff95cc000

I have the following records in the log:

Sun Dec 25 00:00:09 Got signal: 6 (Abort trap: 6).
Sun Dec 25 00:00:09 Backtrace:
Sun Dec 25 00:00:09 dbexit:
Sun Dec 25 00:00:09 Got signal: 6 (Abort trap: 6).
Sun Dec 25 00:00:09 Backtrace:
Sun Dec 25 00:00:09 dbexit: ; exiting immediately
Sun Dec 25 00:00:09 Got signal: 6 (Abort trap: 6).
Sun Dec 25 00:00:09 Backtrace:
Sun Dec 25 00:00:09 Got signal: 6 (Abort trap: 6).
Sun Dec 25 00:00:09 Backtrace:
Sun Dec 25 00:00:09 Got signal: 6 (Abort trap: 6).
Sun Dec 25 00:00:09 Backtrace:

So, as far as I can see, the problem is that the rotation is done inside a signal handler, which can be invoked at any point in the code. As a result, a FILE* is used asynchronously and a libc assert() is triggered.
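
For contrast, the conventional async-signal-safe shape is for the handler to set only a flag while an ordinary thread performs the actual rotation. A minimal sketch, assuming a hypothetical rotateLogs() and log path (not mongod's actual code, and not necessarily what SERVER-4739 implements):

    // Async-signal-safe rotation trigger; rotateLogs() and the paths are
    // illustrative assumptions, not MongoDB's code.
    #include <signal.h>
    #include <unistd.h>
    #include <cstdio>
    #include <chrono>
    #include <thread>

    static volatile sig_atomic_t rotateRequested = 0;

    // Hypothetical stand-in for the real rename/reopen logic.
    static void rotateLogs() {
        std::rename("/var/log/mongodb/mongodb.log",
                    "/var/log/mongodb/mongodb.log.old");
        // a real server would reopen its log stream here
    }

    static void onUsr1(int) {
        rotateRequested = 1;  // setting a sig_atomic_t flag is all we do
    }

    static void rotationWorker() {
        for (;;) {
            if (rotateRequested) {
                rotateRequested = 0;
                rotateLogs();  // ordinary thread: FILE*/malloc are safe here
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
    }

    int main() {
        struct sigaction sa = {};
        sa.sa_handler = onUsr1;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGUSR1, &sa, 0);

        std::thread(rotationWorker).detach();
        for (;;) pause();  // stand-in for the server's main loop
    }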

There are at least two issues in this code:

1. The SIGABRT handler is not one-shot, so a failed assertion inside the SIGABRT handler triggers the handler again, and so on, until the stack is exhausted and SIGSEGV arrives. There is no portable way to install a one-shot handler with signal(); sigaction() with SA_RESETHAND should be used in this case (see the sketch after this list). I also believe you have to specify SA_NODEFER for SIGSEGV, SIGABRT, and similar signals, but I may be wrong.

2. The SIGUSR1 handler uses libc heavily, which is not correct behavior inside a signal handler (the flag-based sketch above shows the safe alternative). There is an excellent paper on this, "Delivering Signals for Fun and Profit", at http://lcamtuf.coredump.cx/signals.txt
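
A minimal sketch of the one-shot installation from issue 1 — illustrative only, not a patch against mongod:

    // One-shot SIGABRT handler via sigaction(); a sketch of issue 1's fix.
    // SA_RESETHAND restores SIG_DFL before the handler runs, so an abort()
    // inside the handler terminates the process instead of recursing.
    #include <signal.h>
    #include <unistd.h>
    #include <stdlib.h>

    static void onAbrt(int) {
        // Only async-signal-safe calls belong here; write() is on that list.
        const char msg[] = "got SIGABRT, exiting\n";
        write(STDERR_FILENO, msg, sizeof(msg) - 1);
        raise(SIGABRT);  // disposition is SIG_DFL again: the process dies
    }

    int main() {
        struct sigaction sa = {};
        sa.sa_handler = onAbrt;
        sa.sa_flags = SA_RESETHAND | SA_NODEFER;  // one-shot, not self-masked
        sigemptyset(&sa.sa_mask);
        sigaction(SIGABRT, &sa, 0);

        abort();  // demo: the handler runs exactly once, then default action
    }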

I'm running mongod on FreeBSD 8.2.

Comment by Eliot Horowitz (Inactive) [ 03/Sep/11 ]

Thanks for the update

Comment by Alexander Azarov [ 03/Sep/11 ]

This has never happened since the upgrade to 1.8.2, and I have no more servers running 1.8.1...
