[SERVER-8251] Startup hangs infinitely, DataFileSync background job cannot create new thread Created: 20/Jan/13  Updated: 15/Feb/13  Resolved: 22/Jan/13

Status: Closed
Project: Core Server
Component/s: Concurrency
Affects Version/s: 2.2.0
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: WangYu Assignee: Andy Schwerin
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS release 5.8 (Final), 2.6.18-308.16.1.el5 #1 SMP Tue Oct 2 22:01:43 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux


Issue Links:
Duplicate
duplicates SERVER-7434 Startup race with --fork Closed
is duplicated by SERVER-8252 Startup hangs infinitely, DataFileSyn... Closed
Operating System: ALL
Participants:

 Description   

Installed the 2.2.0 rpm package from 10gen repo. 'service mongod start' creates 3 processes:

root 12603 9583 0 12:59 pts/0 00:00:00 /bin/sh /sbin/service mongod restart
root 12608 12603 0 12:59 pts/0 00:00:00 /bin/bash /etc/init.d/mongod restart
root 12622 12608 0 12:59 pts/0 00:00:00 runuser -s /bin/bash - mongod -c ulimit -S -c 0 >/dev/null 2>&1 ; numactl --interleave=all /usr/bin/mongod -f /etc/mongod.
mongod 12623 12622 0 12:59 ? 00:00:00 -bash -c ulimit -S -c 0 >/dev/null 2>&1 ; numactl --interleave=all /usr/bin/mongod -f /etc/mongod.conf
mongod 12645 12623 1 12:59 ? 00:00:00 /usr/bin/mongod -f /etc/mongod.conf
mongod 12647 12645 0 12:59 ? 00:00:00 /usr/bin/mongod -f /etc/mongod.conf
mongod 12648 12647 0 12:59 ? 00:00:00 /usr/bin/mongod -f /etc/mongod.conf

strace of PID 12648, the third - obviously hanging - process gives:
...
futex(0x13dc400, FUTEX_WAIT_PRIVATE, 2,

{0, 67669616}

) = -1 ETIMEDOUT (Connection timed out)
futex(0x13dc400, FUTEX_WAIT_PRIVATE, 2,

{0, 60868464}

) = -1 ETIMEDOUT (Connection timed out)
futex(0x13dc400, FUTEX_WAIT_PRIVATE, 2,

{0, 167533952}

) = -1 ETIMEDOUT (Connection timed out)
...

gdb:

Thread 2 (Thread 0x40a87940 (LWP 12649)):
#0 0x00000000008d2996 in base::internal::SpinLockDelay(int volatile*, int, int) ()
#1 0x000000000086210c in SpinLock::SlowLock() ()
#2 0x0000000000866056 in tcmalloc::ThreadCache::CreateCacheIfNecessary() ()
#3 0x00000000009b0857 in ?? ()
#4 0x0000000000c22872 in tc_malloc ()
#5 0x00000000009e63aa in boost::detail::get_once_per_thread_epoch() ()
#6 0x00000000007c4ff8 in void boost::call_once<void ()>(boost::once_flag&, void ()) ()
#7 0x00000000007c1e57 in boost::detail::set_current_thread_data(boost::detail::thread_data_base*) ()
#8 0x00000000007c3646 in ?? ()
#9 0x0000003124e0677d in start_thread () from /lib64/libpthread.so.0
#10 0x00000031246d3c1d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2b5613d478c0 (LWP 12648)):
#0 0x00000000008d2996 in base::internal::SpinLockDelay(int volatile*, int, int) ()
#1 0x000000000086210c in SpinLock::SlowLock() ()
#2 0x00000000008b92db in tcmalloc::CentralFreeList::Populate() ()
#3 0x00000000008b9498 in tcmalloc::CentralFreeList::FetchFromSpansSafe() ()
#4 0x00000000008b9534 in tcmalloc::CentralFreeList::RemoveRange(void*, void*, int) ()
#5 0x0000000000865c0d in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) ()
#6 0x00000000009b0f2f in ?? ()
#7 0x0000000000c21c95 in tc_new ()
#8 0x0000000000599e14 in boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf1<void, mongo::BackgroundJob, boost::shared_ptr<mongo::BackgroundJob::JobStatus> >, boost::_bi::list2<boost::_bi::value<mongo::BackgroundJob*>, boost::_bi::value<boost::shared_ptr<mongo::BackgroundJob::JobStatus> > > > >* boost::detail::heap_new_impl<boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf1<void, mongo::BackgroundJob, boost::shared_ptr<mongo::BackgroundJob::JobStatus> >, boost::_bi::list2<boost::_bi::value<mongo::BackgroundJob*>, boost::_bi::value<boost::shared_ptr<mongo::BackgroundJob::JobStatus> > > > >, boost::_bi::bind_t<void, boost::_mfi::mf1<void, mongo::BackgroundJob, boost::shared_ptr<mongo::BackgroundJob::JobStatus> >, boost::_bi::list2<boost::_bi::value<mongo::BackgroundJob*>, boost::_bi::value<boost::shared_ptr<mongo::BackgroundJob::JobStatus> > > >&>(boost::_bi::bind_t<void, boost::_mfi::mf1<void, mongo::BackgroundJob, boost::shared_ptr<mongo::BackgroundJob::JobStatus> >, boost::_bi::list2<boost::_bi::value<mongo::BackgroundJob*>, boost::_bi::value<boost::shared_ptr<mongo::BackgroundJob::JobStatus> > > >&) ()
#9 0x00000000005954fa in mongo::BackgroundJob::go() ()
#10 0x00000000005630cc in ?? ()
#11 0x0000000000565399 in main ()

This behaviour is somewhat random, because sometimes the startup works.

Notes: I rebuilt mongod from source r2.2.0, stripped the binary manually and to my surprise this binary, does not show this behaviour. Alas, another binary installed with 'scons install' always hangs.



 Comments   
Comment by Andy Schwerin [ 22/Jan/13 ]

Reproduced and confirmed duplicate of SERVER-7434. Fix to follow.

Comment by Andy Schwerin [ 22/Jan/13 ]

Two more questions.

1.) What is your virtualization platform – Amazon EC2, VMWare, Qemu/KVM, or something else?

2.) Do you see this hang with the 2.2.2 release (the current version) or just 2.2.0? I believe this problem will affect both problems, but continue to have trouble constructing a repro in house.

Comment by WangYu [ 22/Jan/13 ]

It's a VM.
-bash-4.1# uname -a
Linux vm210-171 2.6.32-220.17.1.el6.x86_64 #1 SMP Thu Apr 26 13:37:13 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
-bash-4.1# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.2 (Santiago)

This issue usually happened when the CPU utilization is high.

Comment by Andy Schwerin [ 20/Jan/13 ]

I've heard of users experiencing this before, but have yet to reproduce it. Can you describe your hardware or virtual machine configuration?

Generated at Thu Feb 08 03:16:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.