[SERVER-44094] Deadlock involving "NoopWriter", "ftdc" and the "signalProcessingThread" on shutDown Created: 18/Oct/19  Updated: 29/Oct/23  Resolved: 05/Dec/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.3.1

Type: Bug Priority: Major - P3
Reporter: Lingzhi Deng Assignee: Jason Chan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2019-11-18, Repl 2019-12-02, Repl 2019-12-16
Linked BF Score: 6

 Description   

Mutex at 0x7f340cbbf020 (the ReplicationCoordinatorImpl mutex) held by "ftdc" (Thread 0x7f33f243b700 (LWP 106255)) waited on by "NoopWriter" (Thread 0x7f33d9903700 (LWP 107435))
Mutex at 0x7f340cade1b0 (the ReplicationCoordinatorExternalStateImpl _threadMutex) held by "signalProcessingThread" (Thread 0x7f33fd84d700 (LWP 106162)) waited on by "ftdc" (Thread 0x7f33f243b700 (LWP 106255))

The "signalProcessingThread", in turn, was waiting for the "NoopWriter" thread to exit while holding the ReplicationCoordinatorExternalStateImpl _threadMutex, which closes the cycle.

"NoopWriter" was blocked on the ReplicationCoordinator mutex.

Thread 49: "NoopWriter" (Thread 0x7f33d9903700 (LWP 107435))
#0  0x00007f340368e334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f34036895d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00007f34036894a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f3408c6cbce in __gthread_mutex_lock (__mutex=0x7f340cbbf020) at /opt/mongodbtoolchain/revisions/7249e4fd664e3abbf947f1d1034d4c95d08ce280/stow/gcc-v3.qi0/include/c++/8.2.0/x86_64-mongodb-linux/bits/gthr-default.h:748
#4  std::mutex::lock (this=0x7f340cbbf020) at /opt/mongodbtoolchain/revisions/7249e4fd664e3abbf947f1d1034d4c95d08ce280/stow/gcc-v3.qi0/include/c++/8.2.0/bits/std_mutex.h:103
#5  mongo::Mutex::lock (this=this@entry=0x7f340cbbf008) at src/mongo/platform/mutex.cpp:42
#6  0x00007f340720a0ba in std::lock_guard<mongo::Latch>::lock_guard (__m=..., this=<synthetic pointer>) at /opt/mongodbtoolchain/revisions/7249e4fd664e3abbf947f1d1034d4c95d08ce280/stow/gcc-v3.qi0/include/c++/8.2.0/bits/std_mutex.h:161
#7  mongo::repl::ReplicationCoordinatorImpl::getMyLastAppliedOpTime (this=0x7f340cbbf000) at src/mongo/db/repl/replication_coordinator_impl.cpp:1282
#8  0x00007f34071df02a in mongo::repl::NoopWriter::_writeNoop (this=0x7f340c8f8000, opCtx=<optimized out>) at src/mongo/db/repl/noop_writer.cpp:186

"ftdc" was blocked on the ReplicationCoordinatorExternalStateImpl _threadMutex while holding the ReplicationCoordinator mutex.

Thread 19: "ftdc" (Thread 0x7f33f243b700 (LWP 106255))
#0  0x00007f340368e334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f34036895d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00007f34036894a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f3408c6cbce in __gthread_mutex_lock (__mutex=0x7f340cade1b0) at /opt/mongodbtoolchain/revisions/7249e4fd664e3abbf947f1d1034d4c95d08ce280/stow/gcc-v3.qi0/include/c++/8.2.0/x86_64-mongodb-linux/bits/gthr-default.h:748
#4  std::mutex::lock (this=0x7f340cade1b0) at /opt/mongodbtoolchain/revisions/7249e4fd664e3abbf947f1d1034d4c95d08ce280/stow/gcc-v3.qi0/include/c++/8.2.0/bits/std_mutex.h:103
#5  mongo::Mutex::lock (this=this@entry=0x7f340cade198) at src/mongo/platform/mutex.cpp:42
#6  0x00007f34071e0a16 in std::lock_guard<mongo::Latch>::lock_guard (__m=..., this=<synthetic pointer>) at /opt/mongodbtoolchain/revisions/7249e4fd664e3abbf947f1d1034d4c95d08ce280/stow/gcc-v3.qi0/include/c++/8.2.0/bits/std_mutex.h:161
#7  mongo::repl::ReplicationCoordinatorExternalStateImpl::tooStale (this=0x7f340cade180) at src/mongo/db/repl/replication_coordinator_external_state_impl.cpp:796
#8  0x00007f3406dbc4fe in mongo::repl::ReplicationCoordinatorImpl::processReplSetGetStatus (this=0x7f340cbbf000, response=0x7f33f243a5e0, responseStyle=<optimized out>) at /opt/mongodbtoolchain/revisions/7249e4fd664e3abbf947f1d1034d4c95d08ce280/stow/gcc-v3.qi0/include/c++/8.2.0/bits/unique_ptr.h:342
#9  0x00007f34071ffc59 in mongo::repl::CmdReplSetGetStatus::run (this=<optimized out>, opCtx=0x7f34118b9500, cmdObj=..., result=...) at src/mongo/db/repl/repl_set_get_status_cmd.cpp:67

The "signalProcessingThread" was waiting for the "NoopWriter" thread to exit (std::thread::join in ~PeriodicNoopRunner) while holding the ReplicationCoordinatorExternalStateImpl _threadMutex.

Thread 2: "signalProcessingThread" (Thread 0x7f33fd84d700 (LWP 106162))
#0  0x00007f34036882fd in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f3408df44e3 in __gthread_join (__value_ptr=0x0, __threadid=<optimized out>) at /data/mci/aef77c5561b2a8b9cb8fe784756498d5/toolchain-builder/tmp/build-gcc-v3.sh-xJ5/build/x86_64-mongodb-linux/libstdc++-v3/include/x86_64-mongodb-linux/bits/gthr-default.h:668
#2  std::thread::join (this=this@entry=0x7f34118e01b0) at ../../../../../src/combined/libstdc++-v3/src/c++11/thread.cc:110
#3  0x00007f34071de459 in mongo::repl::NoopWriter::PeriodicNoopRunner::~PeriodicNoopRunner (this=0x7f34118e00e0, __in_chrg=<optimized out>) at src/mongo/db/repl/noop_writer.cpp:78
#4  std::default_delete<mongo::repl::NoopWriter::PeriodicNoopRunner>::operator() (this=0x7f340c8f8058, __ptr=0x7f34118e00e0) at /opt/mongodbtoolchain/revisions/7249e4fd664e3abbf947f1d1034d4c95d08ce280/stow/gcc-v3.qi0/include/c++/8.2.0/bits/unique_ptr.h:81
#5  std::unique_ptr<mongo::repl::NoopWriter::PeriodicNoopRunner, std::default_delete<mongo::repl::NoopWriter::PeriodicNoopRunner> >::reset (__p=0x7f34118e00e0, this=0x7f340c8f8058) at /opt/mongodbtoolchain/revisions/7249e4fd664e3abbf947f1d1034d4c95d08ce280/stow/gcc-v3.qi0/include/c++/8.2.0/bits/unique_ptr.h:382
#6  mongo::repl::NoopWriter::stopWritingPeriodicNoops (this=0x7f340c8f8000) at src/mongo/db/repl/noop_writer.cpp:144
#7  0x00007f34071e386f in mongo::repl::ReplicationCoordinatorExternalStateImpl::shutdown (this=0x7f340cade180, opCtx=0x7f3411c34cc0) at /opt/mongodbtoolchain/revisions/7249e4fd664e3abbf947f1d1034d4c95d08ce280/stow/gcc-v3.qi0/include/c++/8.2.0/bits/unique_ptr.h:342
#8  0x00007f3407210465 in mongo::repl::ReplicationCoordinatorImpl::shutdown (this=0x7f340cbbf000, opCtx=0x7f3411c34cc0) at /opt/mongodbtoolchain/revisions/7249e4fd664e3abbf947f1d1034d4c95d08ce280/stow/gcc-v3.qi0/include/c++/8.2.0/bits/unique_ptr.h:342
#9  0x00007f3407121ec3 in mongo::(anonymous namespace)::shutdownTask (shutdownArgs=...) at src/mongo/db/db.cpp:959
#10 0x00007f3408c742f5 in mongo::unique_function<void (mongo::ShutdownTaskArgs const&)>::operator()(mongo::ShutdownTaskArgs const&) const (args#0=..., this=<optimized out>) at /opt/mongodbtoolchain/revisions/7249e4fd664e3abbf947f1d1034d4c95d08ce280/stow/gcc-v3.qi0/include/c++/8.2.0/bits/stl_deque.h:255
#11 mongo::(anonymous namespace)::runTasks(std::stack<mongo::unique_function<void(const mongo::ShutdownTaskArgs&)>, std::deque<mongo::unique_function<void(const mongo::ShutdownTaskArgs&)>, std::allocator<mongo::unique_function<void(const mongo::ShutdownTaskArgs&)> > > >, const mongo::ShutdownTaskArgs &) (tasks=std::stack wrapping: std::deque with 1 element = {...}, shutdownArgs=...) at src/mongo/util/exit.cpp:61
#12 0x00007f3407081833 in mongo::shutdown (code=code@entry=mongo::EXIT_CLEAN, shutdownArgs=...) at /opt/mongodbtoolchain/revisions/7249e4fd664e3abbf947f1d1034d4c95d08ce280/stow/gcc-v3.qi0/include/c++/8.2.0/bits/move.h:182
#13 0x00007f34078b68c3 in mongo::exitCleanly (code=mongo::EXIT_CLEAN) at src/mongo/util/exit.h:100
#14 mongo::(anonymous namespace)::signalProcessingThread (rotate=mongo::LogFileStatus::kNeedToRotateLogFile) at src/mongo/util/signal_handlers.cpp:199



 Comments   
Comment by Githook User [ 05/Dec/19 ]

Author: Jason Chan (jasonjhchan, jason.chan@mongodb.com)

Message: SERVER-44094 Do not hold the ReplicationCoordinatorExternalStateImpl mutex when shutting down the NoOpWriter
Branch: master
https://github.com/mongodb/mongo/commit/60f700791f65ca676da5e4f0fada421958a8b709

Comment by Siyuan Zhou [ 21/Oct/19 ]

Enforcing a lock ordering will not solve this issue, since waiting on a thread join is involved. We should probably enforce the rule that a thread waiting on a join must not hold any mutex.

Generated at Thu Feb 08 05:04:59 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.