[SERVER-16834] Secondary nodes can hang during shutdown if BGSync::_buffer is full
Created: 13/Jan/15  Updated: 21/Jan/15  Resolved: 14/Jan/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.8.0-rc4
Fix Version/s: 2.8.0-rc5

Type: Bug Priority: Major - P3
Reporter: Andy Schwerin Assignee: Eric Milkie
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-16396 Replication stall, then one secondary... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

During shutdown, the replication consumer threads can stop pulling items out of the BGSync::_buffer queue while the producer thread (oplog tailer/bgsync thread) is blocked trying to insert an item into the same fixed-size queue.

For example, in 2.8.0-rc5-pre-, we can see the following two stacks in a hung system. Thread 3 is stuck because nobody is draining the BGSync::_buffer, and thread 2 is stuck because thread 3 never makes progress and so never checks for shutdown.

Thread 3 (Thread 0x7ed14c6f9700 (LWP 17201)):
#0  0x0000003887c0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000bf4690 in void boost::condition_variable_any::wait<boost::unique_lock<boost::timed_mutex> >(boost::unique_lock<boost::timed_mutex>&) ()
    at src/third_party/boost/boost/thread/pthread/condition_variable.hpp:137
#2  0x0000000000bf82d3 in mongo::repl::BackgroundSync::produce(mongo::OperationContext*) () at src/mongo/util/queue.h:76
#3  0x0000000000bf981e in mongo::repl::BackgroundSync::_producerThread() () at src/mongo/db/repl/bgsync.cpp:193
...

Thread 2 (Thread 0x7ed12c747700 (LWP 17397)):
#0  0x0000003887c0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000f9a8db in boost::thread::join() () at src/third_party/boost/boost/thread/pthread/condition_variable.hpp:56
#2  0x0000000000c564a5 in mongo::repl::ReplicationCoordinatorExternalStateImpl::shutdown() () at src/mongo/db/repl/replication_coordinator_external_state_impl.cpp:107
#3  0x0000000000c5b1f3 in mongo::repl::ReplicationCoordinatorImpl::shutdown() () at src/mongo/db/repl/replication_coordinator_impl.cpp:371
#4  0x0000000000aa429a in mongo::exitCleanly(mongo::ExitCode) () at src/mongo/db/instance.cpp:1101
#5  0x00000000009cf75a in mongo::CmdShutdown::shutdownHelper() () at src/mongo/db/dbcommands_generic.cpp:325
...
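
The pattern in the two stacks above can be reproduced in miniature. The following is a minimal sketch, not the server code: BoundedQueue, producerHangsOnFullQueue and shutdownRequested are hypothetical names, and the queue is a simplified stand-in for the fixed-size queue behind BGSync::_buffer. It shows a producer that checks a shutdown flag only between pushes, so it cannot notice shutdown while it is blocked on a full queue that nobody drains.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <future>
#include <mutex>
#include <queue>

// Hypothetical fixed-size blocking queue (illustration only).
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t cap) : _cap(cap) {}

    // Blocks while the queue is full; there is no way to interrupt the wait.
    void push(int v) {
        std::unique_lock<std::mutex> lk(_m);
        _cv.wait(lk, [&] { return _q.size() < _cap; });
        _q.push(v);
        _cv.notify_all();
    }

    // Blocks while the queue is empty.
    int pop() {
        std::unique_lock<std::mutex> lk(_m);
        _cv.wait(lk, [&] { return !_q.empty(); });
        int v = _q.front();
        _q.pop();
        _cv.notify_all();
        return v;
    }

    std::size_t size() {
        std::lock_guard<std::mutex> lk(_m);
        return _q.size();
    }

private:
    std::mutex _m;
    std::condition_variable _cv;
    std::queue<int> _q;
    std::size_t _cap;
};

// Returns true if the producer is still blocked after shutdown was requested.
bool producerHangsOnFullQueue() {
    BoundedQueue q(1);
    std::atomic<bool> shutdownRequested{false};

    q.push(1);  // Fill the queue; no consumer is running.
    assert(q.size() == 1);

    // Producer: the shutdown flag is only checked after each push returns.
    auto producer = std::async(std::launch::async, [&] {
        do {
            q.push(2);  // Blocks here because the queue is full.
        } while (!shutdownRequested.load());
    });

    shutdownRequested.store(true);

    // A thread joining the producer at this point would wait indefinitely;
    // a bounded wait is used here so the sketch terminates.
    bool stuck = producer.wait_for(std::chrono::milliseconds(200)) ==
                 std::future_status::timeout;

    q.pop();         // Drain one item so the blocked push can complete.
    producer.get();  // The producer now sees the flag and exits.
    return stuck;
}
```

Calling producerHangsOnFullQueue() is expected to return true: setting the flag alone does not wake the producer, which mirrors thread 2 waiting in join() on thread 3.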



 Comments   
Comment by Githook User [ 15/Jan/15 ]

Author: Eric Milkie (milkie) <milkie@10gen.com>

Message: SERVER-16834 add comments
Branch: master
https://github.com/mongodb/mongo/commit/b19f1aa26afe5d8b8cbd62b067fc67a353eb48a9

Comment by Githook User [ 14/Jan/15 ]

Author: Eric Milkie (milkie) <milkie@10gen.com>

Message: SERVER-16834 bgsync::shutdown() now unblocks threads waiting in BlockingQueue::push()
Branch: master
https://github.com/mongodb/mongo/commit/ec1736b050bff0dd5a4ef8037d951e58a8e21da3
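
One common way to let a shutdown path unblock a thread waiting in a blocking push is sketched below. This is an assumption for illustration and is not taken from the commit, which may use a different mechanism; ShutdownAwareQueue and shutdownUnblocksProducer are hypothetical names. The wait predicate includes a shutdown flag, and shutdown() sets that flag under the lock and then notifies all waiters.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical bounded queue whose push() can be interrupted by shutdown().
class ShutdownAwareQueue {
public:
    explicit ShutdownAwareQueue(std::size_t cap) : _cap(cap) {}

    // Returns false if shutdown was requested before the item could be added.
    bool push(int v) {
        std::unique_lock<std::mutex> lk(_m);
        _cv.wait(lk, [&] { return _shutdown || _q.size() < _cap; });
        if (_shutdown) {
            return false;
        }
        _q.push(v);
        _cv.notify_all();
        return true;
    }

    // Wakes every thread blocked in push().
    void shutdown() {
        {
            std::lock_guard<std::mutex> lk(_m);
            _shutdown = true;
        }
        _cv.notify_all();
    }

    std::size_t size() {
        std::lock_guard<std::mutex> lk(_m);
        return _q.size();
    }

private:
    std::mutex _m;
    std::condition_variable _cv;
    std::queue<int> _q;
    std::size_t _cap;
    bool _shutdown = false;
};

// Returns true if shutdown() released the producer without adding its item.
bool shutdownUnblocksProducer() {
    ShutdownAwareQueue q(1);
    bool filled = q.push(1);  // Fill the queue; no consumer is running.
    assert(filled);
    (void)filled;

    bool lastPushSucceeded = true;
    std::thread producer([&] { lastPushSucceeded = q.push(2); });

    q.shutdown();     // Without this call, join() below would block.
    producer.join();  // Returns once the producer leaves push().

    return !lastPushSucceeded && q.size() == 1;
}
```

In this sketch the join always completes: push() returns false whether the producer reaches the wait before or after shutdown() runs, because the flag is read under the same mutex that guards the queue.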
