[SERVER-24582] MongoDB secondary node crashes randomly when primary node is killed/dies Created: 15/Jun/16  Updated: 08/Jan/24  Resolved: 24/Jun/16

Status: Closed
Project: Core Server
Component/s: Networking, Replication
Affects Version/s: 3.2.6
Fix Version/s: 3.2.8, 3.3.9

Type: Bug Priority: Major - P3
Reporter: Jimmy [X] Assignee: Mira Carey
Resolution: Done Votes: 0
Labels: code-only
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

FreeBSD 11 CURRENT


Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Sprint: Platforms 16 (06/24/16)
Participants:

 Description   

The crash comes from unhandled exception thrown in ~AsyncStream() destructor, which calls destroyStream() function, which, in turn, calls incorrect version of close() on ASIO socket.

Relevant code piece from async_stream_common.h

....
 
std::error_code ec;
    stream->shutdown(asio::ip::tcp::socket::shutdown_both, ec);
    stream->close(); // <----- probably meant to call another overload close(ec)
    if (ec) {
        logCloseFailed(ec);
}
 
....

Most likely another close() overload was meant to be used that accepts ec as an argument. The overload used here just throws an exceptions if socket is closed.



 Comments   
Comment by Githook User [ 24/Jun/16 ]

Author:

{u'username': u'hanumantmk', u'name': u'Jason Carey', u'email': u'jcarey@argv.me'}

Message: SERVER-24582 handle errors in asio close

(cherry picked from commit b370d7ef98005a32ae9534a7a4f62fb7539ab85d)
Branch: v3.2
https://github.com/mongodb/mongo/commit/59ec7595abf1971122fa21fa7f2c211c3292dbeb

Comment by Githook User [ 24/Jun/16 ]

Author:

{u'username': u'hanumantmk', u'name': u'Jason Carey', u'email': u'jcarey@argv.me'}

Message: SERVER-24582 handle errors in asio close
Branch: master
https://github.com/mongodb/mongo/commit/b370d7ef98005a32ae9534a7a4f62fb7539ab85d

Comment by Jimmy [X] [ 24/Jun/16 ]

Thanks for the heads-up, because this bug is practically a stopper right now for deploying MongoDB in production with our FreeBSD setup.

Comment by Ramon Fernandez Marina [ 24/Jun/16 ]

Thanks for the additional information KarrotKake. We have a fix in code review, so we expect to be able to fix this bug soon. Please continue to watch the ticket for updates.

Regards,
Ramón.

Comment by Jimmy [X] [ 24/Jun/16 ]

Stack trace from GDB:

#0  0x00000008042762da in thr_kill () from /lib/libc.so.7
#1  0x00000008042762ab in raise () from /lib/libc.so.7
#2  0x00000000012736ef in mongo::(anonymous namespace)::endProcessWithSignal (signalNum=6)
    at src/mongo/util/signal_handlers_synchronous.cpp:92
#3  mongo::(anonymous namespace)::myTerminate () at src/mongo/util/signal_handlers_synchronous.cpp:232
#4  0x00000000009997ab in __clang_call_terminate ()
#5  0x000000000104cd4d in mongo::executor::AsyncStream::~AsyncStream (this=<optimized out>)
    at src/mongo/executor/async_stream.cpp:48
#6  0x000000000104cd6e in mongo::executor::AsyncStream::~AsyncStream (this=0x80a035d00)
    at src/mongo/executor/async_stream.cpp:46
#7  0x000000000105b9ea in std::__1::default_delete<mongo::executor::AsyncStreamInterface>::operator() (__ptr=0x1891a, 
    this=<optimized out>) at /usr/include/c++/v1/memory:2516
#8  0x000000000105b713 in std::__1::default_delete<mongo::executor::NetworkInterfaceASIO::AsyncOp>::operator() (
    __ptr=0x80a23d800, this=<optimized out>) at /usr/include/c++/v1/memory:2516
#9  std::__1::unique_ptr<mongo::executor::NetworkInterfaceASIO::AsyncOp, std::__1::default_delete<mongo::executor::NetworkInterfaceASIO::AsyncOp> >::reset (this=<optimized out>, __p=<optimized out>) at /usr/include/c++/v1/memory:2715
#10 std::__1::unique_ptr<mongo::executor::NetworkInterfaceASIO::AsyncOp, std::__1::default_delete<mongo::executor::NetworkInterfaceASIO::AsyncOp> >::~unique_ptr (this=<optimized out>) at /usr/include/c++/v1/memory:2683
#11 mongo::executor::connection_pool_asio::ASIOConnection::~ASIOConnection (this=0x80a22e700)
    at src/mongo/executor/connection_pool_asio.h:70
#12 0x000000000105b7ae in mongo::executor::connection_pool_asio::ASIOConnection::~ASIOConnection (this=0x80a22e700)
    at src/mongo/executor/connection_pool_asio.h:70
#13 0x0000000001054ac1 in std::__1::default_delete<mongo::executor::ConnectionPool::ConnectionInterface>::operator() (
    __ptr=0x80a22e700, this=<optimized out>) at /usr/include/c++/v1/memory:2516
#14 std::__1::unique_ptr<mongo::executor::ConnectionPool::ConnectionInterface, std::__1::default_delete<mongo::executor::ConnectionPool::ConnectionInterface> >::reset (this=<optimized out>, __p=<optimized out>) at /usr/include/c++/v1/memory:2715
#15 mongo::executor::ConnectionPool::SpecificPool::fulfillRequests (this=<optimized out>, lk=...)
    at src/mongo/executor/connection_pool.cpp:461
#16 0x00000000010533e6 in mongo::executor::ConnectionPool::SpecificPool::getConnection(mongo::HostAndPort const&, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >, std::__1::unique_lock<std::__1::mutex>, std::__1::function<void (mongo::StatusWith<std::__1::unique_ptr<mongo::executor::ConnectionPool::ConnectionInterface, mongo::executor::ConnectionPo---Type <return> to continue, or q <return> to quit---
ol::ConnectionHandleDeleter> >)>) (this=0x80a227000, hostAndPort=..., timeout=..., lk=..., cb=...)
    at src/mongo/executor/connection_pool.cpp:281
#17 0x00000000010531a4 in mongo::executor::ConnectionPool::get(mongo::HostAndPort const&, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >, std::__1::function<void (mongo::StatusWith<std::__1::unique_ptr<mongo::executor::ConnectionPool::ConnectionInterface, mongo::executor::ConnectionPool::ConnectionHandleDeleter> >)>) (this=<optimized out>, 
    hostAndPort=..., timeout=..., cb=...) at src/mongo/executor/connection_pool.cpp:212
#18 0x0000000001060126 in mongo::executor::NetworkInterfaceASIO::startCommand(mongo::executor::TaskExecutor::CallbackHandle const&, mongo::executor::RemoteCommandRequest const&, std::__1::function<void (mongo::StatusWith<mongo::executor::RemoteCommandResponse> const&)> const&) (this=<optimized out>, cbHandle=..., request=..., onFinish=...)
    at src/mongo/executor/network_interface_asio.cpp:314
#19 0x0000000000ece71f in mongo::repl::ReplicationExecutor::scheduleRemoteCommand(mongo::executor::RemoteCommandRequest const&, std::__1::function<void (mongo::executor::TaskExecutor::RemoteCommandCallbackArgs const&)> const&) (
    this=<optimized out>, request=..., cb=...) at src/mongo/db/repl/replication_executor.cpp:344
#20 0x0000000000ec2b1a in mongo::repl::ReplicationCoordinatorImpl::_doMemberHeartbeat (this=0x804bdce00, cbData=..., 
    target=..., targetIndex=<optimized out>) at src/mongo/db/repl/replication_coordinator_impl_heartbeat.cpp:98
#21 0x0000000000ec8d41 in std::__1::__invoke<void (mongo::repl::ReplicationCoordinatorImpl::*&)(mongo::executor::TaskExecutor::CallbackArgs, mongo::HostAndPort const&, int), mongo::repl::ReplicationCoordinatorImpl*&, mongo::executor::TaskExecutor::CallbackArgs const&, mongo::HostAndPort&, int&, void> (__args=<optimized out>, __f=<optimized out>, 
    __a0=<optimized out>, __args=<optimized out>, __args=<optimized out>) at /usr/include/c++/v1/__functional_base:382
#22 std::__1::__apply_functor<void (mongo::repl::ReplicationCoordinatorImpl::*)(mongo::executor::TaskExecutor::CallbackArgs, mongo::HostAndPort const&, int), std::__1::tuple<mongo::repl::ReplicationCoordinatorImpl*, std::__1::placeholders::__ph<1>, mongo::HostAndPort, int>, 0ul, 1ul, 2ul, 3ul, std::__1::tuple<mongo::executor::TaskExecutor::CallbackArgs const&> > (
    __f=<optimized out>, __bound_args=..., __args=<optimized out>) at /usr/include/c++/v1/functional:2056
#23 std::__1::__bind<void (mongo::repl::ReplicationCoordinatorImpl::*)(mongo::executor::TaskExecutor::CallbackArgs, mongo::HostAndPort const&, int), mongo::repl::ReplicationCoordinatorImpl*, std::__1::placeholders::__ph<1>&, mongo::HostAndPort const&, int&>::operator()<mongo::executor::TaskExecutor::CallbackArgs const&> (this=0x807fe2db8, __args=...)
    at /usr/include/c++/v1/functional:2119
#24 std::__1::__invoke<std::__1::__bind<void (mongo::repl::ReplicationCoordinatorImpl::*)(mongo::executor::TaskExecutor::CallbackArgs, mongo::HostAndPort const&, int), mongo::repl::ReplicationCoordinatorImpl*, std::__1::placeholders::__ph<1>&, mongo::HostAndPort const&, int&>&, mongo::executor::TaskExecutor::CallbackArgs const&> (__f=..., __args=...)
    at /usr/include/c++/v1/__functional_base:415
---Type <return> to continue, or q <return> to quit---
#25 std::__1::__invoke_void_return_wrapper<void>::__call<std::__1::__bind<void (mongo::repl::ReplicationCoordinatorImpl::*)(mongo::executor::TaskExecutor::CallbackArgs, mongo::HostAndPort const&, int), mongo::repl::ReplicationCoordinatorImpl*, std::__1::placeholders::__ph<1>&, mongo::HostAndPort const&, int&>&, mongo::executor::TaskExecutor::CallbackArgs const&> (
    __args=..., __args=...) at /usr/include/c++/v1/__functional_base:440
#26 0x0000000000ed0130 in std::__1::function<void ()>::operator()() const (this=<optimized out>)
    at /usr/include/c++/v1/functional:1789
#27 mongo::repl::(anonymous namespace)::callNoExcept(std::__1::function<void ()> const&) (fn=...)
    at src/mongo/db/repl/replication_executor.cpp:637
#28 0x0000000000ecbef0 in std::__1::function<void ()>::operator()() const (this=<optimized out>)
    at /usr/include/c++/v1/functional:1789
#29 mongo::repl::ReplicationExecutor::run (this=<optimized out>) at src/mongo/db/repl/replication_executor.cpp:146
#30 0x0000000000ed02d0 in mongo::repl::ReplicationExecutor::startup()::$_0::operator()() const (this=<optimized out>)
    at src/mongo/db/repl/replication_executor.cpp:159
#31 std::__1::__invoke<mongo::repl::ReplicationExecutor::startup()::$_0> (__f=<optimized out>)
    at /usr/include/c++/v1/__functional_base:415
#32 std::__1::__thread_execute<mongo::repl::ReplicationExecutor::startup()::$_0>(std::__1::tuple<mongo::repl::ReplicationExecutor::startup()::$_0>&, std::__1::__tuple_indices<>) (__t=...) at /usr/include/c++/v1/thread:337
#33 std::__1::__thread_proxy<std::__1::tuple<mongo::repl::ReplicationExecutor::startup()::$_0> >(void*) (__vp=0x804a30588)
    at /usr/include/c++/v1/thread:347
#34 0x0000000803f90aa5 in ?? () from /lib/libthr.so.3
#35 0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffde3f0000
(gdb) frame 5
#5  0x000000000104cd4d in mongo::executor::AsyncStream::~AsyncStream (this=<optimized out>)
    at src/mongo/executor/async_stream.cpp:48
48	}
(gdb) list
43	AsyncStream::AsyncStream(asio::io_service::strand* strand)
44	    : _strand(strand), _stream(_strand->get_io_service()) {}
45	
46	AsyncStream::~AsyncStream() {
47	    destroyStream(&_stream, _connected);
48	}
49	
50	void AsyncStream::connect(tcp::resolver::iterator iter, ConnectHandler&& connectHandler) {
51	    asio::async_connect(
52	        _stream,
(gdb)

Log at the time of crash:

2016-06-14T10:48:42.241+0300 I REPL     [ReplicationExecutor] could not find member to sync from
2016-06-14T10:48:42.247+0300 I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to <erased host>:27017
2016-06-14T10:48:42.404+0300 F -        [ReplicationExecutor] terminate() called. An exception is active; attempting to gather more information
2016-06-14T10:48:42.486+0300 F -        [ReplicationExecutor] std::exception::what(): close: Connection reset by peer
Actual exception type: std::__1::system_error
 
 0x127428b 0x12736b3 0x9997ab 0x104cd4d 0x104cd6e 0x105b9ea 0x105b713 0x105b7ae 0x1054ac1 0x10533e6 0x10531a4 0x1060126 0xece71f 0xec2b1a 0xec8d41 0xed0130 0xecbef0 0xed02d0 0x803f90aa5
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"E7428B","s":"_ZN5mongo15printStackTraceERNSt3__113basic_ostreamIcNS0_11char_traitsIcEEEE"},{"b":"400000","o":"E736B3","s":"_ZN5mongo30setupSynchronousSignalHandlersEv"},{"b":"400000","o":"5997AB","s":"_ZN5mongo15NamespaceStringD2Ev"},{"b":"400000","o":"C4CD4D","s":"_ZN5mongo8executor11AsyncStreamD2Ev"},{"b":"400000","o":"C4CD6E","s":"_ZN5mongo8executor11AsyncStreamD0Ev"},{"b":"400000","o":"C5B9EA","s":"_ZN5mongo8executor20NetworkInterfaceASIO7AsyncOpD2Ev"},{"b":"400000","o":"C5B713","s":"_ZN5mongo8executor20connection_pool_asio14ASIOConnectionD2Ev"},{"b":"400000","o":"C5B7AE","s":"_ZN5mongo8executor20connection_pool_asio14ASIOConnectionD0Ev"},{"b":"400000","o":"C54AC1","s":"_ZN5mongo8executor14ConnectionPool12SpecificPool15fulfillRequestsERNSt3__111unique_lockINS3_5mutexEEE"},{"b":"400000","o":"C533E6","s":"_ZN5mongo8executor14ConnectionPool12SpecificPool13getConnectionERKNS_11HostAndPortENSt3__16chrono8durationIxNS6_5ratioILl1ELl1000EEEEENS6_11unique_lockINS6_5mutexEEENS6_8functionIFvNS_10StatusWithINS6_10unique_ptrINS1_19ConnectionInterfaceENS1_23ConnectionHandleDeleterEEEEEEEE"},{"b":"400000","o":"C531A4","s":"_ZN5mongo8executor14ConnectionPool3getERKNS_11HostAndPortENSt3__16chrono8durationIxNS5_5ratioILl1ELl1000EEEEENS5_8functionIFvNS_10StatusWithINS5_10unique_ptrINS1_19ConnectionInterfaceENS1_23ConnectionHandleDeleterEEEEEEEE"},{"b":"400000","o":"C60126","s":"_ZN5mongo8executor20NetworkInterfaceASIO12startCommandERKNS0_12TaskExecutor14CallbackHandleERKNS0_20RemoteCommandRequestERKNSt3__18functionIFvRKNS_10StatusWithINS0_21RemoteCommandResponseEEEEEE"},{"b":"400000","o":"ACE71F","s":"_ZN5mongo4repl19ReplicationExecutor21scheduleRemoteCommandERKNS_8executor20RemoteCommandRequestERKNSt3__18functionIFvRKNS2_12TaskExecutor25RemoteCommandCallbackArgsEEEE"},{"b":"400000","o":"AC2B1A","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl18_doMemberHeartbeatENS_8executor12TaskExecutor12CallbackArgsERKNS_11HostAndPortEi"},{"b":"400000","o":"AC8D41","s":"_ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRNS_6__bindIMN5mongo4repl26ReplicationCoordinatorImplEFvNS4_8executor12TaskExecutor12CallbackArgsERKNS4_11HostAndPortEiEJPS6_RNS_12placeholders4__phILi1EEESC_RiEEERKS9_EEEvDpOT_"},{"b":"400000","o":"AD0130","s":"_ZN5mongo4repl19ReplicationExecutor8Callback17waitForCompletionEv"},{"b":"400000","o":"ACBEF0","s":"_ZN5mongo4repl19ReplicationExecutor3runEv"},{"b":"400000","o":"AD02D0","s":"_ZN5mongo4repl19ReplicationExecutor8Callback17waitForCompletionEv"},{"b":"803F88000","o":"8AA5","s":"pthread_create"}],"processInfo":{ "mongodbVersion" : "3.2.6", "gitVersion" : "05552b562c7a0b3143a729aaa0838e558dc49b25", "compiledModules" : [], "uname" : { "sysname" : "FreeBSD", "release" : "11.0-ALPHA3", "version" : "FreeBSD 11.0-ALPHA3 #0 r301815: Sat Jun 11 11:13:36 EEST 2016     root@bsd1.dev.ukr.net:/usr/obj/usr/src/sys/SYSTEM", "machine" : "amd64" } }}
 mongod(_ZN5mongo15printStackTraceERNSt3__113basic_ostreamIcNS0_11char_traitsIcEEEE+0x2B) [0x127428b]
 mongod(_ZN5mongo30setupSynchronousSignalHandlersEv+0x2C3) [0x12736b3]
 mongod(_ZN5mongo15NamespaceStringD2Ev+0x1B) [0x9997ab]
 mongod(_ZN5mongo8executor11AsyncStreamD2Ev+0x5D) [0x104cd4d]
 mongod(_ZN5mongo8executor11AsyncStreamD0Ev+0xE) [0x104cd6e]
 mongod(_ZN5mongo8executor20NetworkInterfaceASIO7AsyncOpD2Ev+0xCA) [0x105b9ea]
 mongod(_ZN5mongo8executor20connection_pool_asio14ASIOConnectionD2Ev+0x43) [0x105b713]
 mongod(_ZN5mongo8executor20connection_pool_asio14ASIOConnectionD0Ev+0xE) [0x105b7ae]
 mongod(_ZN5mongo8executor14ConnectionPool12SpecificPool15fulfillRequestsERNSt3__111unique_lockINS3_5mutexEEE+0x281) [0x1054ac1]
 mongod(_ZN5mongo8executor14ConnectionPool12SpecificPool13getConnectionERKNS_11HostAndPortENSt3__16chrono8durationIxNS6_5ratioILl1ELl1000EEEEENS6_11unique_lockINS6_5mutexEEENS6_8functionIFvNS_10StatusWithINS6_10unique_ptrINS1_19ConnectionInterfaceENS1_23ConnectionHandleDeleterEEEEEEEE+0x176) [0x10533e6]
 mongod(_ZN5mongo8executor14ConnectionPool3getERKNS_11HostAndPortENSt3__16chrono8durationIxNS5_5ratioILl1ELl1000EEEEENS5_8functionIFvNS_10StatusWithINS5_10unique_ptrINS1_19ConnectionInterfaceENS1_23ConnectionHandleDeleterEEEEEEEE+0x1B4) [0x10531a4]
 mongod(_ZN5mongo8executor20NetworkInterfaceASIO12startCommandERKNS0_12TaskExecutor14CallbackHandleERKNS0_20RemoteCommandRequestERKNSt3__18functionIFvRKNS_10StatusWithINS0_21RemoteCommandResponseEEEEEE+0x506) [0x1060126]
 mongod(_ZN5mongo4repl19ReplicationExecutor21scheduleRemoteCommandERKNS_8executor20RemoteCommandRequestERKNSt3__18functionIFvRKNS2_12TaskExecutor25RemoteCommandCallbackArgsEEEE+0x7FF) [0xece71f]
 mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl18_doMemberHeartbeatENS_8executor12TaskExecutor12CallbackArgsERKNS_11HostAndPortEi+0x39A) [0xec2b1a]
 mongod(_ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRNS_6__bindIMN5mongo4repl26ReplicationCoordinatorImplEFvNS4_8executor12TaskExecutor12CallbackArgsERKNS4_11HostAndPortEiEJPS6_RNS_12placeholders4__phILi1EEESC_RiEEERKS9_EEEvDpOT_+0x71) [0xec8d41]
 mongod(_ZN5mongo4repl19ReplicationExecutor8Callback17waitForCompletionEv+0x50) [0xed0130]
 mongod(_ZN5mongo4repl19ReplicationExecutor3runEv+0x4C0) [0xecbef0]
 mongod(_ZN5mongo4repl19ReplicationExecutor8Callback17waitForCompletionEv+0x1F0) [0xed02d0]
libthr.so.3(pthread_create+0x925) [0x803f90aa5]
-----  END BACKTRACE  -----

Notice the "2016-06-14T10:48:42.486+0300 F - [ReplicationExecutor] std::exception::what(): close: Connection reset by peer" from the log. That is ASIO's socket throwing an exception when called on already reset/closed connection.

Comment by Jimmy [X] [ 16/Jun/16 ]

As soon as I'll be able to get to them I'll post them. But even at a glance there is an inconsistency in this code: you use non-throwing shutdown(), but at the same time use throwing close() and don't handle potential exception.

Comment by Ramon Fernandez Marina [ 15/Jun/16 ]

KarrotKake, can you please upload the logs for the affected node(s)? If there's a stack trace it would be useful to have it on the ticket.

Thanks,
Ramón.

Generated at Thu Feb 08 04:06:47 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.