[SERVER-18203] Aggregation distributed deadlock with sharding Created: 14/Apr/15  Updated: 19/Sep/15  Resolved: 05/May/15

Status: Closed
Project: Core Server
Component/s: MapReduce, Sharding
Affects Version/s: 3.1.1
Fix Version/s: 3.1.3

Type: Bug Priority: Critical - P2
Reporter: Samantha Ritter (Inactive) Assignee: Kaloian Manassiev
Resolution: Done Votes: 0
Labels: UT
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-19947 Deadlock between aggregations with so... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 3 05/15/15
Participants:

 Description   

Several test suites timed out on Linux 64-bit debug:
https://mci.10gen.com/build/mongodb_mongo_master_linux_64_debug_37faabab8c5dcca19c99efa7aeab11d4ca283320_15_04_14_12_30_18

Logs from aggregation_auth:
https://mci.10gen.com/task_log_raw/mongodb_mongo_master_linux_64_debug_37faabab8c5dcca19c99efa7aeab11d4ca283320_15_04_14_12_30_18_aggregation_auth_linux_64_debug/0?type=T

jsCore_auth:
https://mci.10gen.com/task_log_raw/mongodb_mongo_master_linux_64_debug_37faabab8c5dcca19c99efa7aeab11d4ca283320_15_04_14_12_30_18_jsCore_auth_linux_64_debug/0?type=T

jsCore_compatibility_WT:
https://mci.10gen.com/task_log_raw/mongodb_mongo_master_linux_64_debug_37faabab8c5dcca19c99efa7aeab11d4ca283320_15_04_14_12_30_18_jsCore_compatibility_WT_linux_64_debug/0?type=T



 Comments   
Comment by Githook User [ 05/May/15 ]

Author: Kaloian Manassiev (kaloianm) <kaloian.manassiev@mongodb.com>

Message: SERVER-18203 Delay locking for getMore with AggCursor

Otherwise, in the sharded case, the recipient of the network call might in
turn make another call back to the original host and cause distributed
deadlock as evidenced in the associated ticket.
Branch: master
https://github.com/mongodb/mongo/commit/6923cb27e54fdaa10ad33748c472c5cf6e557630
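
For illustration only, a minimal C++ sketch of the "delay locking" idea described in the commit message, under the assumption that the local database lock can simply be acquired after the remote fetch; the names (dbLock, fetchFromRemoteShard, serveGetMore_*) are hypothetical stand-ins and not the actual server code.

// Hypothetical sketch of the "delay locking" pattern; these names are not
// real MongoDB APIs, they only model the idea in the commit message above.
#include <iostream>
#include <mutex>
#include <string>
#include <vector>

std::mutex dbLock;  // stand-in for the per-database lock

// Stand-in for a blocking network getMore against another shard.
std::vector<std::string> fetchFromRemoteShard() {
    return {"doc1", "doc2"};
}

// Deadlock-prone shape: the lock is held across the network call, so the
// remote host cannot call back into this host while we wait for its reply.
void serveGetMore_lockFirst() {
    std::lock_guard<std::mutex> guard(dbLock);
    auto batch = fetchFromRemoteShard();
    std::cout << "returned " << batch.size() << " docs\n";
}

// Fixed shape: fetch from the remote shard first, then take the lock only
// for the local work that actually needs it.
void serveGetMore_delayLock() {
    auto batch = fetchFromRemoteShard();        // no locks held here
    std::lock_guard<std::mutex> guard(dbLock);  // lock taken afterwards
    std::cout << "returned " << batch.size() << " docs\n";
}

int main() {
    serveGetMore_delayLock();
    return 0;
}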

Comment by Kaloian Manassiev [ 14/Apr/15 ]

I think this is a distributed deadlock. Here are the deadlocked threads between the first and second mongod instances in a sharded environment. The problem is that the aggregation framework calls a remote getMore while holding a database lock, and the second mongod instance calls back into the first.

This should be fixed by either yielding the journal lock or, ideally, by not making remote calls while holding a lock.
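
As a rough sketch of the first option (yielding the lock around the remote call), assuming the lock can safely be dropped and reacquired across the network round trip; flushLock and remoteGetMore are hypothetical stand-ins, not server code:

// Hypothetical sketch of yielding a lock around a remote getMore; the names
// here are stand-ins and do not correspond to real MongoDB internals.
#include <mutex>
#include <string>
#include <vector>

std::mutex flushLock;  // stand-in for the MMAPv1 flush/journal lock

// Stand-in for a blocking DBClientCursor::requestMore() to another mongod.
std::vector<std::string> remoteGetMore() {
    return {"doc"};
}

std::vector<std::string> getMoreWithYield() {
    std::unique_lock<std::mutex> lk(flushLock);

    lk.unlock();                  // yield: let the remote host call back in
    auto batch = remoteGetMore();
    lk.lock();                    // reacquire before touching local state

    return batch;
}

int main() {
    return getMoreWithYield().empty() ? 1 : 0;
}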

[2015/04/14 10:50:56.148] Thread 1 (Thread 0x7f91af479710 (LWP 10696)):
 [2015/04/14 10:50:56.148] #0  0x00007f91ae4c0c81 in recv () from /lib64/libpthread.so.0
 [2015/04/14 10:50:56.148] #1  0x00000000007ecf7a in mongo::Socket::_recv(char*, int) ()
 [2015/04/14 10:50:56.148] #2  0x00000000007ecfb9 in mongo::Socket::unsafe_recv(char*, int) ()
 [2015/04/14 10:50:56.148] #3  0x00000000007ed005 in mongo::Socket::recv(char*, int) ()
 [2015/04/14 10:50:56.148] #4  0x00000000007e73f9 in mongo::MessagingPort::recv(mongo::Message&) ()
 [2015/04/14 10:50:56.148] #5  0x00000000007e806f in mongo::MessagingPort::recv(mongo::Message const&, mongo::Message&) ()
 [2015/04/14 10:50:56.148] #6  0x00000000006b7299 in mongo::DBClientConnection::call(mongo::Message&, mongo::Message&, bool, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*) ()
 [2015/04/14 10:50:56.149] #7  0x00000000006c9034 in mongo::DBClientCursor::requestMore() ()
 [2015/04/14 10:50:56.149] #8  0x00000000006c9657 in mongo::DBClientCursor::more() ()
 [2015/04/14 10:50:56.149] #9  0x0000000000785e87 in mongo::internalCursorHasNext(mongo::V8Scope*, v8::Arguments const&) ()
 [2015/04/14 10:50:56.149] #10 0x0000000000774ed8 in mongo::V8Scope::v8Callback(v8::Arguments const&) ()
 [2015/04/14 10:50:56.149] #11 0x000022e704c148f9 in ?? ()
 [2015/04/14 10:50:56.149] #12 0x00007fff9e024c88 in ?? ()
 [2015/04/14 10:50:56.149] #13 0x00007fff9e024c88 in ?? ()
 [2015/04/14 10:50:56.149] #14 0x0000000000000000 in ?? ()
 
 
========================================
 
 
[2015/04/14 10:50:56.330] Thread 8 (Thread 0x46af6940 (LWP 10716)):
 [2015/04/14 10:50:56.330] #0  0x00007f8a819bbc81 in recv () from /lib64/libpthread.so.0
 [2015/04/14 10:50:56.331] #1  0x0000000000f06eda in mongo::Socket::_recv(char*, int) ()
 [2015/04/14 10:50:56.331] #2  0x0000000000f06f19 in mongo::Socket::unsafe_recv(char*, int) ()
 [2015/04/14 10:50:56.331] #3  0x0000000000f06f65 in mongo::Socket::recv(char*, int) ()
 [2015/04/14 10:50:56.331] #4  0x0000000000efd329 in mongo::MessagingPort::recv(mongo::Message&) ()
 [2015/04/14 10:50:56.331] #5  0x0000000000efee47 in mongo::PortMessageServer::handleIncomingMsg(void*)
 [2015/04/14 10:50:56.331]     ()
 [2015/04/14 10:50:56.331] #6  0x00007f8a819b473d in start_thread () from /lib64/libpthread.so.0
 [2015/04/14 10:50:56.331] #7  0x00007f8a8172ad1d in clone () from /lib64/libc.so.6
 [2015/04/14 10:50:56.331] Thread 7 (Thread 0x41a6a940 (LWP 10778)):
 [2015/04/14 10:50:56.331] #0  0x00007f8a819bbc81 in recv () from /lib64/libpthread.so.0
 [2015/04/14 10:50:56.331] #1  0x0000000000f06eda in mongo::Socket::_recv(char*, int) ()
 [2015/04/14 10:50:56.331] #2  0x0000000000f06f19 in mongo::Socket::unsafe_recv(char*, int) ()
 [2015/04/14 10:50:56.331] #3  0x0000000000f06f65 in mongo::Socket::recv(char*, int) ()
 [2015/04/14 10:50:56.331] #4  0x0000000000efd329 in mongo::MessagingPort::recv(mongo::Message&) ()
 [2015/04/14 10:50:56.331] #5  0x0000000000efdf9f in mongo::MessagingPort::recv(mongo::Message const&, mongo::Message&) ()
 [2015/04/14 10:50:56.331] #6  0x00000000008712f9 in mongo::DBClientConnection::call(mongo::Message&, mongo::Message&, bool, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*) ()
 [2015/04/14 10:50:56.331] #7  0x0000000000882af4 in mongo::DBClientCursor::requestMore() ()
 [2015/04/14 10:50:56.331] #8  0x0000000000883117 in mongo::DBClientCursor::more() ()
 [2015/04/14 10:50:56.331] #9  0x0000000000b551c0 in mongo::sorter::MergeIterator<mongo::Value, mongo::Document, mongo::DocumentSourceSort::Comparator>::next() ()
 [2015/04/14 10:50:56.331] #10 0x0000000000b51413 in mongo::DocumentSourceSort::getNext() ()
 [2015/04/14 10:50:56.331] #11 0x0000000000a21094 in mongo::PipelineProxyStage::getNextBson() ()
 [2015/04/14 10:50:56.331] #12 0x0000000000a214be in mongo::PipelineProxyStage::isEOF() ()
 [2015/04/14 10:50:56.331] #13 0x0000000000b9ff24 in mongo::shouldSaveCursorGetMore(mongo::PlanExecutor::ExecState, mongo::PlanExecutor*, bool) ()
 [2015/04/14 10:50:56.331] #14 0x0000000000ba0680 in mongo::getMore(mongo::OperationContext*, char const*, int, long long, mongo::CurOp&, int, bool&, bool*) ()
 [2015/04/14 10:50:56.331] #15 0x0000000000ab0cf5 in mongo::receivedGetMore(mongo::OperationContext*, mongo::DbResponse&, mongo::Message&, mongo::CurOp&) ()
 [2015/04/14 10:50:56.331] #16 0x0000000000ab583b in mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&) ()
 [2015/04/14 10:50:56.331] #17 0x000000000081df85 in mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*, mongo::LastError*) ()
 [2015/04/14 10:50:56.331] #18 0x0000000000efee6b in mongo::PortMessageServer::handleIncomingMsg(void*)
 
[2015/04/14 10:50:56.332] Thread 2 (Thread 0x4807b940 (LWP 11041)):
 [2015/04/14 10:50:56.332] #0  0x00007f8a819b9150 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
 [2015/04/14 10:50:56.332]    from /lib64/libpthread.so.0
 [2015/04/14 10:50:56.332] #1  0x00000000009cc88a in mongo::CondVarLockGrantNotification::wait(unsigned int) ()
 [2015/04/14 10:50:56.332] #2  0x00000000009d3699 in mongo::LockerImpl<true>::lockComplete(mongo::ResourceId, mongo::LockMode, unsigned int, bool) ()
 [2015/04/14 10:50:56.332] #3  0x00000000009d3dcf in mongo::LockerImpl<true>::lock(mongo::ResourceId, mongo::LockMode, unsigned int, bool) ()
 [2015/04/14 10:50:56.332] #4  0x00000000009cf667 in mongo::LockerImpl<true>::lockMMAPV1Flush() ()
 [2015/04/14 10:50:56.332] #5  0x00000000009c2c6a in mongo::Lock::DBLock::DBLock(mongo::Locker*, mongo::StringData, mongo::LockMode) ()
 [2015/04/14 10:50:56.332] #6  0x0000000000ba0628 in mongo::getMore(mongo::OperationContext*, char const*, int, long long, mongo::CurOp&, int, bool&, bool*) ()
 [2015/04/14 10:50:56.332] #7  0x0000000000ab0cf5 in mongo::receivedGetMore(mongo::OperationContext*, mongo::DbResponse&, mongo::Message&, mongo::CurOp&) ()
 [2015/04/14 10:50:56.333] #8  0x0000000000ab583b in mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&) ()
 [2015/04/14 10:50:56.333] #9  0x000000000081df85 in mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*, mongo::LastError*) ()
 [2015/04/14 10:50:56.333] #10 0x0000000000efee6b in mongo::PortMessageServer::handleIncomingMsg(void*)

Comment by Mark Benvenuto [ 14/Apr/15 ]

The jsCore tests failed due to Python hanging in waitpid(). See BF-764

See the pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0) frame in the stack below.

 [2015/04/14 11:02:53.933] Total Threads: 1
 [2015/04/14 11:02:53.933] Thread 47292757737872
 [2015/04/14 11:02:53.933]   File "buildscripts/smoke.py", line 1444, in <module>
 [2015/04/14 11:02:53.933]     main()
 [2015/04/14 11:02:53.933]   File "buildscripts/smoke.py", line 1428, in main
 [2015/04/14 11:02:53.933]     run_tests(tests)
 [2015/04/14 11:02:53.933]   File "buildscripts/smoke.py", line 795, in run_tests
 [2015/04/14 11:02:53.933]     master.stop()
 [2015/04/14 11:02:53.933]   File "buildscripts/smoke.py", line 330, in stop
 [2015/04/14 11:02:53.933]     self.proc.wait()
 [2015/04/14 11:02:53.933]   File "/usr/lib64/python2.6/subprocess.py", line 1182, in wait
 [2015/04/14 11:02:53.933]     pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
 [2015/04/14 11:02:53.933]   File "/usr/lib64/python2.6/subprocess.py", line 455, in _eintr_retry_call
 [2015/04/14 11:02:53.933]     return func(*args)
 [2015/04/14 11:02:53.933]   File "buildscripts/smoke.py", line 138, in dump_stacks
 [2015/04/14 11:02:53.933]     print "".join(traceback.format_stack(stack))
