[SERVER-15880] parallel test suite timeout due to deadlock involving the DB catalog entry mutex Created: 30/Oct/14  Updated: 11/Jul/16  Resolved: 03/Nov/14

Status: Closed
Project: Core Server
Component/s: Storage, Testing Infrastructure
Affects Version/s: 2.7.8
Fix Version/s: 2.8.0-rc0

Type: Bug Priority: Critical - P2
Reporter: Kaloian Manassiev Assignee: Kaloian Manassiev
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-15888 Implicit DB creations on insert and u... Closed
Operating System: ALL
Participants:

 Description   

 [2014/10/30 03:52:53.853] Thread 7 (Thread 0x41097940 (LWP 15153)):
 [2014/10/30 03:52:53.853] #0  0x00007fd9b2976150 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
 [2014/10/30 03:52:53.853]    from /lib64/libpthread.so.0
 [2014/10/30 03:52:53.853] #1  0x00000000009dd79a in boost::condition_variable::timed_wait(boost::unique_lock<boost::mutex>&, boost::posix_time::ptime const&) ()
 [2014/10/30 03:52:53.853] #2  0x00000000009d609d in mongo::CondVarLockGrantNotification::wait(unsigned int) ()
 [2014/10/30 03:52:53.853] #3  0x00000000009ddb73 in mongo::LockerImpl<true>::lock(mongo::ResourceId const&, mongo::LockMode, unsigned int, bool) ()
 [2014/10/30 03:52:53.853] #4  0x00000000009d9190 in mongo::LockerImpl<true>::endWriteUnitOfWork() ()
 [2014/10/30 03:52:53.853] #5  0x000000000092b08c in mongo::WriteUnitOfWork::commit() ()
 [2014/10/30 03:52:53.853] #6  0x0000000000d6eff8 in mongo::MMAPV1DatabaseCatalogEntry::MMAPV1DatabaseCatalogEntry(mongo::OperationContext*, mongo::StringData const&, mongo::StringData const&, bool, bool) ()
 [2014/10/30 03:52:53.853] #7  0x0000000000d731ef in mongo::MMAPV1Engine::getDatabaseCatalogEntry(mongo::OperationContext*, mongo::StringData const&) ()
 [2014/10/30 03:52:53.853] #8  0x00000000009371a4 in mongo::DatabaseHolder::openDb(mongo::OperationContext*, mongo::StringData const&, bool*) ()
 [2014/10/30 03:52:53.853] #9  0x0000000000943055 in mongo::Client::Context::_finishInit() ()
 [2014/10/30 03:52:53.853] #10 0x0000000000943111 in mongo::Client::Context::Context(mongo::OperationContext*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) ()
 [2014/10/30 03:52:53.853] #11 0x0000000000abea48 in mongo::receivedInsert(mongo::OperationContext*, mongo::Message&, mongo::CurOp&) ()
 [2014/10/30 03:52:53.853] #12 0x0000000000ac09d0 in mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&, bool) ()
 [2014/10/30 03:52:53.853] #13 0x000000000081b722 in mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*, mongo::LastError*) ()
 [2014/10/30 03:52:53.853] #14 0x0000000000f12641 in mongo::PortMessageServer::handleIncomingMsg(void*)
 [2014/10/30 03:52:53.853]     ()
 
 [2014/10/30 03:52:53.853] Thread 6 (Thread 0x47a15940 (LWP 15155)):
 [2014/10/30 03:52:53.853] #0  0x00007fd9b29784c4 in __lll_lock_wait () from /lib64/libpthread.so.0
 [2014/10/30 03:52:53.853] #1  0x00007fd9b2973e1a in _L_lock_1034 () from /lib64/libpthread.so.0
 [2014/10/30 03:52:53.853] #2  0x00007fd9b2973cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
 [2014/10/30 03:52:53.853] #3  0x00000000008a7480 in boost::unique_lock<boost::mutex>::lock() ()
 [2014/10/30 03:52:53.853] #4  0x0000000000d730ff in mongo::MMAPV1Engine::getDatabaseCatalogEntry(mongo::OperationContext*, mongo::StringData const&) ()
 [2014/10/30 03:52:53.853] #5  0x00000000009371a4 in mongo::DatabaseHolder::openDb(mongo::OperationContext*, mongo::StringData const&, bool*) ()
 [2014/10/30 03:52:53.853] #6  0x0000000000943055 in mongo::Client::Context::_finishInit() ()
 [2014/10/30 03:52:53.853] #7  0x0000000000943111 in mongo::Client::Context::Context(mongo::OperationContext*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) ()
 [2014/10/30 03:52:53.853] #8  0x0000000000abea48 in mongo::receivedInsert(mongo::OperationContext*, mongo::Message&, mongo::CurOp&) ()
 [2014/10/30 03:52:53.853] #9  0x0000000000ac09d0 in mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&, bool) ()
 [2014/10/30 03:52:53.853] #10 0x000000000081b722 in mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*, mongo::LastError*) ()
 [2014/10/30 03:52:53.853] #11 0x0000000000f12641 in mongo::PortMessageServer::handleIncomingMsg(void*)
 [2014/10/30 03:52:53.853]     ()
 
[2014/10/30 03:52:53.850] Thread 19 (Thread 0x42f85940 (LWP 13870)):
 [2014/10/30 03:52:53.850] #0  0x00007fd9b2976150 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
 [2014/10/30 03:52:53.850]    from /lib64/libpthread.so.0
 [2014/10/30 03:52:53.850] #1  0x00000000009dd79a in boost::condition_variable::timed_wait(boost::unique_lock<boost::mutex>&, boost::posix_time::ptime const&) ()
 [2014/10/30 03:52:53.850] #2  0x00000000009d609d in mongo::CondVarLockGrantNotification::wait(unsigned int) ()
 [2014/10/30 03:52:53.850] #3  0x00000000009ddb73 in mongo::LockerImpl<true>::lock(mongo::ResourceId const&, mongo::LockMode, unsigned int, bool) ()
 [2014/10/30 03:52:53.850] #4  0x00000000009d588c in mongo::AutoAcquireFlushLockForMMAPV1Commit::AutoAcquireFlushLockForMMAPV1Commit(mongo::Locker*) ()
 [2014/10/30 03:52:53.850] #5  0x0000000000d4b4aa in ?? ()
 [2014/10/30 03:52:53.850] #6  0x0000000000f9fda4 in ?? ()

Thread 6: Holds Global (IX), Flush (IX), DB1 (X) and calls into MMAPV1Engine::getDatabaseCatalogEntry. This blocks on the DB catalog entry mutex, which is held by Thread 7.

Flush (Thread 19): Tries to acquire Flush (S) and blocks behind Thread 6.

Thread 7: Holds Global(IX), DB2 (X), DB catalog entry mutex and tries to commit. Commit tries to acquire Flush (IX) back and blocks, because the flush thread is ahead of it in the queue for the Flush lock.

This is a deadlock, which is not discoverable by the deadlock detector, because the DB catalog entry mutex is not a managed lock resource.

We need to ensure that commits are not happening under this lock or that the end of write unit of work does not try to acquire the flush lock again (which might be problematic to do).



 Comments   
Comment by Githook User [ 03/Nov/14 ]

Author:

{u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}

Message: SERVER-15880 MMAP V1 should not create DB under the catalog mutex

DB creation should happen outside of the mutex and only the insertion
should be protected. See the assigned ticket for more details.
Branch: master
https://github.com/mongodb/mongo/commit/9a1d9a2ccea25f8d221408e136bae96906b5c7f0

Comment by Githook User [ 30/Oct/14 ]

Author:

{u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}

Message: SERVER-15880 Delete path should not try to create database
Branch: master
https://github.com/mongodb/mongo/commit/a24843c49eee8cf18c11015cb7411ea1f06aeb10

Comment by Githook User [ 30/Oct/14 ]

Author:

{u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}

Message: SERVER-15880 AutoGetCollectionForRead should always free its collection lock
Branch: master
https://github.com/mongodb/mongo/commit/2008dfd2239f342e0766b60ba7ec65cefdd83cce

Generated at Thu Feb 08 03:39:16 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.