Type: Bug
Resolution: Done
Priority: Critical - P2
Fix Version/s: 3.4.0-rc1
Affects Version/s: 3.3.14
Component/s: Replication
Labels: None
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2016-10-10, Sharding 2016-10-31
Linked BF Score: 0
Confidence Status: None
Work Order: 0
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

When replication stepdown interrupts operations, it acquires the Client lock first and then the replication coordinator mutex, as shown in this thread:

 [2016/09/22 23:14:59.994] (Inline Function) --------`-------- mongod!std::lock_guard<std::mutex>::{ctor}+0x10
 [2016/09/22 23:14:59.994] 00000000`0953d9b0 00000001`4052d8d3 mongod!mongo::repl::ReplicationCoordinatorImpl::interrupt+0x2b
 [2016/09/22 23:14:59.994] 00000000`0953da40 00000001`4052d6e7 mongod!mongo::ServiceContext::killOperation+0x1a3
 [2016/09/22 23:15:00.011] 00000000`0953dad0 00000001`404474ae mongod!mongo::ServiceContext::killAllUserOperations+0xb7
 [2016/09/22 23:15:00.011] 00000000`0953db40 00000001`404472a5 mongod!mongo::repl::ReplicationCoordinatorImpl::stepDown_nonBlocking+0x15e
 [2016/09/22 23:15:00.011] 00000000`0953dca0 00000001`404694bc mongod!mongo::repl::ReplicationCoordinatorImpl::stepDown+0x85
 [2016/09/22 23:15:00.011] 00000000`0953dd20 00000001`3ff7cd76 mongod!mongo::repl::CmdReplSetStepDown::run+0x3dc
 [2016/09/22 23:15:00.011] 00000000`0953def0 00000001`3ff78892 mongod!mongo::Command::run+0x646
 [2016/09/22 23:15:00.011] 00000000`0953e450 00000001`404c1770 mongod!mongo::Command::execCommand+0xb22

Threads that are about to wait for read or write concern first hold the replication coordinator mutex and then acquire the Client lock, as shown in this thread:

 [2016/09/22 23:14:59.895] (Inline Function) --------`-------- mongod!std::lock_guard<mongo::Client>::{ctor}+0x13
 [2016/09/22 23:14:59.895] (Inline Function) --------`-------- mongod!mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil::__l2::<lambda_cff863bf3f2e525e5f5bbda6979e07b5>::operator()+0x13
 [2016/09/22 23:14:59.895] 00000000`0eb4c2f0 00000001`4018eaf2 mongod!std::condition_variable::wait<<lambda_cff863bf3f2e525e5f5bbda6979e07b5> >+0x33
 [2016/09/22 23:14:59.895] 00000000`0eb4c320 00000001`4018e864 mongod!mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil+0x1e2
 [2016/09/22 23:14:59.895] 00000000`0eb4c3c0 00000001`404490c6 mongod!mongo::OperationContext::waitForConditionOrInterruptNoAssert+0x54
 [2016/09/22 23:14:59.895] 00000000`0eb4c450 00000001`4077a825 mongod!mongo::repl::ReplicationCoordinatorImpl::waitUntilOpTimeForRead+0x426
 [2016/09/22 23:14:59.895] 00000000`0eb4c710 00000001`4076ff92 mongod!mongo::ShardLocal::_exhaustiveFindOnConfig+0x135
 [2016/09/22 23:14:59.910] 00000000`0eb4ca60 00000001`4071ba16 mongod!mongo::Shard::exhaustiveFindOnConfig+0x122
 [2016/09/22 23:14:59.910] 00000000`0eb4cbb0 00000001`40723a04 mongod!mongo::ShardingCatalogClientImpl::_exhaustiveFindOnConfig+0xd6
 [2016/09/22 23:14:59.910] 00000000`0eb4cd40 00000001`406f0955 mongod!mongo::ShardingCatalogClientImpl::getAllShards+0x134
 [2016/09/22 23:14:59.910] 00000000`0eb4d110 00000001`406d194d mongod!mongo::ClusterStatisticsImpl::getStats+0x75
 [2016/09/22 23:14:59.910] 00000000`0eb4d590 00000001`406c48bc mongod!mongo::BalancerChunkSelectionPolicyImpl::checkMoveAllowed+0x5d
 [2016/09/22 23:14:59.910] 00000000`0eb4d9b0 00000001`404da050 mongod!mongo::Balancer::moveSingleChunk+0x7c
 [2016/09/22 23:14:59.910] 00000000`0eb4db10 00000001`3ff7ce3a mongod!mongo::`anonymous namespace'::ConfigSvrMoveChunkCommand::run+0xc0
 [2016/09/22 23:14:59.910] 00000000`0eb4de30 00000001`3ff78892 mongod!mongo::Command::run+0x70a
 [2016/09/22 23:14:59.910] 00000000`0eb4e390 00000001`404c1770 mongod!mongo::Command::execCommand+0xb22

This lock-order inversion results in a deadlock when replication stepdown tries to interrupt those threads. The ordering was not a problem before this change, which made replication threads use the generic OperationContext interruptible waits.
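
For illustration only, the two acquisition orders can be modelled in a few lines of standalone C++. This is a minimal sketch, not MongoDB code: `clientLock`, `replCoordMutex`, `interruptPath`, `waiterPath` and `runBoth` are hypothetical names standing in for the Client lock, the replication coordinator mutex and the two threads in the traces. Each path names the locks in the order its trace takes them, but acquires them through `std::lock`, which uses a deadlock-avoidance algorithm. That is one generic way to make opposite orderings safe and is an assumption of this sketch, not necessarily the fix applied for this ticket.

```cpp
#include <mutex>
#include <thread>

// Hypothetical stand-ins for the two locks in the traces (not MongoDB objects).
std::mutex clientLock;      // models the Client lock
std::mutex replCoordMutex;  // models the replication coordinator mutex

int interrupts = 0;  // written only by interruptPath
int waits = 0;       // written only by waiterPath

// Models the stepdown thread: Client lock, then coordinator mutex.
void interruptPath() {
    for (int i = 0; i < 10000; ++i) {
        // std::lock acquires both mutexes without risking deadlock,
        // regardless of the order in which they are listed.
        std::lock(clientLock, replCoordMutex);
        std::lock_guard<std::mutex> clientGuard(clientLock, std::adopt_lock);
        std::lock_guard<std::mutex> replGuard(replCoordMutex, std::adopt_lock);
        ++interrupts;
    }
}

// Models a read/write concern waiter: coordinator mutex, then Client lock.
void waiterPath() {
    for (int i = 0; i < 10000; ++i) {
        std::lock(replCoordMutex, clientLock);
        std::lock_guard<std::mutex> replGuard(replCoordMutex, std::adopt_lock);
        std::lock_guard<std::mutex> clientGuard(clientLock, std::adopt_lock);
        ++waits;
    }
}

// Runs both paths concurrently and returns the total number of iterations.
int runBoth() {
    interrupts = 0;
    waits = 0;
    std::thread stepdown(interruptPath);
    std::thread waiter(waiterPath);
    stepdown.join();
    waiter.join();
    return interrupts + waits;
}
```

If each `std::lock` call were replaced by two plain `lock()` calls in the listed order, the two threads could each hold one mutex while waiting for the other, which is the hang described in this ticket.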

is depended on by

SERVER-26339 Stepdown waiters are signaled twice when running stepdown command during shutdown

Closed

is related to

SERVER-26345 Make threads waiting for writeConcern use interruptible wait

Closed

Assignee: Andy Schwerin
Reporter: Kaloian Manassiev
Participants: Andy Schwerin, Eric Milkie, Githook User, Kaloian Manassiev
Votes: 0
Watchers: 6

Created: Sep 23 2016 09:26:35 PM UTC
Updated: Jan 25 2017 09:57:54 PM UTC
Resolved: Oct 13 2016 02:19:47 PM UTC
Confidence Status Last Update: 06/Oct/16 1:56 PM
