[SERVER-55373] Tenant migration recipient can deadlock when there is failover Created: 19/Mar/21  Updated: 27/Oct/23  Resolved: 24/Mar/21

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Cheahuychou Mao Assignee: Backlog - Replication Team
Resolution: Gone away Votes: 0
Labels: pm-1791_non-cloud-blocking
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-52713 [testing] Add stepdown/kill/terminate... Closed
Assigned Teams:
Replication
Operating System: ALL
Participants:

 Description   

Based on one of the patch builds for SERVER-52713, the deadlock appears to involve TenantMigrationRecipientService::Instance::_cancelRemainingWork and the ReplicationCoordinator. 

EVG Task 1

EVG Task 2
 



 Comments   
Comment by Jason Chan [ 23/Mar/21 ]

lingzhi.deng, correct, I don't expect the until-block to run inline after the revert of SERVER-54735.

Comment by Lingzhi Deng [ 23/Mar/21 ]

Looks like the TenantOplogApplier deadlocks with itself on shutdown.

#41 0x00007f72fbc6e924 in mongo::future_details::SharedStateBase::transitionToFinished (this=0x5613ffeca180) at src/mongo/util/future_impl.h:456
#42 0x00007f72f314afcf in mongo::future_details::SharedStateBase::setError (statusArg=..., this=0x5613ffeca180) at src/mongo/util/future_impl.h:467
#43 mongo::SharedPromise<mongo::repl::TenantOplogApplier::OpTimePair>::setError (status=..., this=0x561403aa4d70) at src/mongo/util/future.h:1139
#44 mongo::repl::TenantOplogApplier::_finishShutdown (this=0x5613ffda68d0, lk=..., status=...) at src/mongo/db/repl/tenant_oplog_applier.cpp:240
#45 0x00007f72f3138855 in mongo::repl::TenantOplogApplier::_doShutdown_inlock (this=0x5613ffda68d0) at src/mongo/util/concurrency/with_lock.h:100
#46 mongo::repl::TenantOplogApplier::_doShutdown_inlock (this=0x5613ffda68d0) at src/mongo/db/repl/tenant_oplog_applier.cpp:141
#47 0x00007f72f7798a80 in mongo::repl::AbstractAsyncComponent::shutdown (this=this@entry=0x5613ffda68d0) at src/mongo/db/repl/abstract_async_component.cpp:115
#48 0x00007f72f314c9be in mongo::repl::TenantOplogApplier::_shouldStopApplying (this=0x5613ffda68d0, status=Status(InterruptedDueToReplStateChange, "operation was interrupted")) at src/mongo/db/repl/tenant_oplog_applier.cpp:222
#49 0x00007f72f314f230 in mongo::repl::TenantOplogApplier::_applyLoop (this=0x5613ffda68d0, batch=...) at /opt/mongodbtoolchain/revisions/32eb70c47bd9e9759dd05654843feb80461aaef3/stow/gcc-v3.P9L/include/c++/8.3.0/bits/atomic_base.h:512

Fulfilling a TenantOplogApplier promise runs the continuation inline, and that continuation in turn tries to shut down the TenantOplogApplier, which blocks on the mutex the first shutdown already holds.

#5  mongo::latch_detail::Mutex::lock (this=0x5613ffda69a0) at src/mongo/platform/mutex.cpp:66
#6  0x00007f72f7798a3c in std::lock_guard<mongo::latch_detail::Latch>::lock_guard (__m=..., this=<synthetic pointer>) at /opt/mongodbtoolchain/revisions/32eb70c47bd9e9759dd05654843feb80461aaef3/stow/gcc-v3.P9L/include/c++/8.3.0/bits/std_mutex.h:161
#7  mongo::repl::AbstractAsyncComponent::shutdown (this=0x5613ffda68d0) at src/mongo/db/repl/abstract_async_component.cpp:100
#8  0x00007f72f8c52d28 in mongo::repl::(anonymous namespace)::shutdownTarget<std::shared_ptr<mongo::repl::TenantOplogApplier> > (lk=..., target=std::shared_ptr<mongo::repl::TenantOplogApplier> (use count 3, weak count 1) = {...}) at src/mongo/db/repl/tenant_migration_recipient_service.cpp:1452
#9  mongo::repl::TenantMigrationRecipientService::Instance::_cancelRemainingWork (this=0x5614048be010, lk=...) at src/mongo/db/repl/tenant_migration_recipient_service.cpp:1558
#10 0x00007f72f8c4f538 in mongo::repl::TenantMigrationRecipientService::Instance::<lambda(mongo::Status)>::operator() (__closure=0x561403512e60, status=...) at src/mongo/util/invariant.h:66
#11 mongo::future_util_details::AsyncTryUntilWithDelay<mongo::repl::TenantMigrationRecipientService::Instance::run(std::shared_ptr<mongo::executor::ScopedTaskExecutor>, const mongo::CancelationToken&)::<lambda()>, mongo::repl::TenantMigrationRecipientService::Instance::run(std::shared_ptr<mongo::executor::ScopedTaskExecutor>, const mongo::CancelationToken&)::<lambda(mongo::Status)>, 
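
Putting the two traces together: `shutdown` takes the component mutex, `_finishShutdown` fulfills the `SharedPromise` with the mutex still held, the promise runs its continuation inline on the same thread, and the continuation (`_cancelRemainingWork` → `shutdownTarget` → `shutdown`) blocks trying to re-acquire that mutex. A minimal standalone sketch of the pattern (hypothetical class and member names, not the actual MongoDB types; it uses owner-thread tracking so the re-entrant acquisition is detected instead of hanging the demo):

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for TenantOplogApplier. shutdown() plays the role of
// AbstractAsyncComponent::shutdown -> _doShutdown_inlock -> _finishShutdown:
// it takes the component "mutex", then fulfills the promise, which runs every
// registered continuation inline with the mutex still held. The real code
// would deadlock at the re-entrant lock; here we just record it.
class InlinePromiseApplier {
public:
    bool sawReentrantLock = false;

    // Analogue of attaching the AsyncTry until-block continuation.
    void onCompletion(std::function<void()> cb) {
        _continuations.push_back(std::move(cb));
    }

    void shutdown() {
        if (_owner == std::this_thread::get_id()) {
            sawReentrantLock = true;           // real code blocks forever here
            return;
        }
        _owner = std::this_thread::get_id();   // "lock" the component mutex
        for (auto& cb : _continuations)
            cb();                              // SharedPromise::setError runs these inline
        _continuations.clear();
        _owner = std::thread::id();            // "unlock"
    }

private:
    std::thread::id _owner;                    // owner-tracking stand-in for the latch
    std::vector<std::function<void()>> _continuations;
};
```

The continuation that tears the component down again is exactly the shape of the second trace: registering `onCompletion([&]{ applier.shutdown(); })` and then calling `applier.shutdown()` trips the re-entrant lock on the same thread.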

I think this was caused by SERVER-54735, which was reverted after the base of this patch build. Essentially this is the same problem as SERVER-55205, so I believe it is fixed now. cheahuychou.mao, have you seen this recently?

jason.chan Can you confirm that after the revert of SERVER-54735, we no longer call the until-block of the AsyncTry inline?
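
For reference, what scheduling the until-block through the executor (rather than running it inline) buys you can be sketched as follows. This is a hypothetical illustration, not the actual MongoDB executor API: the promise side only enqueues continuations, and they run after the mutex has been released, so a continuation that calls back into `shutdown()` acquires the lock cleanly.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <thread>

// Hypothetical variant of the component above: fulfilling the promise hands
// continuations to an executor queue instead of invoking them inline, so no
// continuation ever runs while the component mutex is held.
class DeferredPromiseApplier {
public:
    explicit DeferredPromiseApplier(std::queue<std::function<void()>>& executor)
        : _executor(executor) {}

    void onCompletion(std::function<void()> cb) {
        _continuations.push(std::move(cb));
    }

    void shutdown() {
        assert(_owner != std::this_thread::get_id());  // no re-entrancy by construction
        _owner = std::this_thread::get_id();           // "lock"
        _shutDown = true;
        while (!_continuations.empty()) {              // hand off, don't run inline
            _executor.push(std::move(_continuations.front()));
            _continuations.pop();
        }
        _owner = std::thread::id();                    // "unlock" before anything runs
    }

    bool isShutDown() const { return _shutDown; }

private:
    std::thread::id _owner;                            // owner-tracking stand-in for the latch
    bool _shutDown = false;
    std::queue<std::function<void()>>& _executor;
    std::queue<std::function<void()>> _continuations;
};
```

With the same self-shutdown continuation as before, draining the executor queue after `shutdown()` returns runs the continuation with no lock held, and the second `shutdown()` call succeeds instead of deadlocking.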

Generated at Thu Feb 08 05:36:17 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.