
Type: Bug
Resolution: Duplicate
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 3.6.1
Component/s: Replication, Stability
Labels:
- neweng
- shutdown

Assigned Teams: Replication
Operating System: ALL
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Note: this deadlock is similar to ~~SERVER-28688~~, but it is a different one.
Note: I observed this deadlock in 3.6.1.

ReplicationCoordinatorExternalStateImpl::shutdown calls _taskExecutor->join() while holding _threadMutex. In most cases the worker threads have no tasks and _taskExecutor->join() returns immediately. In rare situations, however, DropPendingCollectionReaper has collections to drop, and the signal processing thread keeps _threadMutex locked while those tasks run. If the replication logic decides to step down at that moment, a deadlock results, because ReplicationCoordinatorExternalStateImpl::startProducerIfStopped tries to acquire _threadMutex while holding the global exclusive lock. Once startProducerIfStopped starts waiting for _threadMutex, the drop collection tasks are in turn blocked on the global lock, so join() never returns.
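The lock ordering described above can be avoided by not holding the mutex across the join. The following is only a minimal sketch of that pattern and not the server's code: ToyExecutor, ToyExternalState, globalLock and runScenario are hypothetical stand-ins, and std::mutex and std::thread replace the real executor and lock manager.

```cpp
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for the task executor: a single worker thread.
class ToyExecutor {
public:
    explicit ToyExecutor(std::function<void()> task) : _worker(std::move(task)) {}
    void join() {
        if (_worker.joinable()) _worker.join();
    }

private:
    std::thread _worker;
};

// Hypothetical stand-in for ReplicationCoordinatorExternalStateImpl.
class ToyExternalState {
public:
    void startExecutor(std::function<void()> task) {
        std::lock_guard<std::mutex> lk(_threadMutex);
        _taskExecutor = std::make_unique<ToyExecutor>(std::move(task));
    }

    // Take ownership of the executor under the mutex, then join without it.
    void shutdown() {
        std::unique_ptr<ToyExecutor> executor;
        {
            std::lock_guard<std::mutex> lk(_threadMutex);
            executor = std::move(_taskExecutor);
        }
        if (executor) executor->join();  // _threadMutex is not held here
    }

    void startProducerIfStopped() {
        std::lock_guard<std::mutex> lk(_threadMutex);
        ++_producerStarts;
    }

    int producerStarts() {
        std::lock_guard<std::mutex> lk(_threadMutex);
        return _producerStarts;
    }

private:
    std::mutex _threadMutex;
    std::unique_ptr<ToyExecutor> _taskExecutor;
    int _producerStarts = 0;
};

// Runs shutdown, a step-down path and a drop task concurrently.
// Returns the number of times the producer was started.
int runScenario() {
    std::mutex globalLock;  // stand-in for the global exclusive lock
    ToyExternalState state;

    // Step-down path: holds the global lock before it needs _threadMutex.
    std::unique_lock<std::mutex> stepDownLock(globalLock);

    // Drop task: needs the global lock, so it blocks until step-down is done.
    state.startExecutor([&globalLock] { std::lock_guard<std::mutex> lk(globalLock); });

    // Shutdown path: joins the worker, but without holding _threadMutex.
    std::thread shutdownThread([&state] { state.shutdown(); });

    state.startProducerIfStopped();  // can acquire _threadMutex, so no cycle forms
    stepDownLock.unlock();           // the drop task can now finish
    shutdownThread.join();
    return state.producerStarts();
}
```

If shutdown() in this sketch kept the lock_guard alive across executor->join(), the same scenario could hang in the way the description reports: the step-down path would wait for _threadMutex, the drop task for the global lock, and the join for the drop task.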

The attached file contains the output of the mongodb-waitsfor-graph, mongodb-show-locks, and mongodb-uniqstack commands. In this file:

thread 2 (signalProcessingThread) owns the _threadMutex lock (acquired in ReplicationCoordinatorExternalStateImpl::shutdown)
and waits for the worker threads to shut down (_taskExecutor->shutdown(); _taskExecutor->join())

thread 47 ("replexec-9") waits for _threadMutex (owned by thread 2)
is processing the _stepDownFinish event
which calls _updateMemberStateFromTopologyCoordinator_inlock
which calls startProducerIfStopped
which tries to acquire _threadMutex

thread 48 (worker thread executing dropCollection task)
waits for global lock owned by thread 47
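
The three threads above form a cycle in the waits-for graph. As an illustration only, the sketch below encodes the reported relationships as a map from a waiting thread to the thread it waits for and follows the chain. The function names are hypothetical, and the edge from thread 2 to thread 48 assumes that the join is waiting on the worker executing the drop task, as the report describes.

```cpp
#include <map>
#include <set>

// Returns true if following "waits for" edges from `start` leads back to `start`.
// Each thread is assumed to wait for at most one other thread.
bool waitsOnItself(const std::map<int, int>& waitsFor, int start) {
    std::set<int> seen;
    int current = start;
    while (true) {
        auto it = waitsFor.find(current);
        if (it == waitsFor.end()) return false;  // chain ends: no cycle
        current = it->second;
        if (current == start) return true;       // came back to the start
        if (!seen.insert(current).second) return false;  // cycle that excludes start
    }
}

// Waits-for edges taken from the thread list in this report.
std::map<int, int> reportedWaitsFor() {
    return {
        {2, 48},   // thread 2 joins the worker running the drop task
        {47, 2},   // thread 47 waits for _threadMutex, owned by thread 2
        {48, 47},  // thread 48 waits for the global lock, owned by thread 47
    };
}
```

Removing any one edge, for example by releasing _threadMutex before the join so that thread 47 no longer waits for thread 2, breaks the cycle.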


bugrep.txt
86 kB
Feb 08 2018 12:55:55 PM UTC

duplicates

SERVER-36873 ReplicationCoordinatorExternalStateImpl::shutdown() must not hold _threadMutex while waiting for _taskExecutor

Closed

related to

SERVER-28688 Deadlock between shutdown and stepdown

Closed

Assignee: [DO NOT USE] Backlog - Replication Team
Reporter: Igor Solodovnikov
Participants: [DO NOT USE] Backlog - Replication Team, Benety Goh, Igor Solodovnikov
Votes: 0
Watchers: 9

Created: Feb 08 2018 12:56:06 PM UTC
Updated: Dec 06 2022 03:38:42 AM UTC
Resolved: Jan 28 2019 09:19:51 PM UTC
