[SERVER-28840] replSetSyncFrom causes InitialSyncer and ReplicationCoordinator to acquire each other's mutexes in opposite orders Created: 18/Apr/17  Updated: 27/Oct/23  Resolved: 18/Jun/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Coverity Collector User Assignee: Backlog - Replication Team
Resolution: Gone away Votes: 0
Labels: coverity, neweng, syncSourceSelection
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-28859 Coverity analysis defect 101487: Thre... Closed
is duplicated by SERVER-28886 Coverity analysis defect 101486: Thre... Closed
Related
related to SERVER-31487 Replace replSetSyncFrom resync option... Closed
is related to SERVER-34758 replSetGetStatus can deadlock with in... Closed
is related to SERVER-35372 replSetSyncFrom can cause deadlock be... Closed
Assigned Teams:
Replication
Operating System: ALL
Participants:

 Description   

This issue was originally discovered by the Coverity Static Analysis tool.

Consider the following lock acquisitions in InitialSyncer and ReplicationCoordinatorImpl:

ReplicationCoordinatorImpl::processReplSetSyncFrom
  1. Acquire ReplicationCoordinatorImpl::_mutex
  2. Acquire InitialSyncer::_mutex
InitialSyncer::_multiApplierCallback
  1. Acquire InitialSyncer::_mutex
  2. Acquire ReplicationCoordinatorImpl::_mutex

Because these two functions acquire the same two locks in opposite orders, they can deadlock if they run concurrently. One way to fix this would be to stop InitialSyncer from updating the ReplicationCoordinator's optime on every batch. Alternatively, _multiApplierCallback could call _opts.setLastOpTime without holding its own mutex, since it does not appear necessary to synchronize access to InitialSyncer::_lastApplied after it has been written in that function.

This issue also occurs in InitialSyncer::_getNextApplierBatchCallback, which acquires the InitialSyncer mutex and then tries to acquire the ReplicationCoordinator's mutex when calling _opts.getSlaveDelay().
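A minimal standalone sketch (not MongoDB source) of the second fix suggested above: copy the value that needs to be published while holding only the syncer-side mutex, release it, and only then call into the coordinator, which takes its own mutex. The class names, the OpTime placeholder, and onBatchApplied are simplified stand-ins for illustration; only the lock-ordering idea mirrors the ticket.

#include <mutex>

struct OpTime { long long ts = 0; };  // placeholder for the real optime type

class Coordinator {
public:
    void setMyLastAppliedOpTime(const OpTime& opTime) {
        std::lock_guard<std::mutex> lk(_mutex);  // coordinator mutex, taken on its own
        _lastApplied = opTime;
    }
private:
    std::mutex _mutex;
    OpTime _lastApplied;
};

class Syncer {
public:
    explicit Syncer(Coordinator* coord) : _coord(coord) {}

    // Before the fix, the call into the coordinator ran while _mutex was still
    // held, producing the Syncer -> Coordinator lock order. Here the syncer
    // mutex is released first, so only one lock is ever held at a time.
    void onBatchApplied(const OpTime& opTime) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _lastApplied = opTime;  // update internal state under our own mutex
        }
        _coord->setMyLastAppliedOpTime(opTime);  // coordinator mutex acquired after ours is released
    }
private:
    std::mutex _mutex;
    OpTime _lastApplied;
    Coordinator* _coord;
};

int main() {
    Coordinator coord;
    Syncer syncer(&coord);
    syncer.onBatchApplied(OpTime{42});
}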


Original Coverity Report Message:

Defect 100780 (STATIC_C)
Checker ORDER_REVERSAL (subcategory none)
File: /src/mongo/db/repl/replication_coordinator_impl.cpp
Function mongo::repl::ReplicationCoordinatorImpl::processReplSetSyncFrom(mongo::OperationContext *, const mongo::HostAndPort &, mongo::BSONObjBuilder *)



 Comments   
Comment by William Schultz (Inactive) [ 18/Jun/18 ]

Will be fixed by SERVER-31487.

Comment by Spencer Brody (Inactive) [ 05/Jun/18 ]

This can likely be closed once SERVER-31487 is implemented

Comment by William Schultz (Inactive) [ 06/Jul/17 ]

The issue Coverity picked up on has to do with inconsistent lock acquisition order involving InitialSyncer::_mutex and ReplicationCoordinatorImpl::_mutex, leading to a potential deadlock. Consider the following two functions and the order in which they acquire locks:

ReplicationCoordinatorImpl::processReplSetSyncFrom
  1. Acquire ReplicationCoordinatorImpl::_mutex
  2. Acquire InitialSyncer::_mutex
InitialSyncer::_multiApplierCallback
  1. Acquire InitialSyncer::_mutex
  2. Acquire ReplicationCoordinatorImpl::_mutex

Because these two functions acquire the same two locks in opposite orders, they can deadlock if they run on separate threads and acquire the locks with a specific thread interleaving. The scenario required to actually trigger this deadlock is probably quite unlikely: the replSetSyncFrom command would have to be processed during the execution of _multiApplierCallback, and the threads would have to interleave in exactly the wrong way. To fix this issue, we could review the acquisition order for this pair of locks and see whether we can make it consistent throughout the codebase, avoiding this kind of potential deadlock.

Note: Through some manual testing, I was actually able to trigger this deadlock externally by adding an extra sleep between the _multiApplierCallback's acquisition of the InitialSyncer::_mutex and the ReplicationCoordinatorImpl::_mutex and then running an initial sync and calling the replSetSyncFrom command at the correct time during its execution.
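A standalone analogue (not MongoDB code) of that manual reproduction: two threads take two mutexes in opposite orders, and a sleep between the two acquisitions widens the race window, just as the injected sleep did in the test above. std::timed_mutex::try_lock_for is used for the second acquisition so the program reports the inversion and exits instead of hanging; the mutex and thread names are illustrative placeholders.

#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

std::timed_mutex coordinatorMutex;  // stands in for ReplicationCoordinatorImpl::_mutex
std::timed_mutex syncerMutex;       // stands in for InitialSyncer::_mutex

void lockInOrder(std::timed_mutex& first, std::timed_mutex& second, const char* name) {
    std::unique_lock<std::timed_mutex> firstLk(first);
    // Sleep while holding the first mutex, mimicking the extra sleep that made
    // the deadlock reproducible in manual testing.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::unique_lock<std::timed_mutex> secondLk(second, std::defer_lock);
    if (!secondLk.try_lock_for(std::chrono::seconds(1))) {
        std::cout << name << ": second mutex unavailable; the real code would deadlock here\n";
    }
}

int main() {
    // Thread A mirrors processReplSetSyncFrom: coordinator mutex, then syncer mutex.
    std::thread a(lockInOrder, std::ref(coordinatorMutex), std::ref(syncerMutex),
                  "processReplSetSyncFrom");
    // Thread B mirrors _multiApplierCallback: syncer mutex, then coordinator mutex.
    std::thread b(lockInOrder, std::ref(syncerMutex), std::ref(coordinatorMutex),
                  "_multiApplierCallback");
    a.join();
    b.join();
}

With a consistent acquisition order (or by never holding both mutexes at once, as in the fix sketched in the description), the cycle between the two locks disappears and neither thread can block the other indefinitely.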
