Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 3.6.24
Affects Version/s: 4.0.0, 4.0.1, 3.6.20, 3.6.22
Component/s: Concurrency, Replication
Labels:
None

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Steps To Reproduce:

I've pushed the scripts I used to reproduce this issue here: https://github.com/jmpesp/mongo_3.6.20_concurrency_bug_repro. All that is required is to perform a replSetReconfig during an operation of any type (the linked repo uses a simple UPDATE).

Sprint:
Repl 2021-03-22, Repl 2021-04-05
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

The following conditions create a situation where an operation hangs indefinitely:

1. During a replSetReconfig, the variable _currentCommittedSnapshot is set to boost::none in ReplicationCoordinatorImpl::_dropAllSnapshots_inlock.
2. ReplicationCoordinatorImpl::_doneWaitingForReplication_inlock checks !_currentCommittedSnapshot and returns false when that condition holds, that is, when no committed snapshot is set.
3. During a replSetReconfig, ReplicationCoordinatorImpl::_wakeReadyWaiters_inlock is called, which signals and removes ThreadWaiters if they are done waiting for replication.

The bug occurs when the thread performing the replSetReconfig removes a ThreadWaiter from the waiter list because _doneWaitingForReplication_inlock returns true, and that same thread then nulls out _currentCommittedSnapshot. When the operation thread wakes up and calls _doneWaitingForReplication_inlock itself, the call returns false, so the operation goes back to waiting. Its ThreadWaiter has already been removed from the list of waiters, so it can never be signaled again.
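
The interleaving above can be illustrated with a minimal single-threaded C++ sketch. This is not MongoDB source code: CoordinatorModel, RaceOutcome, and simulateReconfigRace are hypothetical names, the readiness check is reduced to "a committed snapshot exists", and the two threads are modeled as consecutive steps taken under the same lock.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical model of the coordinator state described in the report.
struct CoordinatorModel {
    bool hasCommittedSnapshot = false;  // stands in for _currentCommittedSnapshot
    std::vector<int> waiters;           // ids of registered ThreadWaiters
    std::vector<int> signaled;          // ids that were signaled at least once

    // Condition 2: never "done" while there is no committed snapshot.
    bool doneWaitingForReplication() const { return hasCommittedSnapshot; }

    // Condition 3: signal and remove every waiter that is done waiting.
    void wakeReadyWaiters() {
        if (!doneWaitingForReplication()) return;
        signaled.insert(signaled.end(), waiters.begin(), waiters.end());
        waiters.clear();
    }

    // Condition 1: the reconfig drops all snapshots.
    void dropAllSnapshots() { hasCommittedSnapshot = false; }

    bool isRegistered(int id) const {
        return std::find(waiters.begin(), waiters.end(), id) != waiters.end();
    }
};

struct RaceOutcome {
    bool wasSignaled;      // the reconfig thread signaled the waiter
    bool doneAfterWake;    // what the operation thread sees when it re-checks
    bool stillRegistered;  // whether anyone could ever signal it again
};

RaceOutcome simulateReconfigRace() {
    CoordinatorModel c;
    const int opWaiter = 1;
    c.hasCommittedSnapshot = true;
    c.waiters.push_back(opWaiter);
    assert(c.isRegistered(opWaiter));

    // Reconfig thread: wakes the ready waiter, then drops the snapshots.
    c.wakeReadyWaiters();
    c.dropAllSnapshots();

    // Operation thread: wakes up and re-checks the condition.
    RaceOutcome r;
    r.wasSignaled =
        std::find(c.signaled.begin(), c.signaled.end(), opWaiter) != c.signaled.end();
    r.doneAfterWake = c.doneWaitingForReplication();
    r.stillRegistered = c.isRegistered(opWaiter);
    return r;
}
```

In this model the waiter ends up signaled, not done, and no longer registered, which is the state from which the real operation would wait forever.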

I've detailed this problem in a blog post on my company's engineering blog: https://engineering.vena.io/2021/02/19/what-to-do-when-mongo-3-6-wont-return-your-calls/

I'm opening this against 3.6.20 because the support policy (https://www.mongodb.com/support-policy) shows support extending until April 2021.

I've tested the following patch, and it seems to fix it:

diff --git a/src/mongo/db/repl/replication_coordinator_impl.cpp b/src/mongo/db/repl/replication_coordinator_impl.cpp
index a6a4d0084b..b88b882046 100644
--- a/src/mongo/db/repl/replication_coordinator_impl.cpp
+++ b/src/mongo/db/repl/replication_coordinator_impl.cpp
@@ -1707,6 +1707,17 @@ Status ReplicationCoordinatorImpl::_awaitReplication_inlock(
         if (!stepdownStatus.isOK()) {
             return stepdownStatus;
         }
+
+        // If a replSetReconfig occurred, then all snapshots will be dropped.
+        // `_doneWaitingForReplication_inlock` will fail if there is no current snapshot, and
+        // if this thread's waiter was signaled and removed from the wait list during
+        // replSetReconfig we will enter waitForConditionOrInterruptNoAssertUntil above and
+        // condVar will never be notified.
+        //
+        // If it's null, wait for newly committed snapshot here.
+        while (!_currentCommittedSnapshot) {
+            opCtx->waitForConditionOrInterrupt(_currentCommittedSnapshotCond, *lock);
+        }
     }
 
     return _checkIfWriteConcernCanBeSatisfied_inlock(writeConcern);

But I also believe that cherry picking https://github.com/mongodb/mongo/commit/fe1b92cee5c133e82845ffbd31b25ab5b66084d3 would fix this issue as well (note I haven't tested this). I was able to reproduce this issue on versions 4.0.0 and 4.0.1, but not 4.0.2, and that commit exists between 4.0.1 and 4.0.2.
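
For readers unfamiliar with the pattern, the wait loop in the patch above follows the usual condition-variable idiom: re-check the predicate in a loop while holding the lock. The sketch below shows that idiom with standard C++ primitives only. SnapshotGate and waitReturnsAfterCommit are hypothetical names, and unlike opCtx->waitForConditionOrInterrupt in the patch, this version has no interruption support.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the snapshot state and its condition variable.
struct SnapshotGate {
    std::mutex mutex;
    std::condition_variable committedSnapshotCond;
    bool hasCommittedSnapshot = false;

    // Blocks until a committed snapshot exists. The loop re-checks the
    // predicate after every wakeup, so an early or spurious wakeup is harmless.
    void waitForCommittedSnapshot() {
        std::unique_lock<std::mutex> lock(mutex);
        while (!hasCommittedSnapshot) {
            committedSnapshotCond.wait(lock);
        }
    }

    // Publishes a new committed snapshot and wakes every waiter.
    void commitSnapshot() {
        {
            std::lock_guard<std::mutex> lock(mutex);
            hasCommittedSnapshot = true;
        }
        committedSnapshotCond.notify_all();
    }
};

// Starts a waiter, commits a snapshot from the calling thread, and reports
// whether the waiter returned.
bool waitReturnsAfterCommit() {
    SnapshotGate gate;
    bool waiterFinished = false;
    std::thread waiter([&] {
        gate.waitForCommittedSnapshot();
        waiterFinished = true;
    });
    gate.commitSnapshot();
    waiter.join();  // join() makes the waiter's write visible here
    assert(gate.hasCommittedSnapshot);
    return waiterFinished;
}
```

Because the predicate is checked under the lock before waiting, the waiter returns whether the commit happens before or after it starts waiting.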

Assignee: Vishnu Kaushik
Reporter: James MacMahon
Participants: James MacMahon, Vishnu Kaushik
Votes: 0
Watchers: 13

Created: Feb 19 2021 03:55:51 PM UTC
Updated: Oct 29 2023 09:57:18 PM UTC
Resolved: Apr 04 2021 04:47:05 PM UTC
Confidence Status Last Update: 24/Mar/21 9:06 PM
