Type: Task
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 4.2.15, 4.4.7, 5.0.0-rc0, 4.0.26
Affects Version/s: None
Component/s: Replication
Labels: None

Backwards Compatibility: Fully Compatible
Backport Requested: v4.9, v4.4, v4.2, v4.0, v3.6
Sprint: Repl 2021-05-03
Case:
Linked BF Score: 5
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

~~SERVER-54805~~ describes a case where the replication machinery on replica set secondaries ceases to make progress, with the symptom being that all threads in the replication writer thread pool are idle and the thread driving secondary replication is simultaneously blocked waiting for those writer threads to finish their work.

So far, this behavior has only manifested on systems with glibc versions susceptible to this glibc pthread condition variable bug. While I have not been able to build a minimal reproduction using our ThreadPool type, the scenario shown to exist in this blog post about using TLA+ to model glibc condition variables is directly analogous to how replication uses thread pools. In this scenario, a signal lost to the glibc bug leaves work pending in the thread pool, with no thread waking up to perform it.

Fortunately, there is a low-risk workaround for this bug as it manifests in the replication system's use of ThreadPool. Setting minThreads to 0 instead of its current value, which is equal to maxThreads, ensures that every wait performed by a worker thread eventually wakes up when the idle thread timeout expires.
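
The effect of the timed wait can be illustrated with a toy pool. This is only a sketch under stated assumptions, not the server's ThreadPool: the `SketchPool` class, its `notify=False` flag (which stands in for the signal lost to the glibc bug), and the 50 ms idle timeout are all hypothetical. The sketch also omits thread retirement; it only shows that a worker allowed to time out rechecks the queue, while a worker that waits indefinitely does not.

```python
import threading
from collections import deque


class SketchPool:
    """Toy pool for illustration only; NOT MongoDB's ThreadPool."""

    def __init__(self, min_threads, max_threads, idle_timeout):
        self.min_threads = min_threads
        self.num_threads = max_threads
        self.idle_timeout = idle_timeout
        self.cond = threading.Condition()
        self.tasks = deque()
        self.parked = threading.Event()  # set once a worker is about to wait
        for _ in range(max_threads):
            threading.Thread(target=self._worker, daemon=True).start()

    def schedule(self, task, notify=True):
        with self.cond:
            self.tasks.append(task)
            if notify:  # notify=False models the lost signal (assumption)
                self.cond.notify()

    def _worker(self):
        with self.cond:
            while True:
                while self.tasks:
                    # Run under the lock to keep the sketch short.
                    self.tasks.popleft()()
                # Only threads above the minimum use a timed wait.
                timed = self.num_threads > self.min_threads
                self.parked.set()
                self.cond.wait(self.idle_timeout if timed else None)


def task_runs_despite_lost_signal(min_threads, max_threads=1, idle_timeout=0.05):
    pool = SketchPool(min_threads, max_threads, idle_timeout)
    pool.parked.wait()  # the worker holds the lock until it is inside wait()
    ran = threading.Event()
    pool.schedule(ran.set, notify=False)  # enqueue work, "lose" the signal
    return ran.wait(timeout=1.0)


if __name__ == "__main__":
    print("min_threads=0:", task_runs_despite_lost_signal(min_threads=0))
    print("min_threads=max_threads:", task_runs_despite_lost_signal(min_threads=1))
```

With `min_threads=0` the worker's wait expires, it finds the queued task, and the task runs. With `min_threads` equal to `max_threads` the worker waits without a timeout, so the task stays pending until some later signal arrives.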

The task in this ticket is to change the value of minThreads in the writer thread pool used by replication to 0. This will not eliminate all possible failures due to the glibc bug, but it will eliminate the only one we've seen in practice until such time as the bug in glibc is corrected.

is depended on by

SERVER-54805 Mongo become unresponsive, Spike in Connections and FD

Closed

is duplicated by

SERVER-54805 Mongo become unresponsive, Spike in Connections and FD

Closed

SERVER-60164 db.serverStatus() hang on SECONDARY server #secondtime

Closed

SERVER-63402 High query response time for find operation in mongo 4.0.27 with mmap storage engine with random intervals (5/7/12/20 hours)

Closed

is related to

SERVER-56784 The replication thread of secondary hang up

Closed

SERVER-92554 Consider lowering maxIdleThreadAge for oplog applier thread pool

Open

SERVER-92557 Add better diagnostics to identify cases of lost condition variable signal in oplog applier thread pool

Open


Assignee: Lingzhi Deng
Reporter: Andy Schwerin
Participants: Andy Schwerin, Githook User, Lingzhi Deng, venkataramans rama
Votes: 0
Watchers: 22

Created: Apr 12 2021 08:02:02 PM UTC
Updated: Jul 29 2024 06:44:22 PM UTC
Resolved: Apr 29 2021 01:52:24 AM UTC
Confidence Status Last Update: 26/Apr/21 5:54 PM
