Type: Task
Resolution: Won't Fix
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- repl-shortlist

Assigned Teams:

Replication
Sprint:
Repl 2024-10-28, Repl 2024-11-11, Repl 2024-12-09, Repl 2024-12-23
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

In recent production cases of replication lag, we've seen all secondaries slowed by what looks to be replication coordinator mutex contention. One symptom is a "Scheduling heartbeat to fetch newer config" log line repeated every ~500 ms on each secondary. The primary had just been elected, and we suspect the secondaries were failing to retrieve the new config, which had a higher config version and term.

The root cause was not immediately clear, but we suspect it had to do with a buildup of replication heartbeats. We saw a linear increase in the number of replSetHeartbeat commands on the primary, peaking at ~20,000 commands/s. The flame graphs indicated 50 threads spending time in heartbeat code on each secondary.

Whenever a secondary receives a heartbeat from a primary with a new config version and term, it schedules a heartbeat to fetch the new config. The task is scheduled on the replication thread pool for immediate execution. That pool has 50 threads, which matches the 50 threads seen in the flame graphs. Notably, sending a heartbeat takes the replication coordinator mutex.

Our theory is that a network mishap on the primary somehow left the secondaries unable to complete the heartbeat reconfig. However, the secondaries were still receiving heartbeats from the primary, and on each heartbeat a secondary scheduled a new heartbeat task on the executor, adding to the _heartbeatHandles vector.
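To make the suspected buildup concrete, here is a minimal toy model in Python. It is only a sketch of the theory above, not the server's code: `ToySecondary`, `on_heartbeat`, `simulate`, and the `fetch_succeeds` flag are hypothetical names, the term is ignored for brevity, and a plain list stands in for the _heartbeatHandles vector.

```python
from dataclasses import dataclass, field


@dataclass
class ToySecondary:
    """Toy stand-in for a secondary. Hypothetical; not the real replication coordinator."""

    config_version: int = 1
    # Stands in for the _heartbeatHandles vector of scheduled heartbeat tasks.
    heartbeat_handles: list = field(default_factory=list)
    next_handle: int = 0

    def on_heartbeat(self, primary_config_version: int, fetch_succeeds: bool) -> None:
        """Handle one heartbeat received from the primary."""
        if primary_config_version <= self.config_version:
            return  # Config is already current, nothing to fetch.

        # Assumption under test: every such heartbeat schedules a new fetch task.
        handle = self.next_handle
        self.next_handle += 1
        self.heartbeat_handles.append(handle)

        if fetch_succeeds:
            # The fetch completes: install the config and drop the handle.
            self.config_version = primary_config_version
            self.heartbeat_handles.remove(handle)
        # Otherwise the handle stays queued and the config stays stale.


def simulate(num_heartbeats: int, fetch_succeeds: bool) -> int:
    """Return how many handles remain queued after num_heartbeats heartbeats."""
    secondary = ToySecondary()
    for _ in range(num_heartbeats):
        secondary.on_heartbeat(primary_config_version=2, fetch_succeeds=fetch_succeeds)
    return len(secondary.heartbeat_handles)
```

In this model, `simulate(100, fetch_succeeds=False)` leaves one queued handle per heartbeat received, while `simulate(100, fetch_succeeds=True)` leaves none, because the first successful fetch makes every later heartbeat a no-op. That linear growth is the shape we would expect if the theory holds.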

This all remains a theory so far, so this ticket's scope is to investigate via code inspection and my starter reproducer script in the linked ticket. If the buildup turns out to be possible, we should attempt to prune the _heartbeatHandles list, similar to what we did for the replication waiter list.
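One possible shape for such pruning, sketched in Python under stated assumptions: keep at most one pending config-fetch heartbeat per target and skip scheduling while one is in flight. `PendingFetches`, `schedule`, and `complete` are hypothetical names for illustration. This is not how the server manages _heartbeatHandles, and it is not necessarily the approach taken for the replication waiter list.

```python
class PendingFetches:
    """Hypothetical registry allowing one in-flight config fetch per target."""

    def __init__(self) -> None:
        # Maps a target host string to an opaque handle for its pending fetch.
        self._pending: dict[str, object] = {}

    def schedule(self, target: str) -> bool:
        """Schedule a fetch unless one is already pending. Return True if scheduled."""
        if target in self._pending:
            return False  # Duplicate request: do not queue another task.
        self._pending[target] = object()  # Placeholder for a real task handle.
        return True

    def complete(self, target: str) -> None:
        """Drop the handle once the fetch finishes, fails, or is canceled."""
        self._pending.pop(target, None)

    def __len__(self) -> int:
        return len(self._pending)


# Usage: a thousand heartbeats from the same primary queue only one fetch.
fetches = PendingFetches()
scheduled = [fetches.schedule("primary.example:27017") for _ in range(1000)]
```

Here only the first call to `schedule` returns True and `len(fetches)` stays at 1 until `complete` is called, so the number of queued handles is bounded by the number of targets rather than by the number of heartbeats received.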

out.txt
1.50 MB
Oct 25 2024 06:58:41 PM UTC

related to

SERVER-95633 Expose the number of actively queued heartbeats in serverStatus

Closed

SERVER-96256 Use more appropriate container for queued heartbeat handles

Closed

Assignee: Solomon Lifshits
Reporter: Ali Mir
Participants: Ali Mir, Solomon Lifshits
Votes: 0
Watchers: 10

Created: Oct 16 2024 09:26:41 PM UTC
Updated: Dec 09 2024 11:08:20 PM UTC
Resolved: Dec 09 2024 11:08:20 PM UTC
Confidence Status Last Update: 06/Nov/24 7:33 PM
