Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 5.0.0-rc0, 4.2.16, 4.0.27, 4.4.9
Affects Version/s: 3.6.3, 3.6.4
Component/s: Replication
Labels:
- dmd-perf
- former-quick-wins

Backwards Compatibility:
Minor Change
Operating System:
ALL
Backport Requested:
v4.9, v4.4, v4.2, v4.0, v3.6
Sprint:
Repl 2021-03-08, Repl 2021-04-05
Case:
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

We only advance the oldest timestamp at oplog batch boundaries. This means that all dirty content generated by applying the operations in a single batch stays pinned in cache until that batch completes. If the batch is large enough and the operations are heavy enough, this dirty content can exceed eviction_dirty_trigger (default 20% of cache), and the rate of applying operations becomes dramatically slower because the secondary has to wait for the dirty data to be reduced below the threshold.
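To make the threshold concrete, here is a rough back-of-the-envelope sketch. It assumes every operation in a batch leaves a fixed amount of dirty content pinned until the batch ends, which is a simplification. The function names, the 1 GiB cache, and the 10 KiB per operation are illustrative assumptions, not measurements from this ticket; only the 20% default comes from the description above.

```python
# Hypothetical model: dirty content pinned by one oplog batch vs. the dirty trigger.
# All sizes below are made-up illustration values, not measurements.

def dirty_bytes_for_batch(num_ops: int, dirty_bytes_per_op: int) -> int:
    """Dirty bytes pinned by a batch, assuming a fixed cost per operation."""
    return num_ops * dirty_bytes_per_op


def exceeds_dirty_trigger(
    num_ops: int,
    dirty_bytes_per_op: int,
    cache_bytes: int,
    trigger_fraction: float = 0.20,  # eviction_dirty_trigger default (20% of cache)
) -> bool:
    """True if the batch alone would push dirty content past the trigger."""
    pinned = dirty_bytes_for_batch(num_ops, dirty_bytes_per_op)
    return pinned > trigger_fraction * cache_bytes


CACHE_BYTES = 1024 ** 3   # assumed 1 GiB cache
PER_OP_BYTES = 10 * 1024  # assumed 10 KiB of dirty content per operation

small_batch_over = exceeds_dirty_trigger(5_000, PER_OP_BYTES, CACHE_BYTES)
large_batch_over = exceeds_dirty_trigger(50_000, PER_OP_BYTES, CACHE_BYTES)
```

Under these assumed numbers the 5,000-operation batch stays under the trigger and the 50,000-operation batch does not, so the same workload is fine in small batches and stalls in large ones.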

This can be triggered by a momentary slowdown on a secondary that causes it to lag briefly: the next batch it processes will be unusually large, which can push it past 20% dirty cache. That makes it lag even further, so the next batch will be even larger, and so on. In extreme cases the node can become completely stuck, because a full cache prevents a batch from completing and unpinning the data that is keeping the cache full.
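The feedback loop can be illustrated with a toy simulation. This is only a sketch under strong assumptions: each batch contains everything that arrived while the previous batch was being applied, batches are applied at a fast rate while they stay under the dirty trigger and at a slow rate once they exceed it, and batch size limits are ignored entirely. The function name, the rates, and the 20,000-operation cutoff are all hypothetical.

```python
# Toy model of the lag feedback loop. All rates and limits are hypothetical.

def simulate_batches(
    initial_backlog_ops: float,
    incoming_ops_per_sec: float,
    fast_apply_ops_per_sec: float,  # apply rate while under the dirty trigger
    slow_apply_ops_per_sec: float,  # apply rate once the trigger is exceeded
    max_fast_batch_ops: float,      # largest batch that stays under the trigger
    num_batches: int,
) -> list[float]:
    """Return the size of each successive batch in this simplified model."""
    sizes = []
    batch = float(initial_backlog_ops)
    for _ in range(num_batches):
        sizes.append(batch)
        over_trigger = batch > max_fast_batch_ops
        rate = slow_apply_ops_per_sec if over_trigger else fast_apply_ops_per_sec
        elapsed_sec = batch / rate
        # Everything written on the primary meanwhile becomes the next batch.
        batch = incoming_ops_per_sec * elapsed_sec
    return sizes


# A backlog under the cutoff shrinks batch by batch.
recovering = simulate_batches(10_000, 1_000, 2_000, 500, 20_000, 4)
# A backlog over the cutoff is applied slower than new work arrives, so it grows.
spiraling = simulate_batches(30_000, 1_000, 2_000, 500, 20_000, 4)
```

In this model the only difference between the two runs is the initial backlog. Once a batch crosses the cutoff, the apply rate drops below the incoming rate and every subsequent batch is larger than the last.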

This can also occur if a secondary is offline for maintenance; when it comes back online and begins to catch up, it will be processing large batches that risk exceeding the dirty trigger threshold, so it may apply operations at a much slower rate than a secondary that is keeping up and processing operations in small batches.

Attachments:
- batchsize-1.js (0.5 kB, Dec 12 2018 12:53:19 PM UTC)
- batchsize-1.sh (2 kB, Dec 12 2018 12:52:55 PM UTC)

is depended on by
- SERVER-35958 Big CPU load increase (×4) on secondary by upgrading 3.4.15 → 3.6.5 (Closed)

is duplicated by
- SERVER-35339 Complete recovery failure after unclean shutdown (Closed)

is related to
- SERVER-36495 Cache pressure issues during recovery oplog application (Closed)
- SERVER-36496 Cache pressure issues during oplog replay in initial sync (Closed)
- SERVER-35405 Change default setting for replBatchLimitOperations (Closed)

related to
- SERVER-37849 Poor replication performance and cache-full hang on secondary due to pinned content (Backlog)
- SERVER-33191 Cache-full hangs on 3.6 (Closed)
- SERVER-34941 Add testing to cover cases where timestamps cause cache pressure (Closed)
- SERVER-34942 Stuck with cache full during oplog replay in initial sync (Closed)
- SERVER-35103 Checkpoint creates unevictable clean content (Closed)
- SERVER-35191 Stuck with cache full during rollback (Closed)
- SERVER-107738 Consider making oplog applier threads configured based on number of cores by default (Open)
- SERVER-115636 Allow configuration of more than 2x cores replica threads (oplog applier threads) (Closed)


Assignee: Moustafa Maher
Reporter: Bruce Lucas (Inactive)
Participants: Alexander Gorrod, Andy Schwerin, Bruce Lucas, Githook User, Moustafa Maher, Spencer Brody
Votes: 6
Watchers: 78

Created: May 10 2018 07:50:04 PM UTC
Updated: Jan 05 2026 07:22:45 PM UTC
Resolved: Apr 01 2021 11:49:08 PM UTC
