[SERVER-34938] Secondary slowdown or hang due to content pinned in cache by single oplog batch Created: 10/May/18  Updated: 29/Oct/23  Resolved: 01/Apr/21

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.6.3, 3.6.4
Fix Version/s: 5.0.0-rc0, 4.2.16, 4.0.27, 4.4.9

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Moustafa Maher
Resolution: Fixed Votes: 6
Labels: dmd-perf, former-quick-wins
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File batchsize-1.js     File batchsize-1.sh    
Issue Links:
Backports
Depends
is depended on by SERVER-35958 Big CPU load increase (×4) on seconda... Closed
Documented
is documented by DOCS-15433 [SERVER] Secondary slowdown or hang d... Closed
Duplicate
is duplicated by SERVER-35339 Complete recovery failure after uncle... Closed
Problem/Incident
Related
related to SERVER-37849 Poor replication performance and cach... Backlog
related to SERVER-33191 Cache-full hangs on 3.6 Closed
related to SERVER-34941 Add testing to cover cases where time... Closed
related to SERVER-34942 Stuck with cache full during oplog re... Closed
related to SERVER-35103 Checkpoint creates unevictable clean ... Closed
related to SERVER-35191 Stuck with cache full during rollback Closed
is related to SERVER-36495 Cache pressure issues during recovery... Closed
is related to SERVER-36496 Cache pressure issues during oplog re... Closed
is related to SERVER-35405 Change default setting for replBatchL... Closed
Backwards Compatibility: Minor Change
Operating System: ALL
Backport Requested:
v4.9, v4.4, v4.2, v4.0, v3.6
Sprint: Repl 2021-03-08, Repl 2021-04-05
Participants:
Case:

 Description   

We only advance the oldest timestamp at oplog batch boundaries. This means that all dirty content generated by applying the operations in a single batch is pinned in cache. If the batch is large enough and the operations heavy enough, this dirty content can exceed eviction_dirty_trigger (default 20% of cache), and the rate of applying operations becomes dramatically slower because application must wait for the dirty data to be evicted back below the threshold.

This can be triggered by a momentary slowdown on a secondary: it lags briefly, so the next batch it processes is unusually large and pushes it past 20% dirty cache. That makes it lag even further, so the next batch is even larger, and so on. In extreme cases the node can become completely stuck, because the full cache prevents a batch from completing and unpinning the data that is keeping the cache full.

This can also occur if a secondary is offline for maintenance; when it comes back online and begins to catch up, it will be processing large batches that risk exceeding the dirty trigger threshold, so it may apply operations at a much slower rate than a secondary that is keeping up and processing operations in small batches.
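A quick way to see whether a secondary is running into this is to watch its dirty-cache ratio against the 20% eviction_dirty_trigger. A minimal mongo-shell sketch, assuming the standard serverStatus() WiredTiger cache statistics (stat names can vary slightly between versions):

    // Run against the lagging secondary. "tracked dirty bytes in the cache" and
    // "maximum bytes configured" are serverStatus() WiredTiger cache stats;
    // 20% mirrors the default eviction_dirty_trigger mentioned above.
    var cache = db.serverStatus().wiredTiger.cache;
    var dirtyPct = 100 * cache["tracked dirty bytes in the cache"] /
                         cache["maximum bytes configured"];
    print("dirty cache: " + dirtyPct.toFixed(1) + "% (eviction_dirty_trigger: 20%)");

If this ratio sits at or above the trigger while a large batch is being applied, that batch is likely the content pinning the cache.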



 Comments   
Comment by Githook User [ 30/Aug/21 ]

Author: Moustafa Maher <m.maher@10gen.com> (moustafamaher)

Message: SERVER-34938 Secondary slowdown or hang due to content pinned in cache by single oplog batch
Branch: v4.0
https://github.com/mongodb/mongo/commit/d47b151b55f286546e7c7c98888ae0577856ca20

Comment by Githook User [ 30/Aug/21 ]

Author: Moustafa Maher <m.maher@10gen.com> (moustafamaher)

Message: SERVER-34938 Secondary slowdown or hang due to content pinned in cache by single oplog batch
Branch: v4.2
https://github.com/mongodb/mongo/commit/8a5c1f52a77bc92320c90ecbe5424e841874296e

Comment by Githook User [ 04/Aug/21 ]

Author: Moustafa Maher <m.maher@10gen.com> (moustafamaher)

Message: SERVER-34938 Secondary slowdown or hang due to content pinned in cache by single oplog batch
Branch: v4.4
https://github.com/mongodb/mongo/commit/ac3ff8214bf42119945d49d183d5410bd81a2913

Comment by Githook User [ 01/Apr/21 ]

Author: Moustafa Maher <m.maher@10gen.com> (moustafamaher)

Message: SERVER-34938 Secondary slowdown or hang due to content pinned in cache by single oplog batch
Branch: master
https://github.com/mongodb/mongo/commit/39e9990a2fe1b72e5d776a86651469894d0d72af

Comment by Alexander Gorrod [ 20/Jan/21 ]

Is it correct to assume that Durable History will make this problem go away?

 

It is unlikely that durable history will help in this case. The transaction(s) being used to apply the batch of oplog entries are pinning the cache, and even with durable history, WiredTiger can't evict content related to active transactions.

 

That is, I don't believe the issue stems from a combination of committed and uncommitted data that can't be evicted; it's that there is simply too much uncommitted data outright.

Comment by Bruce Lucas (Inactive) [ 12/Dec/18 ]

Attached a repro, batchsize-1.{sh,js}, that

  • starts a workload that generates a large amount of dirty cache content per replicated op (a sketch of this kind of workload is below)
  • after 60 seconds, slows one of the secondaries for 30 seconds by pausing it for 0.8 seconds out of every second, generating some lag
  • after that 30-second slowdown, the secondary remains stuck in a low-performance state, permanently or for an extended time, so it is not able to catch up
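
For context, here is a minimal sketch (not the attached batchsize-1.js, just an illustration of the kind of workload the first bullet describes) in which each replicated op rewrites a large document, so a single oplog entry dirties a substantial amount of cache on the secondary:

    // Hypothetical workload sketch: each update rewrites a ~512 KB field, so every
    // replicated op dirties a large amount of cache on the secondary, and that
    // content stays pinned until the batch boundary advances the oldest timestamp.
    var big = new Array(512 * 1024).join("x");
    var coll = db.getSiblingDB("batchtest").c;
    for (var i = 0; i < 1000; i++) {
        coll.insert({_id: i, payload: big});
    }
    while (true) {
        var id = Math.floor(Math.random() * 1000);
        coll.update({_id: id}, {$set: {payload: new Array(512 * 1024).join("y")}});
    }
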
Comment by Bruce Lucas (Inactive) [ 17/Jul/18 ]

Note that a side effect of this issue can be unbounded lookaside table growth, which manifests as unbounded growth of the WiredTigerLAS.wt file.

Comment by Bruce Lucas (Inactive) [ 30/May/18 ]

The reproducer on SERVER-33191 also reproduces this issue, as the hang that it reproduces is triggered by the cache pressure caused by this issue.

Comment by Spencer Brody (Inactive) [ 23/May/18 ]

FYI, there is a setParameter currently, replBatchLimitOperations, which controls the max number of oplog entries in a single replication batch. If users are hitting this issue, they can consider turning that value down from its default of 50,000 to produce smaller secondary batches.
Note however that this is a limit on the number of documents in a batch, which may not be a great proxy for the size of the batch in MB of oplog entries or in the number of pages touched when applying it, so it's difficult to predict how changes to this parameter will translate into changes in the cache pressure associated with applying batches.
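For example, a minimal sketch of lowering it at runtime on the secondary (the value 10,000 is purely illustrative, not a recommendation, and this assumes the parameter is runtime-settable on the version in question):

    // Run on the secondary; lowers the per-batch oplog-entry limit from the
    // default of 50,000. The value here is illustrative only.
    db.adminCommand({setParameter: 1, replBatchLimitOperations: 10000});
    // Confirm the current setting.
    db.adminCommand({getParameter: 1, replBatchLimitOperations: 1});

The same setting can also be applied at startup via --setParameter replBatchLimitOperations=<n>.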

Comment by Bruce Lucas (Inactive) [ 18/May/18 ]

We need to consider the impact of replicated index builds on this issue. An index build will generate some pinned cache content during the application of a batch, but that content won't be included in any accounting of the batch's direct impact.

Comment by Bruce Lucas (Inactive) [ 11/May/18 ]

Summarizing some offline conversations with schwerin and michael.cahill:

  • the point of the markers above was to avoid adding a field to every oplog entry, on the assumption that doing so may be difficult. If it's possible to add a field to every oplog entry, that's simpler.
  • any measure used to adaptively adjust the size of the batches on the primary should be equally applicable to all secondaries. Something based on actual page sizes might not satisfy that, as such details could differ on the secondaries.
  • a simple measure based on the number of pages touched might be a sufficient rough approximation
  • an even simpler approach of just limiting the maximum batch size might be sufficient, and more suitable for backport

Comment by Bruce Lucas (Inactive) [ 11/May/18 ]

I'm not sure that's sufficient. In some customer workloads we've seen a single large batch delay things for a very long time, particularly when catching up from a large lag.

How about something like the following:

  • estimate the cache pressure, or potential cache pressure, created by each op on the primary. This could be done, for example, by
    • having the storage engine give an accounting or estimate of pinned content or potential pinned content (after reconciliation) for each op
    • using the number of pages touched (maybe the number of inserts and deletes) as a rough proxy
    • every so many ops, querying the storage engine for statistics related to pinned content
  • every time this estimate crosses some fixed boundary, e.g. 5% of cache size, insert a marker in the oplog
  • on the secondaries, never include more than one such marker in a batch (allowing one ensures we don't get tiny batches near boundaries, if that's important)

Comment by Andy Schwerin [ 11/May/18 ]

We can easily reduce the maximum batch size as measured in count of oplog entries or bytes of oplog entries, but that may be a poor proxy for cache pressure caused by a batch.

Would it be enough to reduce the batch size for subsequent batches during periods of high cache pressure?

Comment by Bruce Lucas (Inactive) [ 10/May/18 ]

Note I'm not sure whether the "node can become completely stuck due to full cache" part of the problem description is a necessary consequence of this issue, or whether the lookaside mechanism is meant to prevent that and limit the consequences to an extreme slowdown. If the latter, there may be an additional ticket to file (but fixing lookaside won't fix the extreme performance consequences of this issue).
