[SERVER-34938] Secondary slowdown or hang due to content pinned in cache by single oplog batch Created: 10/May/18 Updated: 29/Oct/23 Resolved: 01/Apr/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.6.3, 3.6.4 |
| Fix Version/s: | 5.0.0-rc0, 4.2.16, 4.0.27, 4.4.9 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Moustafa Maher |
| Resolution: | Fixed | Votes: | 6 |
| Labels: | dmd-perf, former-quick-wins |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Minor Change |
| Operating System: | ALL |
| Backport Requested: | v4.9, v4.4, v4.2, v4.0, v3.6 |
| Sprint: | Repl 2021-03-08, Repl 2021-04-05 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
We only advance the oldest timestamp at oplog batch boundaries, so all dirty content generated by applying the operations in a single batch is pinned in cache. If the batch is large enough and its operations heavy enough, this dirty content can exceed eviction_dirty_trigger (default 20% of cache), and the rate of applying operations becomes dramatically slower because application must wait for the dirty data to be reduced below the threshold.

This can be triggered by a momentary slowdown on a secondary: the node lags briefly, so the next batch it processes is unusually large and pushes it over 20% dirty cache. That makes it lag even further, so the next batch is larger still, and so on. In extreme cases the node can become completely stuck, because the full cache prevents the batch from completing and unpinning the very data that is keeping the cache full.

The same thing can happen when a secondary is taken offline for maintenance: when it comes back online and begins to catch up, it processes large batches that risk exceeding the dirty trigger threshold, so it may apply operations at a much slower rate than a secondary that is keeping up and processing operations in small batches. |
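A minimal mongo shell sketch for watching this condition on a secondary; the statistic names are the standard WiredTiger cache counters reported by serverStatus, while the sample count, interval, and the 20% comparison (mirroring the default eviction_dirty_trigger above) are illustrative choices, not recommendations:

```js
// Poll the dirty-cache ratio on a secondary while a large batch is applied.
// "tracked dirty bytes in the cache" and "maximum bytes configured" are the
// standard WiredTiger cache statistics exposed by serverStatus; 0.20 mirrors
// the default eviction_dirty_trigger discussed above.
function dirtyCacheRatio() {
    var c = db.serverStatus().wiredTiger.cache;
    return c["tracked dirty bytes in the cache"] / c["maximum bytes configured"];
}

for (var i = 0; i < 60; i++) {            // 60 one-second samples (arbitrary)
    var r = dirtyCacheRatio();
    print(new Date().toISOString() + "  dirty cache: " + (r * 100).toFixed(1) + "%" +
          (r > 0.20 ? "  <-- above default eviction_dirty_trigger" : ""));
    sleep(1000);                          // mongo shell helper, milliseconds
}
```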
| Comments |
| Comment by Githook User [ 30/Aug/21 ] |
|
Author: {'name': 'Moustafa Maher', 'email': 'm.maher@10gen.com', 'username': 'moustafamaher'} Message: |
| Comment by Githook User [ 30/Aug/21 ] |
|
Author: {'name': 'Moustafa Maher', 'email': 'm.maher@10gen.com', 'username': 'moustafamaher'} Message: |
| Comment by Githook User [ 04/Aug/21 ] |
|
Author: {'name': 'Moustafa Maher', 'email': 'm.maher@10gen.com', 'username': 'moustafamaher'} Message: |
| Comment by Githook User [ 01/Apr/21 ] |
|
Author: {'name': 'Moustafa Maher', 'email': 'm.maher@10gen.com', 'username': 'moustafamaher'} Message: |
| Comment by Alexander Gorrod [ 20/Jan/21 ] |
It is unlikely that durable history will help in this case. The transaction(s) being used to apply the batch of oplog entries are pinning the cache, and even with durable history, WiredTiger can't evict content related to active transactions.
That is, I don't believe the issue arises from a combination of committed and uncommitted data that can't be evicted; it is that there is simply too much uncommitted data outright. |
| Comment by Bruce Lucas (Inactive) [ 12/Dec/18 ] |
|
Attached a repro, batchsize-1.{sh,js}, that
|
| Comment by Bruce Lucas (Inactive) [ 17/Jul/18 ] |
|
Note that a side effect of this issue can be unbounded lookaside table growth, which shows up on disk as unbounded growth of the WiredTigerLAS.wt file. |
| Comment by Bruce Lucas (Inactive) [ 30/May/18 ] |
|
The reproducer on |
| Comment by Spencer Brody (Inactive) [ 23/May/18 ] |
|
FYI, there is currently a setParameter, replBatchLimitOperations, which controls the maximum number of oplog entries in a single replication batch. If users are hitting this issue, they can consider turning that value down from its default of 50,000 to produce smaller secondary batches, as in the sketch below. |
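A hedged example of adjusting that parameter from the mongo shell; the 10,000 shown is purely illustrative, and depending on the server version the parameter may be startup-only, in which case it has to be passed as --setParameter on the command line or in the config file instead:

```js
// Illustrative only: cap the number of oplog entries per replication batch on a
// secondary. 10000 is an example value, not a recommendation; choose a value
// based on observed cache pressure for the actual workload.
db.adminCommand({ setParameter: 1, replBatchLimitOperations: 10000 });

// Read the value back to confirm the change took effect.
db.adminCommand({ getParameter: 1, replBatchLimitOperations: 1 });
```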
| Comment by Bruce Lucas (Inactive) [ 18/May/18 ] |
|
We need to consider the impact of replicated index builds on this issue. The index build will generate some pinned cache content during the application of a batch, but won't be included in any accounting of the direct impact of the batch. |
| Comment by Bruce Lucas (Inactive) [ 11/May/18 ] |
|
Summarizing some offline conversations with schwerin and michael.cahill
|
| Comment by Bruce Lucas (Inactive) [ 11/May/18 ] |
|
I'm not sure that's sufficient. In some customer workloads we've seen a single large batch delay things for a very long time, particularly in the situation where we're catching up from a large lag. How about something like the following:
|
| Comment by Andy Schwerin [ 11/May/18 ] |
|
We can easily reduce the maximum batch size as measured in count of oplog entries or bytes of oplog entries, but that may be a poor proxy for cache pressure caused by a batch. Would it be enough to reduce the batch size for subsequent batches during periods of high cache pressure? |
| Comment by Bruce Lucas (Inactive) [ 10/May/18 ] |
|
Note that I'm not sure whether the "node can become completely stuck due to full cache" part of the problem description is a necessary consequence of this issue, or whether the lookaside mechanism is meant to prevent that and limit the consequences to an extreme slowdown. If the latter, there may be an additional ticket here (but fixing lookaside won't fix the extreme performance consequences of this issue). |