[SERVER-34942] Stuck with cache full during oplog replay in initial sync Created: 10/May/18 Updated: 29/Oct/23 Resolved: 27/Jul/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 3.6.7, 4.0.1, 4.1.2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Benety Goh |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.0, v3.6 |
| Sprint: | Storage NYC 2018-07-16, Storage NYC 2018-07-30 |
| Participants: |
| Case: | (copied to CRM) |
| Linked BF Score: | 39 |
| Description |
|
The oldest timestamp is only advanced at the end of every batch during oplog replay in initial sync. This means that all dirty content generated by applying the operations in a single batch is pinned in cache. If the batch is large enough and the operations are heavy enough, this dirty content can exceed eviction_dirty_trigger (default 20% of cache), and the rate of applying operations becomes dramatically slower because the applier has to wait for the dirty data to be reduced below the threshold. In extreme cases the node can become completely stuck, because the full cache prevents the batch from completing and unpinning the data that is keeping the cache full (although I'm not sure whether that is a necessary consequence of this or a failure of the lookaside mechanism to keep the node from getting completely stuck). This is similar to |
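The pinning effect described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the server's or WiredTiger's actual accounting: the function names are hypothetical, each operation is assumed to dirty a known number of bytes, and nothing dirtied inside a batch is assumed to be evictable until the batch ends. The 20% default for eviction_dirty_trigger is taken from the description.

```python
def dirty_trigger_bytes(cache_bytes, eviction_dirty_trigger_pct=20):
    """Dirty-content threshold in bytes (hypothetical helper; 20% default per the description)."""
    return cache_bytes * eviction_dirty_trigger_pct // 100


def ops_before_stall(op_dirty_bytes, cache_bytes, eviction_dirty_trigger_pct=20):
    """Count how many operations of one batch apply before pinned dirty bytes exceed the trigger.

    Toy assumption: dirty content from the current batch stays pinned until the
    batch completes, so it only ever accumulates. Returns len(op_dirty_bytes)
    when the whole batch fits under the trigger.
    """
    limit = dirty_trigger_bytes(cache_bytes, eviction_dirty_trigger_pct)
    pinned = 0
    for index, dirty in enumerate(op_dirty_bytes):
        pinned += dirty
        if pinned > limit:
            # From this operation on, the applier would be waiting on eviction
            # that cannot make progress until the batch ends.
            return index
    return len(op_dirty_bytes)
```

In this model, a batch whose total dirty footprint stays under the trigger applies in full, while a heavier batch crosses the threshold partway through, which is the point where the slowdown (or, in the extreme case, the stall) would begin.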
| Comments |
| Comment by Benety Goh [ 27/Jul/18 ] |
|
The reproduction script shows that the stuck cache issue was resolved between 3.6.5 and 3.6.6. There is a performance regression reported in |
| Comment by Githook User [ 20/Jul/18 ] |
|
Author: {'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'} Message: (cherry picked from commit 2c2427c96848e90129ef10ceb36a0454c2736ab1) |
| Comment by Githook User [ 20/Jul/18 ] |
|
Author: {'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'} Message: (cherry picked from commit 2c2427c96848e90129ef10ceb36a0454c2736ab1) |
| Comment by Githook User [ 19/Jul/18 ] |
|
Author: {'username': 'benety', 'name': 'Benety Goh', 'email': 'benety@mongodb.com'} Message: |
| Comment by Bruce Lucas (Inactive) [ 11/May/18 ] |
|
I think this is essentially the same problem as |
| Comment by Eric Milkie [ 10/May/18 ] |
|
The only way around this that I can see is to be smarter about making smaller batches where there is a potential to dirty a high number of pages. |
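One way to picture the "smaller batches" idea is a splitter that closes a batch once its estimated dirty footprint would exceed a budget. This is a minimal sketch, not the change that was actually committed for this ticket: `split_batch`, `estimate_dirty_bytes`, and `budget_bytes` are hypothetical names, and how to estimate the bytes an operation will dirty is left as an assumption supplied by the caller.

```python
def split_batch(ops, estimate_dirty_bytes, budget_bytes):
    """Split ops into consecutive batches whose estimated dirty bytes stay within budget.

    estimate_dirty_bytes is a caller-supplied (hypothetical) function mapping an
    operation to the number of bytes it is expected to dirty. An operation that
    alone exceeds the budget still gets a batch of its own, so progress is
    always possible. Operation order is preserved.
    """
    batches = []
    current = []
    used = 0
    for op in ops:
        cost = estimate_dirty_bytes(op)
        # Close the current batch before it would go over budget.
        if current and used + cost > budget_bytes:
            batches.append(current)
            current, used = [], 0
        current.append(op)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Because the oldest timestamp advances at the end of every batch, ending a batch earlier bounds how much dirty content can be pinned at once; the trade-off is more batch boundaries for workloads that would not have hit the threshold anyway.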