[SERVER-34942] Stuck with cache full during oplog replay in initial sync Created: 10/May/18  Updated: 29/Oct/23  Resolved: 27/Jul/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.6.7, 4.0.1, 4.1.2

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Benety Goh
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
Related
related to SERVER-33191 Cache-full hangs on 3.6 Closed
related to SERVER-36238 replica set startup fails in wt_cache... Closed
is related to SERVER-34900 initial sync uses different batch lim... Closed
is related to SERVER-34938 Secondary slowdown or hang due to con... Closed
is related to SERVER-36496 Cache pressure issues during oplog re... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.0, v3.6
Sprint: Storage NYC 2018-07-16, Storage NYC 2018-07-30
Participants:
Case:
Linked BF Score: 39

 Description   

The oldest timestamp is only advanced at the end of each batch during oplog replay in initial sync. This means that all dirty content generated by applying the operations in a single batch is pinned in cache. If the batch is large enough and the operations are heavy enough, this dirty content can exceed eviction_dirty_trigger (default 20% of cache), and the rate of applying operations becomes dramatically slower because the appliers have to wait for the dirty data to fall back below the threshold.

In extreme cases the node can become completely stuck: the full cache prevents the batch from completing, and completing the batch is what would unpin the data that is keeping the cache full (although I'm not sure whether that is a necessary consequence of this issue or a failure of the lookaside mechanism to keep the node from getting completely stuck).

This is similar to SERVER-34938, but I believe oplog application during initial sync is a different code path from normal replication. If not, feel free to close this as a duplicate.
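
As an illustration of the failure mode described above, a minimal monitoring sketch along these lines can be pointed at the syncing node to watch dirty-cache pressure during oplog replay (the host/port, the directConnection flag, and the hard-coded 20% eviction_dirty_trigger threshold are assumptions for the example, not values taken from this ticket):

    # Minimal sketch: poll serverStatus on the syncing node and report WiredTiger
    # dirty-cache pressure. Host/port and the 20% threshold are assumptions.
    import time
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017", directConnection=True)

    while True:
        cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
        max_bytes = cache["maximum bytes configured"]
        dirty_bytes = cache["tracked dirty bytes in the cache"]
        dirty_pct = 100.0 * dirty_bytes / max_bytes
        # eviction_dirty_trigger defaults to 20% of the cache; once dirty content
        # crosses it, application threads are drafted into eviction and slow down.
        flag = "  <-- above eviction_dirty_trigger" if dirty_pct >= 20.0 else ""
        print(f"dirty: {dirty_bytes / 2**20:.1f} MiB ({dirty_pct:.1f}% of cache){flag}")
        time.sleep(1)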



 Comments   
Comment by Benety Goh [ 27/Jul/18 ]

The reproduction script shows that the stuck-cache issue was resolved between 3.6.5 and 3.6.6. There is a performance regression reported in SERVER-36221 that we will continue to investigate.
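
The reproduction script itself is not attached to this ticket; a hypothetical workload along the following lines (collection name, index count, document sizes, and replica set name are illustrative assumptions) is the kind of setup that makes a single oplog batch dirty many pages during initial sync:

    # Hypothetical workload sketch (not the original reproduction script): build a
    # collection whose oplog entries dirty many data and index pages, so replaying
    # them during initial sync pins a large amount of dirty content in cache.
    import os
    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://localhost:27017", replicaSet="rs0")  # assumed rs name
    coll = client.test.cache_pressure

    # Several secondary indexes multiply the pages dirtied by every insert.
    for field in ("a", "b", "c", "d"):
        coll.create_index([(field, ASCENDING)])

    batch = []
    for _ in range(200_000):  # sizes and counts are illustrative, not from the ticket
        batch.append({
            "a": os.urandom(16).hex(), "b": os.urandom(16).hex(),
            "c": os.urandom(16).hex(), "d": os.urandom(16).hex(),
            "pad": "x" * 4096,
        })
        if len(batch) == 1000:
            coll.insert_many(batch)
            batch = []

    # With the data in place, add a new member to the set and watch its dirty
    # cache while it replays the oplog during initial sync.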

Comment by Githook User [ 20/Jul/18 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-34942 add test to fill wiredtiger cache during initial sync oplog replay

(cherry picked from commit 2c2427c96848e90129ef10ceb36a0454c2736ab1)
Branch: v3.6
https://github.com/mongodb/mongo/commit/68b3d2b7945ed8057c26e0507969009d7217d097

Comment by Githook User [ 20/Jul/18 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-34942 add test to fill wiredtiger cache during initial sync oplog replay

(cherry picked from commit 2c2427c96848e90129ef10ceb36a0454c2736ab1)
Branch: v4.0
https://github.com/mongodb/mongo/commit/a467195cae7046e08ee1f09926dc072c16edfd30

Comment by Githook User [ 19/Jul/18 ]

Author:

{'username': 'benety', 'name': 'Benety Goh', 'email': 'benety@mongodb.com'}

Message: SERVER-34942 add test to fill wiredtiger cache during initial sync oplog replay
Branch: master
https://github.com/mongodb/mongo/commit/2c2427c96848e90129ef10ceb36a0454c2736ab1

Comment by Bruce Lucas (Inactive) [ 11/May/18 ]

I think this is essentially the same problem as SERVER-34938, and there's some discussion over there about a solution. I opened this as a separate ticket because the manifestation is different, it's helpful to have tickets focused on symptoms, and I also believe the code paths are different.

Comment by Eric Milkie [ 10/May/18 ]

The only way around this that I can see is to be smarter about batching: make smaller batches where there is potential to dirty a large number of pages.
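
A rough sketch of that idea (names and the budget callback are hypothetical, not the server's actual batching code): cut a batch early once the estimated bytes it would dirty approach a budget derived from eviction_dirty_trigger, rather than batching purely on operation count or document size.

    # Illustrative only: group oplog entries into batches whose estimated dirty
    # footprint stays under a budget. `estimate_dirty_bytes` is a hypothetical
    # callback guessing how many cache bytes applying one entry will dirty
    # (document bytes plus index entries touched).
    def batch_oplog_entries(entries, dirty_budget_bytes, estimate_dirty_bytes):
        batch, estimated = [], 0
        for entry in entries:
            cost = estimate_dirty_bytes(entry)
            if batch and estimated + cost > dirty_budget_bytes:
                yield batch          # close the batch before it pins too much cache
                batch, estimated = [], 0
            batch.append(entry)
            estimated += cost
        if batch:
            yield batch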
