Core Server / SERVER-34942

Stuck with cache full during oplog replay in initial sync

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 3.6.7, 4.0.1, 4.1.2
    • Affects Version/s: None
    • Component/s: Replication
    • Labels:
    • Fully Compatible
    • ALL
    • v4.0, v3.6
    • Storage NYC 2018-07-16, Storage NYC 2018-07-30
    • 39

      During oplog replay in initial sync, the oldest timestamp is advanced only at the end of each batch. This means that all dirty content generated by applying the operations in a single batch stays pinned in cache until that batch completes. If the batch is large enough and the operations are heavy enough, this dirty content can exceed eviction_dirty_trigger (default 20% of cache), and the rate of applying operations becomes dramatically slower because operation application has to wait for the dirty data to be reduced below the threshold.
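      The effect described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not WiredTiger or server code: the names ToyCache, stalled_ops, and advance_every are hypothetical, dirty content is tracked as a plain byte count, and advancing the oldest timestamp is assumed to make all pinned dirty content evictable at once. The advance_every parameter is there only for comparison and is not a description of how the issue was actually resolved.

```python
from dataclasses import dataclass

# Default fraction of cache cited in the report for eviction_dirty_trigger.
EVICTION_DIRTY_TRIGGER = 0.20


@dataclass
class ToyCache:
    """Toy model: dirty bytes stay pinned until the oldest timestamp advances."""
    size_bytes: int
    pinned_dirty_bytes: int = 0

    def apply_op(self, dirty_bytes: int) -> None:
        # Each applied operation adds dirty content that cannot be evicted yet.
        self.pinned_dirty_bytes += dirty_bytes

    def advance_oldest_timestamp(self) -> None:
        # Assumption: advancing the oldest timestamp unpins everything at once.
        self.pinned_dirty_bytes = 0

    def over_trigger(self) -> bool:
        return self.pinned_dirty_bytes > EVICTION_DIRTY_TRIGGER * self.size_bytes


def stalled_ops(cache, batch, advance_every=None):
    """Count operations that start while the cache is over the dirty trigger.

    advance_every=None models advancing the oldest timestamp only at the
    end of the batch; an integer N (hypothetical) advances it every N ops.
    """
    stalled = 0
    for i, dirty_bytes in enumerate(batch, start=1):
        if cache.over_trigger():
            stalled += 1
        cache.apply_op(dirty_bytes)
        if advance_every is not None and i % advance_every == 0:
            cache.advance_oldest_timestamp()
    cache.advance_oldest_timestamp()  # end of batch
    return stalled


if __name__ == "__main__":
    batch = [50] * 10  # ten operations, 50 dirty bytes each
    print(stalled_ops(ToyCache(size_bytes=1000), batch))
    print(stalled_ops(ToyCache(size_bytes=1000), batch, advance_every=2))
```

      In this model, with a 1000-byte cache the trigger is 200 bytes, so a batch of ten 50-byte operations crosses it partway through when the timestamp advances only at the end of the batch, while the same batch never crosses it when the timestamp advances every two operations.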

      In extreme cases the node can become completely stuck: the full cache prevents the batch from completing, and the batch must complete before the data that is keeping the cache full can be unpinned. (I'm not sure whether that is a necessary consequence of this behavior or a failure of the lookaside mechanism to keep the node from getting completely stuck.)

      This is similar to SERVER-34938, but I believe oplog application during initial sync is a different codepath from normal replication. If not, feel free to close as a dup.

            benety.goh@mongodb.com Benety Goh
            bruce.lucas@mongodb.com Bruce Lucas (Inactive)