Core Server / SERVER-34942

Stuck with cache full during oplog replay in initial sync


Details

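    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6.7, 4.0.1, 4.1.2
    • Component/s: Replication
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Backport Requested: v4.0, v3.6
    • Sprint: Storage NYC 2018-07-16, Storage NYC 2018-07-30
    • 39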

    Description

      The oldest timestamp is only advanced at the end of each batch during oplog replay in initial sync. This means that all dirty content generated by applying the operations in a single batch is pinned in cache until the batch completes. If the batch is large enough and the operations are heavy enough, this dirty content can exceed eviction_dirty_trigger (default 20% of cache), and the rate of applying operations becomes dramatically slower because application has to wait for the dirty data to be reduced below the threshold.
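
      As a rough illustration of the arithmetic above, here is a toy model (not MongoDB or WiredTiger code). The function names, the 1 GiB cache size, and the 64 KiB of dirty data per operation are hypothetical figures chosen only for the example; the 20% default is the one quoted above. It assumes every applied operation leaves a fixed amount of dirty data that stays pinned until the batch ends.

```python
def max_ops_before_dirty_trigger(cache_bytes, dirty_bytes_per_op, dirty_trigger_pct=20):
    """Number of ops a single batch can apply before its pinned dirty
    data exceeds the dirty trigger (toy model: fixed dirty bytes per op)."""
    trigger_bytes = cache_bytes * dirty_trigger_pct // 100
    return trigger_bytes // dirty_bytes_per_op


def batch_exceeds_trigger(batch_ops, cache_bytes, dirty_bytes_per_op, dirty_trigger_pct=20):
    """True if the batch would pin more dirty data than the trigger allows,
    because nothing is unpinned until the batch completes."""
    limit = max_ops_before_dirty_trigger(cache_bytes, dirty_bytes_per_op, dirty_trigger_pct)
    return batch_ops > limit


# Hypothetical figures: 1 GiB cache, 64 KiB of dirty data per applied op.
cache = 1 << 30
per_op = 64 * 1024
print(max_ops_before_dirty_trigger(cache, per_op))   # 3276
print(batch_exceeds_trigger(5000, cache, per_op))    # True
print(batch_exceeds_trigger(1000, cache, per_op))    # False
```

      Under these assumptions a batch of a few thousand heavy operations is already enough to cross the threshold, which is the slowdown scenario described here.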

      In extreme cases the node can become completely stuck: the full cache prevents a batch from completing, so the data that is keeping the cache full is never unpinned. (I'm not sure whether that is a necessary consequence of this behavior or a failure of the lookaside mechanism to keep the node from getting completely stuck.)

      This is similar to SERVER-34938, but I believe oplog application during initial sync is a different codepath from normal replication. If not, feel free to close as a dup.

            People

              benety.goh@mongodb.com Benety Goh
              bruce.lucas@mongodb.com Bruce Lucas