  Core Server / SERVER-32513

Initial sync unnecessarily throws away oplog entries

    • Type: Improvement
    • Resolution: Won't Fix
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Replication
    • Labels: None

      Initial sync proceeds in the following phases (a simplified model of the whole sequence is sketched in the code after this list):
      1) Get the "initial sync begin timestamp" (B). In 4.0 and earlier this is the most recent oplog entry on the sync source. In 4.2 we will get both the "initial sync fetch begin timestamp" (Bf), which will equal the oldest active transaction oplog entry on the sync source, and the "initial sync apply begin timestamp" (Ba), which will be the most recent oplog entry on the sync source.
      2) Start fetching oplog entries from B (or, in 4.2, from Bf). Whenever an oplog entry is fetched, it is inserted into an uncapped local collection.
      3) Clone all data, simultaneously creating indexes as we clone each collection.
      4) Get the "initial sync end timestamp" (E), which will be the most recent oplog entry on the sync source.
      5) Start applying oplog entries from B (in 4.0 and earlier) or Ba (in 4.2+). When applying an oplog entry, it also gets written into the real, capped oplog.
      6) As we apply oplog entries, if we try to apply an update but do not have a local version of the document to update, we fetch that document from the sync source and get a new "initial sync end timestamp" by fetching the most recent oplog entry on the sync source again.
      7) Stop both fetching and applying once we have applied up to the most recently set value of E (the "initial sync end timestamp"). We have been fetching this entire time and have generally fetched much more oplog than is necessary; say the last oplog entry fetched was at time F, such that F > E.
      8) Drop the uncapped local collection.
      9) Leave initial sync and begin fetching from E.
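
      A minimal, runnable model of this sequence is sketched below. The SyncSource struct, the deque standing in for the uncapped local collection, and all of the timestamps are illustrative only (none of them are real server types); the point is to show how E and F drift apart and how the already-fetched entries in (E, F] are discarded today.

      // Toy model of the initial sync phases described above; all names and
      // numbers are hypothetical, not server code.
      #include <cstdint>
      #include <deque>
      #include <iostream>
      #include <vector>

      struct OplogEntry {
          std::int64_t ts;  // timestamp of the entry
      };

      // Stand-in for the sync source's capped oplog.
      struct SyncSource {
          std::vector<OplogEntry> oplog;
          std::int64_t latestTs() const { return oplog.back().ts; }
      };

      int main() {
          SyncSource source;
          for (std::int64_t t = 1; t <= 100; ++t) source.oplog.push_back({t});

          // 1) "initial sync begin timestamp" B: newest entry on the sync source.
          const std::int64_t B = source.latestTs();

          // 2) Fetch from B onward into an uncapped local collection (modeled as a deque).
          std::deque<OplogEntry> localBuffer;
          localBuffer.push_back({B});

          // 3) ... clone data and build indexes; meanwhile the source keeps writing
          //    and the fetcher keeps appending to localBuffer ...
          for (std::int64_t t = 101; t <= 160; ++t) {
              source.oplog.push_back({t});
              localBuffer.push_back({t});
          }

          // 4) "initial sync end timestamp" E: newest entry on the sync source now.
          const std::int64_t E = source.latestTs();

          // 5-7) Fetching keeps running past E to some F > E while we apply up to E.
          for (std::int64_t t = 161; t <= 180; ++t) {
              source.oplog.push_back({t});
              localBuffer.push_back({t});
          }
          std::int64_t applied = 0;
          while (!localBuffer.empty() && localBuffer.front().ts <= E) {
              applied = localBuffer.front().ts;  // in the real flow, also written to the capped oplog
              localBuffer.pop_front();
          }
          const std::int64_t F = source.latestTs();

          // 8) Today the remaining buffer (entries in (E, F]) is simply dropped.
          std::cout << "applied up to E=" << applied << "; discarding " << localBuffer.size()
                    << " already-fetched entries in (E=" << E << ", F=" << F << "]\n";

          // 9) Steady-state replication must now re-fetch starting from E.
          return 0;
      }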

      At the end of initial sync, the extra oplog entries in our oplog buffer (from E to F above) are simply thrown away instead of being transferred to the steady-state oplog buffer. By beginning fetching immediately and buffering fetched oplog entries in a collection capped only by the size of the disk on the initial syncing node, initial sync itself should almost never fail due to falling off the back of the sync source's oplog. That would only ever happen if the sync source were writing to the oplog faster than the initial syncing node could fetch oplog entries and write them to a local collection, without even applying them.
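
      A small sketch of the change suggested above, assuming the two buffers can be modeled as simple deques (handOffRemainingEntries and both buffer names are hypothetical): rather than dropping the temporary collection wholesale, the leftover entries newer than E could be moved to the steady-state buffer so they never have to be re-fetched.

      // Hypothetical hand-off of already-fetched entries; illustrative only.
      #include <cstdint>
      #include <deque>
      #include <iostream>

      struct OplogEntry {
          std::int64_t ts;
      };

      // Move every buffered entry newer than the initial sync end timestamp E into
      // the steady-state buffer; entries at or before E were already applied.
      void handOffRemainingEntries(std::deque<OplogEntry>& initialSyncBuffer,
                                   std::deque<OplogEntry>& steadyStateBuffer,
                                   std::int64_t initialSyncEndTs) {
          while (!initialSyncBuffer.empty()) {
              OplogEntry entry = initialSyncBuffer.front();
              initialSyncBuffer.pop_front();
              if (entry.ts > initialSyncEndTs) {
                  steadyStateBuffer.push_back(entry);
              }
          }
      }

      int main() {
          std::deque<OplogEntry> initialSyncBuffer{{158}, {159}, {160}, {161}, {162}};
          std::deque<OplogEntry> steadyStateBuffer;

          handOffRemainingEntries(initialSyncBuffer, steadyStateBuffer, /*initialSyncEndTs=*/160);

          // Steady-state replication could then resume from the newest carried-over
          // entry instead of re-fetching everything after E from the sync source.
          std::cout << "entries carried over: " << steadyStateBuffer.size() << "\n";  // prints 2
          return 0;
      }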

      However, consider what happens if we fetch E at wall-clock time A1 and complete initial sync at wall-clock time A2 (so we fetch F at time A2). We then throw away all of the oplog entries from E to F that we fetched between wall-clock times A1 and A2, and we have to refetch them. Thus at wall-clock time A2 we must still be able to fetch oplog entry E, even though the sync source has already written all of the way to F. This means that if, between wall-clock times A1 and A2, the sync source rolled over its oplog and threw away E for being too old, the initial syncing node will be unable to fetch from its sync source immediately after leaving initial sync. As a result, the minimum amount of oplog required on the sync source is E to F in this case, or the amount of oplog written between A1 and A2 in terms of wall-clock time if the oplog is growing at a steady rate. Since this rate is hard to calculate and relying on the bare minimum would be cutting it close, a sync source oplog significantly larger than E to F is advisable.
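
      As a back-of-the-envelope illustration with assumed numbers (none of these figures come from the ticket): if the sync source generates 2 GB of oplog per hour and the A1-to-A2 window is 6 hours, the bare minimum oplog needed to survive the re-fetch from E is 12 GB, and the paragraph above argues for keeping a comfortable margin beyond that.

      // Assumed write rate, window, and safety factor; purely illustrative.
      #include <iostream>

      int main() {
          const double oplogGBPerHour = 2.0;  // assumed sync source write rate
          const double hoursA1toA2 = 6.0;     // assumed wall-clock window from A1 to A2
          const double safetyFactor = 3.0;    // margin over the bare minimum

          const double bareMinimumGB = oplogGBPerHour * hoursA1toA2;
          const double recommendedGB = bareMinimumGB * safetyFactor;

          std::cout << "bare minimum oplog to survive the re-fetch: " << bareMinimumGB << " GB\n"
                    << "recommended with margin: " << recommendedGB << " GB\n";
          return 0;
      }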

      Now that the storage engine allows us to truncate the oldest oplog entries asynchronously whenever we are ready (unlike the old MMAPv1 engine, where the oplog truly had a fixed size), we are able to write all oplog entries into the real, capped oplog during initial sync by instructing the storage engine to ignore the cap, and then slowly shrink the oplog back to its desired size as we apply oplog entries and catch up to the primary.
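
      The toy class below only illustrates that idea of temporarily ignoring the cap and later shrinking the oplog back in small batches; the real mechanism would live in the storage engine, and none of these names or methods are actual server APIs.

      // Hypothetical capped-oplog model; illustrative only.
      #include <cstddef>
      #include <cstdint>
      #include <deque>
      #include <iostream>

      struct OplogEntry {
          std::int64_t ts;
      };

      class CappedOplog {
      public:
          explicit CappedOplog(std::size_t maxEntries) : _maxEntries(maxEntries) {}

          // During initial sync the cap is ignored and the oplog may grow freely.
          void setEnforceCap(bool enforce) { _enforceCap = enforce; }

          void insert(OplogEntry entry) {
              _entries.push_back(entry);
              if (_enforceCap) truncateOldest();
          }

          // Shrink back toward the configured size a little at a time (the real
          // storage engine can do this asynchronously, when it is convenient).
          void truncateOldest(std::size_t batch = 16) {
              while (_entries.size() > _maxEntries && batch-- > 0) {
                  _entries.pop_front();
              }
          }

          std::size_t size() const { return _entries.size(); }

      private:
          std::size_t _maxEntries;
          bool _enforceCap = true;
          std::deque<OplogEntry> _entries;
      };

      int main() {
          CappedOplog oplog(/*maxEntries=*/100);

          // Initial sync: ignore the cap and write every fetched entry into the oplog.
          oplog.setEnforceCap(false);
          for (std::int64_t t = 1; t <= 500; ++t) oplog.insert({t});
          std::cout << "after initial sync: " << oplog.size() << " entries\n";  // 500

          // Caught up to the primary: re-enable the cap and shrink back gradually.
          oplog.setEnforceCap(true);
          while (oplog.size() > 100) oplog.truncateOldest();
          std::cout << "after shrinking back: " << oplog.size() << " entries\n";  // 100
          return 0;
      }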

            Assignee: backlog-server-repl [DO NOT USE] Backlog - Replication Team
            Reporter: Judah Schvimer (judah.schvimer@mongodb.com)
            Votes: 10
            Watchers: 39
