[SERVER-32513] Initial sync unnecessarily throws away oplog entries Created: 02/Jan/18  Updated: 22/Aug/23  Resolved: 22/Aug/23

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Judah Schvimer Assignee: Backlog - Replication Team
Resolution: Won't Fix Votes: 10
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Problem/Incident
Related
related to SERVER-33866 Decouple oplog fetching from oplog ap... Closed
Assigned Teams:
Replication
Participants:
Case:

 Description   

Initial sync proceeds through the following phases (a simplified sketch of the flow appears after this list):
1) Get the "initial sync begin timestamp" (B). In 4.0 and earlier this is the most recent oplog entry on the sync source. In 4.2 we will get both the "initial sync fetch begin timestamp" (Bf) which will equal the oldest active transaction oplog entry on the sync source and the "initial sync apply begin timestamp" (Ba) which will be the most recent oplog entry on the sync source.
2) Start fetching oplog entries from B (or in 4.2 the Bf). Whenever an oplog entry is fetched, it is inserted into an uncapped local collection.
3) Clone all data, simultaneously creating indexes as we clone each collection
4) Get the "initial sync end timestamp" (E), which will be the most recent oplog entry on the sync source
5) Start applying oplog entries from B (in 4.0 and earlier) or Ba (in 4.2+). As each oplog entry is applied, it is also written into the real, capped oplog.
6) As we apply oplog entries, if we try to apply an update but do not have a local version of the document to update, we fetch that document from the sync source and get a new "initial sync end timestamp" by fetching the most recent oplog entry on the sync source again.
7) Stop both fetching and applying once we have applied up to the most recently set value of E (the "initial sync end timestamp"). Fetching has been running this entire time and has generally fetched far more oplog than necessary; say the last oplog entry fetched was at time F, such that F > E.
8) Drop the uncapped local collection
9) Leave initial sync and begin fetching from E
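A minimal sketch of this flow, using made-up timestamps and stand-in types rather than the real mongod interfaces, to make the roles of B, E, and F concrete:

{code:cpp}
// Simplified model of the phase ordering above (4.0-era names). All numbers,
// types, and variables here are illustrative only.
#include <cstdint>
#include <deque>
#include <iostream>

using Timestamp = std::uint64_t;

int main() {
    std::deque<Timestamp> bufferCollection;  // uncapped local collection from step 2

    Timestamp B = 10;                        // step 1: "initial sync begin timestamp"
    Timestamp F = 200;                       // the fetcher keeps running and races ahead to F
    for (Timestamp t = B; t <= F; t += 10)   // step 2: buffer everything fetched from B onward
        bufferCollection.push_back(t);

    // Steps 3-4: clone data and build indexes (elided), then read the
    // "initial sync end timestamp" E from the sync source.
    Timestamp E = 150;

    // Steps 5-7: apply buffered entries from B up through E, writing each one
    // into the real, capped oplog as it is applied.
    Timestamp lastApplied = B;
    while (!bufferCollection.empty() && bufferCollection.front() <= E) {
        lastApplied = bufferCollection.front();
        bufferCollection.pop_front();
    }

    // Step 8: whatever remains in (E, F] is dropped along with the buffer
    // collection -- the waste this ticket describes.
    std::cout << "applied through " << lastApplied << ", dropping "
              << bufferCollection.size() << " already-fetched entries between E="
              << E << " and F=" << F << "\n";

    // Step 9: leave initial sync; steady-state replication refetches from E.
}
{code}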

At the end of initial sync, the extra oplog entries in our oplog buffer (from E to F above) are simply thrown away instead of being transferred to the steady state oplog buffer. Because fetching begins immediately and the fetched oplog entries are buffered in a collection capped only by the size of the disk on the initial syncing node, initial sync itself should almost never fail due to falling off the back of the sync source's oplog. That would only happen if the sync source were writing to the oplog faster than the initial syncing node could fetch oplog entries and write them to a local collection, without even applying them.
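One possible shape of the improvement, as a sketch only (the function name, container types, and handoff point are assumptions, not an existing interface): drain the (E, F] tail of the initial sync buffer into the steady-state buffer before dropping the collection, so the fetcher can resume from F instead of E.

{code:cpp}
#include <cstdint>
#include <deque>

using Timestamp = std::uint64_t;

// Hypothetical handoff: move every already-fetched entry newer than E into the
// steady-state buffer, then let the caller drop the initial sync collection.
// Returns the timestamp the steady-state fetcher should resume from (F).
Timestamp handoffBufferedOplog(std::deque<Timestamp>& initialSyncBuffer,
                               std::deque<Timestamp>& steadyStateBuffer,
                               Timestamp E) {
    Timestamp resumeFrom = E;
    for (Timestamp ts : initialSyncBuffer) {
        if (ts > E) {
            steadyStateBuffer.push_back(ts);
            resumeFrom = ts;                 // ends up equal to F
        }
    }
    initialSyncBuffer.clear();               // safe to drop the collection now
    return resumeFrom;
}
{code}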

However, suppose we fetch E at wall-clock time A1 and complete initial sync at wall-clock time A2 (so we fetch F at time A2). We then throw away all oplog entries from E to F that we fetched between A1 and A2, and must refetch all of them. That means that at wall-clock time A2 we must still be able to fetch oplog entry E, even though the sync source has already written all the way to F. If, between A1 and A2, the sync source rolled over its oplog and discarded E for being too old, the initial syncing node will be unable to fetch from its sync source immediately after leaving initial sync. The minimum amount of sync source oplog required is therefore E to F, or, if the oplog grows at a steady rate, the amount of oplog written between wall-clock times A1 and A2. Since that rate is hard to estimate and relying on it would be cutting it close, a sync source oplog significantly larger than E-F is advisable.
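A back-of-the-envelope version of that sizing argument, with invented numbers purely for illustration:

{code:cpp}
#include <iostream>

int main() {
    // Assume the sync source writes 50 MB of oplog per minute and the apply /
    // catch-up window (wall-clock A1 to A2) lasts 30 minutes. The E-to-F range
    // the sync source must still retain at time A2 is then roughly:
    double writeRateMBPerMin = 50.0;
    double windowMinutes = 30.0;
    std::cout << "minimum retained oplog ~= " << writeRateMBPerMin * windowMinutes
              << " MB (E to F), before any safety margin\n";   // prints 1500 MB
}
{code}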

Now that the storage engine lets us truncate the oldest oplog entries asynchronously whenever we are ready (unlike MMAPv1, which truly had a fixed size), we could write all oplog entries into the real, capped oplog during initial sync: instruct the storage engine to ignore the cap while initial sync is running, then slowly shrink the oplog back to its desired size as we apply oplog entries and catch up to the primary.
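A rough sketch of that idea, using a stand-in type rather than the real storage-engine interface: suspend cap enforcement while initial sync runs, then truncate the oldest entries back down once the node has caught up.

{code:cpp}
#include <cstddef>
#include <cstdint>
#include <deque>

using Timestamp = std::uint64_t;

// Toy model of a capped oplog whose cap can be temporarily ignored.
struct ToyOplog {
    std::deque<Timestamp> entries;
    std::size_t cappedEntries = 1000;
    bool enforceCap = true;

    void insert(Timestamp ts) {
        entries.push_back(ts);
        if (enforceCap) truncateToCap();
    }
    void truncateToCap() {
        while (entries.size() > cappedEntries) entries.pop_front();  // drop oldest
    }
};

int main() {
    ToyOplog oplog;
    oplog.enforceCap = false;                  // initial sync: let the oplog exceed its cap
    for (Timestamp ts = 1; ts <= 5000; ++ts)   // write every fetched entry straight in
        oplog.insert(ts);

    // After applying and catching up to the primary, re-enable the cap and
    // shrink back (all at once here; gradually in the real proposal).
    oplog.enforceCap = true;
    oplog.truncateToCap();
}
{code}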



 Comments   
Comment by Spencer Brody (Inactive) [ 06/Mar/18 ]

I think the way to go about doing this would be to make the oplog buffer used during initial sync just be the oplog instead of a separate collection. It was made a separate collection originally so that it could be uncapped and grow to unbounded size during initial sync. We now, however, have the ability to resize the oplog dynamically, so we could just let the oplog grow unbounded during initial sync, then after initial sync set it to its configured size. Then after leaving initial sync, the normal steady state oplog application logic would kick in, seeing that the lastOpApplied is behind the top of the oplog, so we'd finish applying all the oplog we already have before fetching any new oplog.
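A minimal sketch of that steady-state catch-up step, with stand-in names (drainLocalBacklog is hypothetical, not an existing function): the backlog already sitting in the local oplog is applied before the fetcher contacts the sync source again, so nothing fetched during initial sync has to be refetched.

{code:cpp}
#include <cstdint>
#include <deque>

using Timestamp = std::uint64_t;

// Hypothetical helper: apply everything already in the local oplog that is
// newer than lastApplied, and return the point the fetcher should resume from.
Timestamp drainLocalBacklog(const std::deque<Timestamp>& oplog, Timestamp lastApplied) {
    for (Timestamp ts : oplog) {
        if (ts > lastApplied) {
            lastApplied = ts;   // stand-in for actually applying the entry
        }
    }
    return lastApplied;
}
{code}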

While this would definitely be an improvement, it still wouldn't make initial sync completely independent of the size of the sync source's oplog. After finishing initial sync we may have a large backlog of ops fetched during initial sync that we still need to apply, and we won't fetch any new oplog entries from our sync source until we've caught up applying the ops we already have.
