[SERVER-45442] Mitigate oplog impact for findAndModify commands executed with retryWrites=true Created: 09/Jan/20 Updated: 27/Oct/23 Resolved: 16/Mar/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 5.1.0, 4.2.16, 4.0.27, 5.0.3, 4.4.9 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Jeremy Mikola | Assignee: | Alan Zheng |
| Resolution: | Gone away | Votes: | 9 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
The drivers' retryable writes specification has historically only permitted the feature to be configured at the MongoClient level, in order to limit API changes (vs. granular configuration on a database, collection, or per-operation basis). Therefore, it's difficult for applications to work around this and disable retryable writes just for findAndModify operations without off-loading all of those operations to a dedicated MongoClient. That said, the side effect of findAndModify with retryable writes is not very intuitive. Given that retryable writes is advertised as a "set it and forget it" feature, I don't think most users would even consider this side effect if they were not watching their oplog activity (as was the case in the linked HELP ticket). In that HELP ticket, schwerin mentioned that "our current implementation of retryable writes isn't the only viable one", so I'm opening this SERVER ticket to start a discussion on how we could alleviate this side effect for users without requiring them to change their applications or disable retryable writes altogether. |
| Comments |
| Comment by Daniel Gottlieb (Inactive) [ 30/Mar/22 ] | ||||||||||||||||||||||||
|
Thanks for sharing your use case! Coincidentally, my experience using findAndModify was also for a distributed lock. It's interesting to consider whether that use-case offers a more targeted alternative. | ||||||||||||||||||||||||
| Comment by Adrien Jarthon [ 25/Mar/22 ] | ||||||||||||||||||||||||
|
Ok thanks, very clear explanation! Indeed the avoided double write could be interesting here; I'll have to weigh whether it's worth reverting my code or not. My old code using "find_one_and_update" was simpler and more robust; the new code I had to write to avoid this has to do some optimistic locking and is less robust, but it's been working OK for a couple of years now. If I have the motivation to do some testing I'll share the numbers. About my use case: it needs only a very small piece of data from the modified record, and I am using without() to exclude the big fields (mostly one, an embeds_many, which ranges from ~1MB to ~10MB). It looked like this:
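(The original Mongoid snippet did not survive the export. As a rough, hypothetical sketch of the shape of such a call, written here with pymongo-style dicts; all field names are invented for illustration:)

```python
# Hypothetical sketch: an atomic update whose projection excludes the large
# embedded field from the returned document. Names are invented for illustration.
filter_ = {"_id": 42}
update = {"$set": {"status": "up", "last_checked_at": "2022-03-25T00:00:00Z"}}
projection = {"big_field": 0}  # exclude the ~1-10MB embeds_many field

# With pymongo, the call would look something like:
# coll.find_one_and_update(filter_, update, projection=projection)
```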
The big embeds_many field is often updated too, but in an efficient way using $push and $slice (in a normal update, not a find_one_and_update), so only the appended data is sent over the network (and into the oplog). It looks like this:
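(This snippet was also stripped from the export; a hypothetical sketch of the capped-append update described, again with invented names:)

```python
# Hypothetical sketch: $push with $each + $slice appends new entries and trims
# the array in one update, so only the delta travels over the network and into
# the oplog (rather than rewriting the whole ~1-10MB field).
new_entries = [{"t": "2022-03-25T00:00:00Z", "ok": True}]
update = {"$push": {"big_field": {"$each": new_entries, "$slice": -1000}}}

# coll.update_one({"_id": 42}, update)  # a plain update, not find_one_and_update
```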
So with this I managed to get an efficient atomic update with find_one_and_update (by excluding the big fiels) and then an efficient update of this big field using push+slice in terms of network traffic between client and server (which is important for me too because I'm querying over internet) and oplog. But the only problem after that was the big write I did not manage to reduce caused by this noop which represented more than 60% of my disk writes (despite this find_one_and_update operation being much smaller than the push+slice operation for example). So to answer your question: yes an optimisation regarding findAndModify with excluded ressources would definitely help in my case. | ||||||||||||||||||||||||
| Comment by Daniel Gottlieb (Inactive) [ 25/Mar/22 ] | ||||||||||||||||||||||||
I expect the graphs from the above comment would look the same (as those specific ones refer to the oplog), but this statement is correct (modulo an error I made when describing total data written to disk – I will address that in a moment).
Correct, the change we made was targeting reducing the replication overhead.
You have it right. We send the delta (an oplog entry) over the network. Secondaries, upon seeing `needsRetryImage: preImage` will write the pre-image to their own "side collection". They don't need the pre-image sent over the network (via the oplog or other hypothetical means) because they already have the document locally.
All of the (?)s were accurate based on what I had described. The only (presumably clerical) mistake was the network out on the primary when the optimization is enabled: that row should simply be 2*delta, which matches the 1*delta (in) for each of the secondaries. Conservation of network bytes! Now factoring in the detail I omitted (which makes the optimization a bit more favorable): writes to the oplog are actually written twice to disk, once for the Journal/WAL (which happens "immediately") and a second time when the oplog is next checkpointed (typically within a minute). So in the classic configuration, with the feature disabled, we need to add another 10MB of disk write for both the primary and the secondaries. Technically, the deltas are also "doubled" for the same reason. The double-writing is not true for the side collection, so its 10MB is only counted once, when the data is checkpointed. Pardon the, uh, "creative" use of parentheses to illustrate whether a piece of data is duplicated when going to disk. Factoring that detail into your tables gives (optimization off):
And the optimization on (with the corrected network usage):
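(The corrected tables did not survive the export. As a rough sketch of the accounting described above, under the stated assumptions: a 10MB pre-image, a small delta, oplog bytes hitting disk twice via journal plus checkpoint, and side-collection bytes counted once at checkpoint:)

```python
MB = 1024 * 1024
IMAGE = 10 * MB   # assumed pre-image size from the discussion above
DELTA = 512       # assumed size of the small update delta
SECONDARIES = 2

def disk_writes_per_node(feature_on: bool) -> int:
    """Approximate bytes written to disk on each node for one retryable findAndModify."""
    if feature_on:
        # Oplog carries only the delta (journal + checkpoint = x2);
        # the pre-image goes to the side collection and is checkpointed once.
        return 2 * DELTA + IMAGE
    # Oplog carries the delta entry plus the 10MB noop image, both doubled.
    return 2 * (DELTA + IMAGE)

def network_out_primary(feature_on: bool) -> int:
    """Bytes the primary sends to its secondaries for this one write."""
    per_secondary = DELTA if feature_on else DELTA + IMAGE
    return SECONDARIES * per_secondary

# Each node saves roughly one full image worth of disk writes with the feature on:
assert disk_writes_per_node(False) - disk_writes_per_node(True) == IMAGE
```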
So it seems that this feature is a step in the right direction for reducing some disk wear, but I can understand it may not be sufficient for your goals. Out of curiosity, how much data does your application need from the document being updated? Do you use a projection such that only a small bit of data is actually returned? As I mentioned before, we don't optimize that, but if we were to revisit the resource usages surrounding retryable findAndModify commands, having more data points is a benefit. | ||||||||||||||||||||||||
| Comment by Adrien Jarthon [ 25/Mar/22 ] | ||||||||||||||||||||||||
|
Thanks for this detailed explanation, it's much clearer indeed! For the record, my problem was described here: https://jira.mongodb.org/browse/SERVER-45442?focusedCommentId=3193246&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-3193246 and it pretty much matches this case: big documents (1-10MB) with very small but frequent updates. As you can see in that previous comment, after finding this in the oplog and working to remove the findAndModify call, my disk writes dropped threefold. The constraint I was mostly interested in reducing was the amount of disk writes, which was wearing down my SSDs too quickly; the reduced network traffic for replication and the increased oplog window were nice side effects too, but less important. Disk reads were not causing any concern in my case, but thanks for also providing those numbers; it's always valuable for later or for other people reading this. From what you explained, I understand that the document being written to the side collection will not reduce disk writes on the primary (we still need to write 10MB + delta, just in a different place), correct? But it would have a good impact on replication traffic and also on writes on the secondaries, I assume? Because it looks like the pre-image is not replicated? Or maybe only the delta is sent over the network, but the secondaries still need to write the pre-image to their own "side collection"? This part isn't very clear yet, but otherwise I think I almost got it. Here is a little recap for this scenario (with the sideCollection option disabled, so the default):
Here is a little recap for this scenario (with the sideCollection option enabled):
Could you please confirm or correct the (?)s above? | ||||||||||||||||||||||||
| Comment by Daniel Gottlieb (Inactive) [ 24/Mar/22 ] | ||||||||||||||||||||||||
|
I couldn't say with certainty if it would help you as I'm not sure which resource constraints your application is currently bumping up against when performing retryable findAndModify calls. Maybe to clear up a confusion:
The collection is persisted, but it replicates differently (otherwise you'd be correct that there's no change in resource utilization). Maybe I'd instead say that the "side" collection "logically" replicates, but not "physically". I feel you'll get an intuition for the behavior change (and consequently how resources get used differently) by looking at what gets written to the oplog. Without this feature a retryable findAndModify (for a basic update that returns the preImage to the client) generates the following (pseudo-code) oplog entries:
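(The pseudo-code entries were lost in the export. Roughly, with field names abridged and values hypothetical, the classic behavior writes an update entry plus a companion no-op entry whose payload is the entire pre-image:)

```python
# Simplified sketch of the two oplog entries (not the exact server format).
noop_entry = {
    "op": "n",                                  # no-op carrying the full pre-image
    "o": {"_id": 1, "binaryBlob": "<~10MB>"},   # the whole document before the update
}
update_entry = {
    "op": "u",                                  # the update itself; a small delta
    "ns": "test.coll",
    "o": {"$set": {"counter": 1}},
    "preImageOpTime": "<opTime of noop_entry>", # the retry path follows this link
}
```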
With the feature enabled we instead get:
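(Again a rough sketch, same caveats as above: a single oplog entry, with no 10MB no-op companion:)

```python
# Simplified sketch (not the exact server format): the update entry alone,
# flagged so secondaries save their local copy of the document to the side
# collection instead of receiving the pre-image over the network.
update_entry = {
    "op": "u",
    "ns": "test.coll",
    "o": {"$set": {"counter": 1}},
    "needsRetryImage": "preImage",
}
```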
But as you noted, we still perform a write that's equivalent in size to the "noop" write, just to a different collection. Before comparing why this might be better, we do have to note that secondaries applying an oplog entry with needsRetryImage are doing some extra work (broadly speaking, a CPU cost). Performing an update already has to seek to the document to change, but with the feature enabled, we are now passing the existing document back up the stack so we can write out this "image" to this new "side collection". For a resource comparison, let's exaggerate our example some. Let's continue to update a small field, but make the preImage large. This exaggeration will (1) save me some typing, as it will be easier to see which of the following costs are specific to the image (and wash away the costs of a comparatively small delta), and (2) is a real scenario where customers are seeing benefits. E.g.:
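(The example itself was stripped from the export; a hypothetical shape for it, with invented names:)

```python
# Hypothetical: a tiny field update against a document dominated by one big blob.
doc = {"_id": 1, "counter": 0, "binaryBlob": "x" * (10 * 1024 * 1024)}  # ~10MB

command = {
    "findAndModify": "coll",
    "query": {"_id": 1},
    "update": {"$set": {"counter": 1}},
    "fields": {"binaryBlob": 0},  # the client projects the blob out of the response...
    "new": False,                 # ...and asks for the pre-image back
}
```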
As a quick caveat: unfortunately, even for applications that project out `binaryBlob: 0`, the whole (in this example) 10MB preImage is saved. We have options to address that, but it is a more difficult optimization (in the interest of making things perfectly safe/correct). Edit: the usages below omitted a detail, namely that writes to the oplog go to disk twice; a later comment factors in that detail. That optimization aside, with the feature disabled, let's enumerate the major costs to consider for a replica set with 1 primary and 2 secondaries:
With the feature enabled:
I appreciate the question and the opportunity it affords me to communicate more thoroughly how this works. I hope that information helps you better evaluate whether the behavior change will result in tangible gain for your workload. [1] There's some more complexity behind "how much data is 'read' for replicating". What's described above is the best case for the feature being "disabled". The numbers listed are correct for how many bytes we have to read from the storage engine cache and write back over the network. Those numbers are also ideal when the write to the primary is immediately followed by replication reading the new oplog entries: the data read from the cache is "warm" because it was just written. In more problematic scenarios, secondaries lag enough for the storage engine to evict the large "noop" images in the oplog from cache. Thus when a secondary does request those oplog entries, there's an additional "cold" storage read (often resulting in more cache eviction of different data). While we don't avoid writing the findAndModify pre-image on the primary (and thus displacing some cached data), it's overwhelmingly the case that retryable writes don't need to be retried (alternatively, the cluster is generally unhealthy and the application already has a problem). Thus for common situations, we can assume the cost of reading out the image is effectively 0. Both cases (feature enabled and disabled) scale similarly if there are lots of retries. | ||||||||||||||||||||||||
| Comment by Adrien Jarthon [ 24/Mar/22 ] | ||||||||||||||||||||||||
|
Thanks Daniel for this very valuable addition, I could have looked for a while without knowing this ^^ From reading the storeFindAndModifyImagesInSideCollection doc I see: "the temporary documents are stored in the side collection". So now I'm wondering: OK, with this option the document is no longer in the oplog but in a side collection, but does this change the resources used much? If this collection is also persisted and replicated, I wouldn't expect a positive impact. I won't spend much time trying this if it means going back to the same amount of disk writes as before. Do you have any more details about this? What does this option improve in the end? Thanks! | ||||||||||||||||||||||||
| Comment by Daniel Gottlieb (Inactive) [ 23/Mar/22 ] | ||||||||||||||||||||||||
|
Filling in some more details for enabling the feature. The versions listed are correct (mostly: I'm modifying them to include some bug fixes), but the feature is not enabled by default, because it includes a data format change that can in some cases inhibit downgrading (or upgrading to the next major release, but to a version that doesn't have support). The feature is slated to be enabled by default for the 6.0 release. This feature can be turned on by setting the setParameter storeFindAndModifyImagesInSideCollection to true for all nodes in the replica set. That documentation link is for 5.1, but the invocation with setParameter is the same for those versions as well. As I said, there's a format change, and we've formally put documenting the upgrade/downgrade procedure into motion. When downgrading (or "upgrading") to a version that doesn't support this feature (e.g. from 4.2.27 to 4.2.0, or from 4.2.27 to 4.4.0), one must:
Failure to successfully follow the above steps should result in the following error message at startup:
To resolve the situation, one must restart using the prior binary with the feature disabled and wait longer while the replica set is healthy. I hope this information helps. Let me know if there are any further questions. | ||||||||||||||||||||||||
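(For reference, the enablement described above can be sketched as follows; the parameter name is taken from the comment, and this is a sketch rather than the official upgrade procedure:)

```shell
# On every node of the replica set, either at startup:
mongod --replSet rs0 --setParameter storeFindAndModifyImagesInSideCollection=true

# or at runtime against a running node:
mongo --eval 'db.adminCommand({setParameter: 1, storeFindAndModifyImagesInSideCollection: true})'
```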
| Comment by Daniel Pasette (Inactive) [ 23/Mar/22 ] | ||||||||||||||||||||||||
|
I'll admit that it's a strange workflow and doesn't give users good visibility into what happened with the issue. I've reached out to our TPM's to see if we can improve this workflow. Thanks! | ||||||||||||||||||||||||
| Comment by Adrien Jarthon [ 23/Mar/22 ] | ||||||||||||||||||||||||
|
Definitely; as you can see in my previous comment, I had to stop using findAndModify because it was generating 3 times more oplog and disk writes when doing so (just because of one call), since I am doing very small updates to big documents often. Thanks for the versions! Can't you put them in "Fix Version/s" at least? Or does your Jira workflow prevent this too? | ||||||||||||||||||||||||
| Comment by Daniel Pasette (Inactive) [ 23/Mar/22 ] | ||||||||||||||||||||||||
|
Hi Adrien, I'm glad this work will benefit you! The reason for "Gone Away" rather than "Fixed" is a quirk of our Jira reporting: we only mark tickets as "Fixed" when they have a real commit associated with them, and because the "fix" for this issue actually spans multiple work tickets/commits, there is no easy way to mark this ticket as a "Duplicate", which we would ordinarily do for a simpler issue. The complete fix was backported to 4.0.27, 4.2.16, 4.4.9, and 5.0.3. | ||||||||||||||||||||||||
| Comment by Adrien Jarthon [ 22/Mar/22 ] | ||||||||||||||||||||||||
|
That is great news! In which backported versions is the fix present exactly? It would be good to specify those in the "Fix versions" of the ticket for other people looking (this also applies to the related tickets). Thanks, I'll give this a try and maybe I'll be able to put back my findAndModify to simplify the code again
| ||||||||||||||||||||||||
| Comment by Judah Schvimer [ 16/Mar/22 ] | ||||||||||||||||||||||||
|
As of MongoDB 5.1, we no longer store findAndModify preimages in the oplog. As such, I'm closing this ticket as Gone Away. The fix was backported to 4.0, 4.2, 4.4, and 5.0 as well. | ||||||||||||||||||||||||
| Comment by Asya Kamsky [ 31/Jul/20 ] | ||||||||||||||||||||||||
|
Note that | ||||||||||||||||||||||||
| Comment by Adrien Jarthon [ 05/Jun/20 ] | ||||||||||||||||||||||||
|
Hi, just casting my vote on this one. I had a workload with some big documents that are often updated (but only updating small attributes), and I used find_and_modify to do atomic updates. When I realized I was generating more than 30GB of oplog per hour for a database whose size is around 30GB, I was quite surprised. Looking inside the oplog, I noticed these huge no-op instructions sending my entire document even though it didn't change, and searching online I found this ticket. So I tried to rewrite my code to eliminate the atomic operation and replace the find_and_modify with a find; it made the code a bit more complex, but: → almost 3 times fewer GB/hour! It would be better if I could use atomic operations though, so +1 if this can be improved at some point | ||||||||||||||||||||||||
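(The rewrite alluded to above, replacing find_and_modify with a read followed by a conditional update, is commonly done with optimistic locking. A minimal, hypothetical sketch of the pattern, not Adrien's actual code, using an in-memory dict as a stand-in for the stored document:)

```python
# In-memory stand-in for the conditional update the driver would issue, e.g.
# update_one({"_id": ..., "version": expected}, {"$set": ..., "$inc": {"version": 1}})
def try_update(doc: dict, expected_version: int, changes: dict) -> bool:
    if doc["version"] != expected_version:
        return False          # someone else won the race; re-read and retry
    doc.update(changes)
    doc["version"] += 1       # bump the version so concurrent writers fail
    return True

doc = {"_id": 1, "status": "down", "version": 7}
assert try_update(doc, 7, {"status": "up"})        # version matches: applied
assert not try_update(doc, 7, {"status": "down"})  # stale version: rejected
```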
| Comment by Garaudy Etienne [ 05/Jun/20 ] | ||||||||||||||||||||||||
|
back to scheduling please | ||||||||||||||||||||||||
| Comment by Garaudy Etienne [ 04/Mar/20 ] | ||||||||||||||||||||||||
|
Hi jmikola, This is not something we can currently address due to other competing priorities for our team. Do keep us updated if more users flag similar issues so we can prioritize accordingly! | ||||||||||||||||||||||||
| Comment by Esha Maharishi (Inactive) [ 14/Feb/20 ] | ||||||||||||||||||||||||
|
Note: one workaround we could do in the future is to apply the query projection so that we only record in the pre- or post-image the fields that were projected. | ||||||||||||||||||||||||
| Comment by Carl Champain (Inactive) [ 09/Jan/20 ] | ||||||||||||||||||||||||
|
Hi jmikola, Passing this ticket along to the appropriate team. | ||||||||||||||||||||||||
| Comment by Glen Miner [ 09/Jan/20 ] | ||||||||||||||||||||||||
|
In the short term it would be extremely helpful to have a server config var to disable retryWrites; I'm worried about programmers inadvertently swamping our oplog and causing secondaries to fall into recovering. |