[SERVER-45442] Mitigate oplog impact for findAndModify commands executed with retryWrites=true Created: 09/Jan/20  Updated: 27/Oct/23  Resolved: 16/Mar/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.1.0, 4.2.16, 4.0.27, 5.0.3, 4.4.9

Type: Improvement Priority: Major - P3
Reporter: Jeremy Mikola Assignee: Alan Zheng
Resolution: Gone away Votes: 9
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File screenshot-1.png    
Issue Links:
Related
related to SERVER-45653 avoid writing unneeded fields to no-o... Closed
is related to PHPC-1523 findAndModify generates 50-100x more ... Closed
Participants:
Case:

 Description   

The OP in PHPC-1523 reported that an application performing many findAndModify commands on large documents can quickly fill its oplog when retryable writes are enabled (as is the case by default since the drivers' 4.2-compatible releases).

The drivers' retryable writes specification has historically only permitted the feature to be configured at the MongoClient level in order to limit the API changes (as opposed to granular configuration at the database, collection, or per-operation level). Therefore, it's difficult for applications to work around this and disable retryable writes just for findAndModify operations without offloading all of those operations to a dedicated MongoClient.
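For illustration, such a workaround might look like the following (mongosh syntax; hosts, database, and collection names are placeholders):

// Illustrative only: a second connection with retryable writes disabled,
// used solely for the findAndModify-heavy code paths.
const noRetryConn = new Mongo("mongodb://host1,host2,host3/?replicaSet=rs0&retryWrites=false");
const appDb = noRetryConn.getDB("app");
appDb.checks.findAndModify({
  query:  { _id: 1 },
  update: { $set: { locked_at: new Date() } },
  new:    true
});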

That said, the side effect of findAndModify with retryable writes is not very intuitive. Given that retryable writes is advertised as a "set it and forget it" feature, I don't think most users would even consider this side effect if they were not watching their oplog activity (as was the case in PHPC-1523). As for documentation, I suppose we could note this in either the retryable writes or findAndModify docs (or both), but it does seem very specific and could easily be missed.

In a linked HELP ticket, schwerin mentioned that "our current implementation of retryable writes isn't the only viable one", so I'm opening this SERVER ticket to start a discussion on how we could alleviate this side effect for users without requiring them to change their applications or disable retryable writes altogether.



 Comments   
Comment by Daniel Gottlieb (Inactive) [ 30/Mar/22 ]

Thanks for sharing your use case! Coincidentally, my experience using findAndModify was also for a distributed lock. It's interesting to consider whether that use-case offers a more targeted alternative.

Comment by Adrien Jarthon [ 25/Mar/22 ]

Ok thanks, very clear explanation!

Indeed, avoiding the double write could be interesting here; I'll have to weigh whether it's worth reverting my code or not. My old code using "find_one_and_update" was simpler and more robust; the new code I had to write to avoid this has to do some optimistic locking and is less robust, but it's been working OK for a couple of years now. If I find the motivation to do some testing, I'll share the numbers.

About my use case: it only needs a very small piece of data from the modified record, so I am using without() to exclude the big fields (mostly one, an embeds_many, which ranges from ~1MB to ~10MB). It looked like this:

Check.where(id: check_id, next_check_at: next_check_at, locked_at: nil)
     .without(:results)
     .find_one_and_update({ "$set" => { locked_at: now, locked_by: hostname } }, return_document: :after)

The big embeds_many field is often updated too, but in an efficient way using $push and $slice (in a normal update, not a find_one_and_update), so only the appended data is sent over the network (and written to the oplog). It looks like this:

collection.update_one(
  { _id: id },
  { '$push' => { results: { '$each' => Array(results).map(&:as_document), '$slice' => cap, '$sort': { time: -1 } } } }
)

So with this I managed to get an efficient atomic update with find_one_and_update (by excluding the big fields) and an efficient update of the big field using push+slice, both in terms of network traffic between client and server (which is important for me too because I'm querying over the internet) and of oplog volume. The only problem left was the big write I could not reduce, caused by this no-op entry, which represented more than 60% of my disk writes (even though the find_one_and_update operation is much smaller than the push+slice operation, for example).

So to answer your question: yes, an optimization for findAndModify with excluded fields would definitely help in my case.
Also, as you can see, my use case for findAndModify is a rather classic distributed lock, so if there is (or will be) a better/more native way to do distributed locks in MongoDB, that could also help.

Comment by Daniel Gottlieb (Inactive) [ 25/Mar/22 ]

From what you explained I understand that the document being written to the side collection will not reduce disk writes on the Primary (we still need to write 10MB + delta, it's just in a different place), correct?

I expect the graphs from the above comment would look the same (as those specific ones refer to the oplog), but this statement is correct (modulo an error I made when describing total data written to disk – I will address that in a moment).

From what you explained I understand that the document being written to the side collection will not reduce disk writes on the Primary (we still need to write 10MB + delta, it's just in a different place), correct? But it would have a good impact on replication traffic and also writes on Secondary I assume? because it looks like the pre-image is not replicated?

Correct, the change we made was targeting reducing the replication overhead.

or maybe only the delta is sent over network but the secondaries will still need to write the pre-image to their own "side collection"? This part isn't very clear yet but otherwise I think I almost got it

You have it right. We send the delta (an oplog entry) over the network. Secondaries, upon seeing `needsRetryImage: preImage`, will write the pre-image to their own "side collection". They don't need the pre-image sent over the network (via the oplog or other hypothetical means) because they already have the document locally.
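If you want to see this for yourself on a test deployment, the marker is visible by reading the oplog directly (illustrative sketch; the namespace is made up):

// Illustrative: fetch the most recent update oplog entry for a collection; with the
// feature enabled it carries needsRetryImage instead of pointing at a no-op pre-image entry.
db.getSiblingDB("local").oplog.rs
  .find({ op: "u", ns: "app.checks" })
  .sort({ ts: -1 })
  .limit(1)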

Could you please confirm or correct the (?)s? Thanks a lot!

All of the (?)s were accurate based on what I had described. The only (presumably clerical) mistake was the network out on the primary when the optimization is enabled. That row should simply be 2*delta. This would match the 1*delta (in) for each of the secondaries. Conservation of network bytes!

Now factoring in the detail I omitted (which makes the optimization a bit more favorable): writes to the oplog are actually written twice to disk. Once for the Journal/WAL (happens "immediately") and a second time when the oplog is next checkpointed (typically within a minute). So in the classic configuration with the feature disabled we need to add another 10MB of disk write for both the primary and secondaries. Technically, the deltas are also "doubled" for the same reason.

The double-writing is not true for the side collection. So the 10MB is only counted once, when the data is checkpointed. Pardon the, uh, "creative" use of parentheses to illustrate whether a piece of data is duplicated when going to disk.

Factoring that detail into your tables gives (optimization off):

             Primary                 Secondaries
Disk write   2*(10MB + delta)        20MB + delta
Disk read    30MB + 2*delta          10MB
Network      20MB + 2*delta (out)    10MB + delta (in)

And the optimization on (with the corrected network usage):

             Primary                 Secondaries
Disk write   1*(10MB) + 2*(delta)    10MB + delta
Disk read    30MB + 2*delta          10MB
Network      2*delta (out)           delta (in)

So it seems that this feature is a step in the right direction for reducing some disk wear, but I can understand it may not be sufficient for your goals. Out of curiosity, how much data does your application need from the document being updated? Do you use a projection such that only a small bit of data is actually returned? As I mentioned before, we don't optimize that, but if we were to revisit the resource usages surrounding retryable findAndModify commands, having more data points is a benefit.

Comment by Adrien Jarthon [ 25/Mar/22 ]

Thanks for this detailed explanation, it's much clearer now!

For the record, my problem was described here: https://jira.mongodb.org/browse/SERVER-45442?focusedCommentId=3193246&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-3193246 and it pretty much matches this case: big documents (1-10MB) with very small but frequent updates. As you can see in that previous comment, after finding this in the oplog and reworking my code to remove the findAndModify call, my disk writes dropped threefold.

The constraint I was most interested in reducing was the amount of disk writes, which was wearing down my SSDs too quickly; the reduced network traffic for replication and the increased oplog window were nice side effects too, but less important. Disk reads were not causing any concern in my case, but thanks for also providing those numbers; it's always valuable for later or for other people reading this.

From what you explained I understand that the document being written to the side collection will not reduce disk writes on the Primary (we still need to write 10MB + delta, it's just in a different place), correct? But it would have a good impact on replication traffic and also writes on Secondary I assume? because it looks like the pre-image is not replicated? or maybe only the delta is sent over network but the secondaries will still need to write the pre-image to their own "side collection"? This part isn't very clear yet but otherwise I think I almost got it

Here is a little recap for this scenario (with the sideCollection option disabled, i.e. the default):

             Primary                 Secondaries (each)
Disk write   10MB + delta            10MB + delta
Disk read    30MB + 2*delta          10MB
Network      20MB + 2*delta (out)    10MB + delta (in)

Here is a little recap for this scenario (with the sideCollection option enabled):

             Primary                 Secondaries
Disk write   10MB + delta            10MB + delta
Disk read    30MB + 2*delta          10MB
Network      20MB + 2*delta (out)    delta (in)

Could you please confirm or correct the (?)s? Thanks a lot!

Comment by Daniel Gottlieb (Inactive) [ 24/Mar/22 ]

I couldn't say with certainty if it would help you as I'm not sure which resource constraints your application is currently bumping up against when performing retryable findAndModify calls. Maybe to clear up a confusion:

if this collection is also persisted and replicated I wouldn't expect a positive impact

The collection is persisted, but it replicates differently (otherwise you'd be correct that there's no change in resource utilization). Maybe I'd instead say that the "side" collection "logically" replicates, but not "physically". I feel you'll get an intuition for the behavior change (and consequently how resources get used differently) by looking at what gets written to the oplog.
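For orientation, the kind of client-side call being discussed might look like this (mongosh syntax; the collection name is illustrative), issued on a connection with retryWrites=true:

// Illustrative: a retryable findAndModify that returns the pre-image
// ({_id: 1, inc: 4} in the entries below) while setting inc to 5.
db.test.findAndModify({
  query:  { _id: 1 },
  update: { $set: { inc: 5 } },
  new:    false   // return the document as it was before the update
})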

Without this feature a retryable findAndModify (for a basic update that returns the preImage to the client) generates the following (pseudo-code) oplog entries:

{op: "noop", ts: Timestamp(10), findAndModifyPreImage: {_id: 1, inc: 4}}}
{op: "update", ts: Timestamp(11), o2: {_id: 1}, o: {$set: {inc: 5}}, preImageOpTime: Timestamp(10)}

With the feature enabled we instead get:

{op: "update", ts: Timestamp(10), o2: {_id: 1}, o: {$set: {inc: 5}}, needsRetryImage: "preImage"}

But as you noted, we still perform a write that's equivalent in size to the "noop" write, just to a different collection. Before comparing why this might be better, we do have to note that secondaries applying an oplog entry with needsRetryImage do some extra work (broadly speaking, a CPU cost). Performing an update already requires seeking to the document being changed, but with the feature enabled we now pass the existing document back up the stack so we can write this "image" out to the new "side collection".

For a resource comparison, let's exaggerate our example some. Let's continue to update a small field, but make the preImage large. This exaggeration will (1) save me some typing as it will be easier to see which of the following costs are specific to the image (and wash away the costs for a comparatively small delta) and (2) is a real scenario where customers are seeing benefits. E.g:

{_id: 1, inc: 11, binaryBlob: <10MB of BinData>}

As a quick caveat: unfortunately, even for applications that project out `binaryBlob: 0`, the whole (in this example) 10MB preImage is saved. We have options to address that, but it is a more difficult optimization (in the interest of making things perfectly safe/correct).
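For example, a call like this (illustrative mongosh; names are made up) still records the full 10MB pre-image, even though the client only receives the small fields back:

// Illustrative: the projection trims what the client receives, but (today) not
// what gets saved as the retryable-write image.
db.test.findAndModify({
  query:  { _id: 1 },
  update: { $inc: { inc: 1 } },
  fields: { binaryBlob: 0 },   // hides the large field from the response only
  new:    false
})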

Edit: The following usages omitted a detail: writes to the oplog go to disk twice. This comment factors in that detail.

That optimization aside, with the feature disabled, let's enumerate the major costs to consider for a replica set with 1 primary and 2 secondaries:

  • 10MB (the image) + delta is written on the primary. All to the oplog.
  • 30MB of data is read from the primary. 10MB to change the document being updated. 20MB due to each secondary reading the 10MB noop image oplog entry. [1]
  • 20MB of data is sent over the network for replicating the images.

With the feature enabled:

  • 10MB + delta is written on the primary. "Delta" data goes to the oplog.
  • 10MB (+ 2*delta) of data is read from the primary. 10MB to change the document. 2*delta for replicating the "update" oplog entry to both secondaries. [1]
  • 2*delta data is sent over the network for replicating.

I appreciate the question and the opportunity it affords me to communicate more thoroughly how this works. I hope this information helps you better evaluate whether the behavior change will result in a tangible gain for your workload.

[1] There's some more complexity behind "how much data is 'read' for replicating". What's described above is the best case for the feature being "disabled". The numbers listed are correct for how many bytes we have to read from the storage engine cache and write back over the network. Those numbers are also ideal in that the write to the primary is immediately followed by replication reading the new oplog entries; the data read from the cache is "warm" because it was just written. In more problematic scenarios, secondaries lag enough for the storage engine to evict the large "noop" images in the oplog from cache. Thus when a secondary does request those oplog entries, there's an additional "cold" storage read (often resulting in more cache eviction of different data).

While we don't avoid writing the findAndModify pre-image to the primary (and thus displacing some cached data), it's overwhelmingly the case that retryable writes don't need to be retried (alternatively, the cluster is generally unhealthy and the application already has a problem). Thus for common situations, we can assume the cost for reading out the image is effectively 0. Both cases (feature enabled and disabled) scale similarly if there are lots of retries.

Comment by Adrien Jarthon [ 24/Mar/22 ]

Thanks Daniel for this very valuable addition; I could have searched for a while without learning this ^^

From reading the storeFindAndModifyImagesInSideCollection doc I see: "the temporary documents are stored in the side collection". So now I'm wondering: OK, with this option the document is no longer in the oplog but in a side collection, so does this change the resources used much? Because if this collection is also persisted and replicated I wouldn't expect a positive impact. I won't spend much time trying this if it means going back to the same amount of disk writes as before. Do you have any more details about this? What does this option improve in the end?

Thanks!

Comment by Daniel Gottlieb (Inactive) [ 23/Mar/22 ]

Filling in some more details for enabling the feature. The versions listed are correct (mostly – I'm modifying them to include some bug fixes), but the feature is not enabled by default, as it includes a data format change that can in some cases inhibit downgrading (or upgrading to the next major release, but to a version that doesn't have support). The feature is slated to be enabled by default for the 6.0 release.

This feature can be turned on by setting the setParameter storeFindAndModifyImagesInSideCollection to true for all nodes in the replica set. That documentation link is for 5.1, but the invocation with setParameter is the same for those versions as well.
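Concretely, that looks like the following (run against every member; the runtime and startup forms are equivalent):

// Enable at runtime on a node (repeat for each member of the replica set):
db.adminCommand({ setParameter: 1, storeFindAndModifyImagesInSideCollection: true })

// Or at startup:
//   mongod --setParameter storeFindAndModifyImagesInSideCollection=true ...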

As I said, there's a format change, and we've formally put documenting the upgrade/downgrade procedure into motion. When downgrading (or "upgrading") to a version that doesn't support this feature, e.g. from 4.2.27 to 4.2.0 or from 4.2.27 to 4.4.0, one must:

  • Turn off the feature by setting storeFindAndModifyImagesInSideCollection to false for all nodes in the replica set. This can be done at runtime with the setParameter command or with a server restart.
  • Wait some time. If you're in a hurry and the replica set is healthy (all nodes are alive and keeping up with replication), it shouldn't be necessary to wait more than a handful of seconds. If you're not in a rush, it's easiest to give it a few minutes. Strictly speaking, we need:
    • all nodes participating in the downgrade to have applied oplog entries written after the setParameter was set to false
    • the "replica set commit point" to have surpassed the last oplog entry written out with the new data format (a rough way to check this is sketched after this list)
  • Restart the nodes with the new binary.
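A rough mongosh sketch of the runtime flip and the commit-point check (illustrative, not an official procedure; the field paths come from rs.status() output):

// Turn the feature off at runtime (repeat for each member):
db.adminCommand({ setParameter: 1, storeFindAndModifyImagesInSideCollection: false })

// Then verify the majority commit point has caught up to the newest oplog entry,
// which conservatively implies it has passed the last entry written in the new format:
const status = rs.status();
printjson(status.optimes.lastCommittedOpTime);   // majority commit point
printjson(db.getSiblingDB("local").oplog.rs.find().sort({ ts: -1 }).limit(1).next().ts);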

Failure to successfully follow the above steps should result in the following error message at startup:

[initandlisten] Caught exception during replication recovery: Location40415: BSON field 'OplogEntryBase.needsRetryImage' is an unknown field.

To resolve the situation, one must restart using the prior binary with the feature disabled and wait longer while the replica set is healthy.

I hope this information helps. Let me know if there are any further questions.

Comment by Daniel Pasette (Inactive) [ 23/Mar/22 ]

I'll admit that it's a strange workflow and doesn't give users good visibility into what happened with the issue. I've reached out to our TPMs to see if we can improve this workflow. Thanks!

Comment by Adrien Jarthon [ 23/Mar/22 ]

Definitely. As you can see in my previous comment, I had to stop using findAndModify because it was generating 3 times more oplog and disk writes when doing so (just because of that one call), since I am often doing very small updates to big documents.

Thanks for the versions! Can't you put them in "Fix Version/s" at least? Or does your Jira workflow prevent this too?

Comment by Daniel Pasette (Inactive) [ 23/Mar/22 ]

Hi Adrien, I'm glad this work will benefit you! The reason for "Gone Away" rather than "Fixed" is a quirk of our Jira reporting. We only mark tickets as "Fixed" when they have a real commit associated with them. And because the "fix" for this issue actually spans multiple work tickets/commits, there is no easy way to mark this ticket as a "Duplicate", which is what we would ordinarily do for a simpler issue.

The complete fix was backported to 4.0.27, 4.2.16, 4.4.9, and 5.0.3

Comment by Adrien Jarthon [ 22/Mar/22 ]

That is great news! In which backported versions is the fix present exactly? It would be good to specify those in the "Fix Version/s" of the tickets for other people looking (this also applies to SERVER-45653, for easier discovery). Also, quick question: why mark this as "Gone" and not "Fixed"?

Thanks, I'll give this a try and maybe I'll be able to put back my findAndModify to simplify the code again.

Comment by Judah Schvimer [ 16/Mar/22 ]

As of MongoDB 5.1, we no longer store findAndModify preimages in the oplog. As such, I'm closing this ticket as Gone Away. The fix was backported to 4.0, 4.2, 4.4, and 5.0 as well.

Comment by Asya Kamsky [ 31/Jul/20 ]

Note that SERVER-45653 is a real fix that would mitigate this issue for users with large documents who only request that a small subset of fields be returned by findAndModify.

Comment by Adrien Jarthon [ 05/Jun/20 ]

Hi, just casting my vote on this one. I had a workload with some big documents that are updated often (but only small attributes change), and I used find_and_modify to do atomic updates. When I realized I was generating more than 30GB of oplog per hour for a database whose size is around 30GB, I was quite surprised. Looking inside the oplog, I noticed these huge no-op entries carrying my entire document even though it hadn't changed, and searching online I found this ticket. So I rewrote my code to eliminate the atomic operation and replace the find_and_modify with a find; it made the code a bit more complex, but:

→ almost 3 times less GB/hour!

It would be better if I could use atomic operations though, so +1 if this can be improved at some point.

Comment by Garaudy Etienne [ 05/Jun/20 ]

back to scheduling please

Comment by Garaudy Etienne [ 04/Mar/20 ]

Hi jmikola,

This is not something we can currently address due to other competing priorities for our team.

Do keep us updated if more users flag similar issues so we can prioritize accordingly!

Comment by Esha Maharishi (Inactive) [ 14/Feb/20 ]

Note: one workaround we could do in the future is to apply the query projection so that we only record, in the pre- or post-image, the fields that were projected.

Comment by Carl Champain (Inactive) [ 09/Jan/20 ]

Hi jmikola,

Passing this ticket along to the appropriate team.

Comment by Glen Miner [ 09/Jan/20 ]

In the short term it would be extremely helpful to have a server config var to disable retryWrites; I'm worried about programmers inadvertently swamping our oplog and causing secondaries to fall into recovering.
