[SERVER-69948] Prevent entries with outdated txnNum entries from creating config.image_collection documents Created: 23/Sep/22  Updated: 29/Oct/23  Resolved: 19/Oct/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.2.0-rc0

Type: Bug Priority: Major - P3
Reporter: Christopher Caplinger Assignee: Christopher Caplinger
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Server Serverless 2022-10-17, Server Serverless 2022-10-31
Participants:
Linked BF Score: 0

 Description   

We can re-enable hash checking for this collection after PM-2981. TBD, based on how we decide to resolve the issue.



 Comments   
Comment by Githook User [ 18/Oct/22 ]

Author: Christopher Caplinger <christopher.caplinger@mongodb.com> (username: UnicodeSnowman)

Message: SERVER-69948: Strip needsRetryImage from findAndModify oplog entries
Branch: master
https://github.com/mongodb/mongo/commit/6d866484e8245ae887ce4d870a132568a7132fa5

Comment by Daniel Gottlieb (Inactive) [ 27/Sep/22 ]

I don't know this well enough to confidently claim option two "would just work", but I prefer that approach. I imagine it'd be a simple thing to implement and throw up a patch build and see what signal we get from it.

Edit: I should refresh webpages before adding a post-lunch comment.

Comment by Jason Chan [ 27/Sep/22 ]

The second path seems reasonable to me, and the server change itself shouldn't be too hard. The idea is to modify DocumentSourceFindAndModifyImageLookup so that, instead of returning a no-op when we fail to look up the corresponding image entry in the donor replica set, we transform the document by stripping the needsRetryImage field. Testing should hopefully be straightforward as well, using unit tests, some of which already exist.
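A minimal sketch of the transformation described above, written in Python for illustration only (the real stage is C++ server code). The function name, the dict-shaped oplog entry, and the lookup_image callback are all hypothetical; only the needsRetryImage field name comes from this ticket. The path where the image is found is deliberately left untouched here.

```python
from typing import Any, Callable, Dict, Optional

OplogEntry = Dict[str, Any]


def strip_stale_retry_image(
    entry: OplogEntry,
    lookup_image: Callable[[OplogEntry], Optional[Dict[str, Any]]],
) -> OplogEntry:
    """Return the entry to forward to the recipient.

    `lookup_image` is a hypothetical stand-in for the image lookup on the
    donor; it returns None when no matching image exists (for example,
    because the entry's txnNumber is stale).
    """
    if "needsRetryImage" not in entry:
        # Nothing to resolve: forward the entry unchanged.
        return entry

    if lookup_image(entry) is None:
        # Lookup failed: forward a copy without needsRetryImage so that
        # applying it cannot create an image_collection document.
        return {k: v for k, v in entry.items() if k != "needsRetryImage"}

    # Image found: handling of this case is outside the scope of the sketch.
    return entry
```

For example, calling strip_stale_retry_image(entry, lambda e: None) on an entry that carries needsRetryImage returns a copy without that field and leaves the original dict unmodified.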

I think, for completeness, we should also consider adding a jstest so we can verify that no image entries get generated on the recipient replica set. This will be harder to write, as we would need to synchronize the user writes with the reads from the tenant oplog fetcher on the donor so that the txnNumber processed by the fetcher at the time becomes stale.

Comment by Christopher Caplinger [ 27/Sep/22 ]

Spoke with didier.nadeau@mongodb.com and suganthi.mani@mongodb.com about this yesterday and it seems like we have a couple of options here:

  • Update our test fixture(s) dbhash check logic to avoid failure for config.image_collection hash discrepancies during tenant migration passthrough suites
  • Strip the “needsRetryImage” field from old retryable-write oplog entries in a given session to avoid unnecessarily generating image_collection entries

Note that the second option will not only fix the (admittedly rare) test failure, but will also resolve the underlying issue and prevent future confusion if/when this happens in a production environment. The consensus on the serverless team is to go with the second option above. I'm not sure how much effort it will involve, and actually doing the work will likely raise some more specific scheduling concerns.
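For comparison, the first option (relaxing the dbhash check) might look roughly like the sketch below. It is in Python for illustration and is not the actual test fixture code: the function name and the namespace-to-hash dict shape are assumptions; only the config.image_collection namespace comes from this ticket.

```python
from typing import Dict, Iterable, List


def mismatched_collections(
    donor_hashes: Dict[str, str],
    recipient_hashes: Dict[str, str],
    skip: Iterable[str] = ("config.image_collection",),
) -> List[str]:
    """List namespaces whose hashes differ, ignoring skipped namespaces.

    Both arguments are assumed to map a namespace to its collection hash.
    A namespace present on only one side counts as a mismatch.
    """
    skipped = set(skip)
    names = (set(donor_hashes) | set(recipient_hashes)) - skipped
    return sorted(
        name
        for name in names
        if donor_hashes.get(name) != recipient_hashes.get(name)
    )
```

A check built this way would pass when the only discrepancy is in config.image_collection, which hides the symptom in tests but leaves the extra image documents in place.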

cc jason.chan@mongodb.com and daniel.gottlieb@mongodb.com for thoughts/opinions on a fix for this since you guys have some context.

Comment by Steven Vannelli [ 26/Sep/22 ]

Keeping this in Needs Scheduling until Chris and suganthi.mani@mongodb.com talk about the solution. 

Generated at Thu Feb 08 06:14:52 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.