[SERVER-61188] When storeImageInSideCollection=true, pre-image noop entries for collection with preImageRecordingEnabledForCollection=true are assigned wrong opTimes Created: 02/Nov/21 Updated: 29/Oct/23 Resolved: 15/Nov/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 5.2.0, 5.1.2, 5.0.6 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Cheahuychou Mao | Assignee: | Jason Chan |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Operating System: | ALL | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Backport Requested: |
v5.1, v5.0
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Steps To Reproduce: |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Sprint: | Replication 2021-11-15, Replication 2021-11-29 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Participants: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
When storeImageInSideCollection=true, we reserve two oplog slots, T-1 and T, for every findAndModify operation with a pre or post image:
For a collection with preImageRecordingEnabledForCollection=false, we write the update oplog entry in the reserved slot T and write the pre or post image to a config.image_collection entry. However, for a collection with preImageRecordingEnabledForCollection=true, we currently write the pre-image to a noop oplog entry in some other oplog slot T' > T, and then write the update oplog entry in the reserved slot T. That latter is invalid so we hit this invariant in the ReplClientInfo. That is:
This invariant failure can be reproduced by applying the attached diffs (containing code taken from reserveOplogSlotsForRetryableFindAndModify()) and running the following test cases in OpObserverTest/TestFundamentalOnUpdateOutputs.
|
| Comments |
| Comment by Githook User [ 07/Dec/21 ] | ||||||||||||||||||||||||||||||||||||||||||
|
Author: {'name': 'Jason Chan', 'email': 'jason.chan@mongodb.com', 'username': 'jasonjhchan'}Message: | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 06/Dec/21 ] | ||||||||||||||||||||||||||||||||||||||||||
|
Author: {'name': 'Jason Chan', 'email': 'jason.chan@mongodb.com', 'username': 'jasonjhchan'}Message: | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 30/Nov/21 ] | ||||||||||||||||||||||||||||||||||||||||||
|
Author: {'name': 'Jason Chan', 'email': 'jason.chan@mongodb.com', 'username': 'jasonjhchan'}Message: (cherry picked from commit 634a3411d8f2f8e7dc10149f7907527af8e06204) | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 15/Nov/21 ] | ||||||||||||||||||||||||||||||||||||||||||
|
Author: {'name': 'Jason Chan', 'email': 'jason.chan@mongodb.com', 'username': 'jasonjhchan'}Message: | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Jason Chan [ 15/Nov/21 ] | ||||||||||||||||||||||||||||||||||||||||||
|
A table detailing the reserveOplogSlots semantics for combinations of retryable findAndModify and preImageRecordingEnabled:
| ||||||||||||||||||||||||||||||||||||||||||
| Comment by Jason Chan [ 05/Nov/21 ] | ||||||||||||||||||||||||||||||||||||||||||
|
Thinking about this some more and after talking with daniel.gottlieb, I think we also missed this case of storeImageInSideCollection=true and preImageRecordingEnabled=true for chunk migrations. In the correct scenario, when a chunk migration occurs for a retryable write oplog entry E that was written with preImageRecordingEnabled=true and storeImageInSideCollection=true (with needsRetryImage: postImage), we should have the following: However, currently the migration source processes a retryable write oplog, it will either forge a no-op entry from the needsRetryImage field, or fetch the preImage entry from the oplog, but not do both because we only ever (incorrectly) assume that we would only need to store one image in the buffer. At a high level, the following changes need to be made:
The final result will be a retryable write oplog entry on the destination with links to two different images in the oplog. max.hirschhorn daniel.gottlieb, I don't think we have much E2E test coverage in terms of retryability in this scenario. Do we know whether we expect customers to be running preImageRecordingEnabled=true? I see that it's a documented Realm feature here. EDIT: It turns out Realm Sync doesn't support sharded clusters so chunk migrations not working with preImageRecordingEnabled=true may be expected. |