[SERVER-61188] When storeImageInSideCollection=true, pre-image noop entries for collection with preImageRecordingEnabledForCollection=true are assigned wrong opTimes Created: 02/Nov/21  Updated: 29/Oct/23  Resolved: 15/Nov/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.2.0, 5.1.2, 5.0.6

Type: Bug Priority: Major - P3
Reporter: Cheahuychou Mao Assignee: Jason Chan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
related to SERVER-81080 rename OplogDeleteEntryArgs::oplogSlo... Closed
is related to DOCS-14920 5.1 Known Issues Closed
is related to SERVER-61498 Add reserveOplogSlots table from SERV... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.1, v5.0
Steps To Reproduce:

diff --git a/src/mongo/db/op_observer_impl.cpp b/src/mongo/db/op_observer_impl.cpp
index c3b07dba589..cf3595ce556 100644
--- a/src/mongo/db/op_observer_impl.cpp
+++ b/src/mongo/db/op_observer_impl.cpp
@@ -195,6 +195,7 @@ OpTimeBundle replLogUpdate(OperationContext* opCtx,
         if (storePreImageInOplogForRetryableWrite) {
             opTimes.prePostImageOpTime = oplogLink.preImageOpTime;
         }
+        LOGV2(10001, "replLogUpdate", "preImage noop"_attr = oplogLink.preImageOpTime);
     }
 
     // This case handles storing the post image for retryable findAndModify's.
@@ -206,6 +207,7 @@ OpTimeBundle replLogUpdate(OperationContext* opCtx,
         oplogLink.postImageOpTime = logOperation(opCtx, &noopEntry);
         invariant(opTimes.prePostImageOpTime.isNull());
         opTimes.prePostImageOpTime = oplogLink.postImageOpTime;
+        LOGV2(10002, "replLogUpdate", "postImage noop"_attr = oplogLink.postImageOpTime);
     }
 
     oplogEntry->setOpType(repl::OpTypeEnum::kUpdate);
@@ -215,6 +217,7 @@ OpTimeBundle replLogUpdate(OperationContext* opCtx,
     // oplogLink could have been changed to include pre/postImageOpTime by the previous no-op write.
     repl::appendOplogEntryChainInfo(opCtx, oplogEntry, &oplogLink, args.updateArgs.stmtIds);
     if (args.updateArgs.oplogSlot) {
+        LOGV2(10003, "replLogUpdate", "oplogSlot"_attr = args.updateArgs.oplogSlot);
         oplogEntry->setOpTime(*args.updateArgs.oplogSlot);
     }
     opTimes.writeOpTime = logOperation(opCtx, oplogEntry);
diff --git a/src/mongo/db/op_observer_impl_test.cpp b/src/mongo/db/op_observer_impl_test.cpp
index 653b16340b3..a2089b69357 100644
--- a/src/mongo/db/op_observer_impl_test.cpp
+++ b/src/mongo/db/op_observer_impl_test.cpp
@@ -1764,11 +1764,23 @@ TEST_F(OpObserverTest, TestFundamentalOnUpdateOutputs) {
             BSON("$set" << BSON("postImage" << true) << "$unset" << BSON("preImage" << 1));
         updateArgs.criteria = BSON("_id" << 0);
         updateArgs.storeDocOption = testCase.imageType;
-        OplogUpdateEntryArgs update(std::move(updateArgs), nss, uuid);
 
         // Phase 2: Call the code we're testing.
         WriteUnitOfWork wuow(opCtx);
         AutoGetCollection locks(opCtx, nss, LockMode::MODE_IX);
+        if (testCase.imageType != StoreDocOption::None &&
+            testCase.retryableOptions != RetryableOptions::kNotRetryable &&
+            updateArgs.storeImageInSideCollection) {
+            // We reserve two oplog slots here, expecting the greater of the two (say TS) to be used
+            // as the oplog timestamp. Tenant migrations and resharding will forge no-op image oplog
+            // entries and set the timestamp for these synthetic entries to be TS - 1.
+            auto oplogInfo = LocalOplogInfo::get(opCtx);
+            auto slots = oplogInfo->getNextOpTimes(opCtx, 2);
+            uassertStatusOK(opCtx->recoveryUnit()->setTimestamp(slots[1].getTimestamp()));
+            updateArgs.oplogSlot = slots[1];
+            LOGV2(10004, "Reserved", "slots"_attr = slots);
+        }
+        OplogUpdateEntryArgs update(std::move(updateArgs), nss, uuid);
         opObserver.onUpdate(opCtx, update);
         wuow.commit();
 

Sprint: Replication 2021-11-15, Replication 2021-11-29
Participants:

 Description   

When storeImageInSideCollection=true, we reserve two oplog slots, T-1 and T, for every findAndModify operation with a pre or post image:

  • T-1 is reserved for the potential forged noop entry for the pre or post image during tenant migration or resharding.
  • T is reserved for the update oplog entry itself.

For a collection with preImageRecordingEnabledForCollection=false, we write the update oplog entry in the reserved slot T and write the pre or post image to a config.image_collection entry. However, for a collection with preImageRecordingEnabledForCollection=true, we currently write the pre-image to a noop oplog entry in some other oplog slot T' > T, and then write the update oplog entry in the reserved slot T. That latter is invalid so we hit this invariant in the ReplClientInfo. That is:

  1. For a findAndModify operation with preImage, we end up with the following oplog entries:
    • T-1: reserved for potential forged pre-image noop oplog entry
    • T: update oplog entry
    • T': pre-image noop oplog entry
  2. For a findAndModify operation with postImage, we end up with the following oplog entries:
    • T-1: reserved for potential forged post-image noop oplog entry
    • T: update oplog entry
    • T': pre-image noop oplog entry

This invariant failure can be reproduced by applying the attached diffs (containing code taken from reserveOplogSlotsForRetryableFindAndModify()) and running the following test cases in OpObserverTest/TestFundamentalOnUpdateOutputs.

2021-11-02T06:52:46.159Z I  TEST     [thread1] UpdateTestCase {"ImageType":"PreImage","PreImageRecording":true,"ChangeStreamPreAndPostImagesEnabled":false,"RetryableOptions":"Images in side collection","ExpectedOplogEntries":2}
...
2021-11-02T06:52:46.159Z I  TEST     [thread1] Reserved {"slots":[{"ts":{"$timestamp":{"t":1635835966,"i":27}},"t":0},{"ts":{"$timestamp":{"t":1635835966,"i":28}},"t":0}]}
2021-11-02T06:52:46.159Z I  REPL     [thread1] replLogUpdate {"preImage noop":{"ts":{"$timestamp":{"t":1635835966,"i":29}},"t":0}}
2021-11-02T06:52:46.159Z I  REPL     [thread1] replLogUpdate {"oplogSlot":{"ts":{"$timestamp":{"t":1635835966,"i":28}},"t":0}}
2021-11-02T06:52:46.159Z F  ASSERT   [thread1] Invariant failure {"expr":"ot >= _lastOp","file":"src/mongo/db/repl/repl_client_info.cpp","line":64}

2021-11-02T08:05:39.296Z I  TEST     [main] UpdateTestCase {"ImageType":"PostImage","PreImageRecording":true,"ChangeStreamPreAndPostImagesEnabled":false,"RetryableOptions":"Images in side collection","ExpectedOplogEntries":2}
...
2021-11-02T08:05:39.296Z I  TEST     [main] Reserved {"slots":[{"ts":{"$timestamp":{"t":1635840339,"i":42}},"t":0},{"ts":{"$timestamp":{"t":1635840339,"i":43}},"t":0}]}
2021-11-02T08:05:39.296Z I  REPL     [main] replLogUpdate {"preImage noop":{"ts":{"$timestamp":{"t":1635840339,"i":44}},"t":0}}
2021-11-02T08:05:39.296Z I  REPL     [main] replLogUpdate {"oplogSlot":{"ts":{"$timestamp":{"t":1635840339,"i":43}},"t":0}}
2021-11-02T08:05:39.296Z F  ASSERT   [main] Invariant failure {"expr":"ot >= _lastOp","file":"src/mongo/db/repl/repl_client_info.cpp","line":64}



 Comments   
Comment by Githook User [ 07/Dec/21 ]

Author:

{'name': 'Jason Chan', 'email': 'jason.chan@mongodb.com', 'username': 'jasonjhchan'}

Message: SERVER-61498 Add reserveOplogSlots table from SERVER-61188 into architecture guide
Branch: master
https://github.com/mongodb/mongo/commit/45c5b2fcbf4273748f1acd92ef6fc65365c323b4

Comment by Githook User [ 06/Dec/21 ]

Author:

{'name': 'Jason Chan', 'email': 'jason.chan@mongodb.com', 'username': 'jasonjhchan'}

Message: SERVER-61188 Account for preImageRecordingEnabled when reserving oplog slots for new retryable findAndModify format
Branch: v5.0
https://github.com/mongodb/mongo/commit/1055538d5471decbae4cf6b7a29e3d743e1300fd

Comment by Githook User [ 30/Nov/21 ]

Author:

{'name': 'Jason Chan', 'email': 'jason.chan@mongodb.com', 'username': 'jasonjhchan'}

Message: SERVER-61188 Account for preImageRecordingEnabled when reserving oplog slots for new retryable findAndModify format

(cherry picked from commit 634a3411d8f2f8e7dc10149f7907527af8e06204)
Branch: v5.1
https://github.com/mongodb/mongo/commit/0862e4c67379ceb8b2b2778770477513bcd964c7

Comment by Githook User [ 15/Nov/21 ]

Author:

{'name': 'Jason Chan', 'email': 'jason.chan@mongodb.com', 'username': 'jasonjhchan'}

Message: SERVER-61188 Account for preImageRecordingEnabled when reserving oplog slots for new retryable findAndModify format
Branch: master
https://github.com/mongodb/mongo/commit/634a3411d8f2f8e7dc10149f7907527af8e06204

Comment by Jason Chan [ 15/Nov/21 ]

A table detailing the reserveOplogSlots semantics for combinations of retryable findAndModify and preImageRecordingEnabled:

Parameters NumSlotsReserved TS - 2 TS - 1 TS Oplog fields for entry with timestamp: TS
Update,
NeedsRetryImage=postImage,
preImageRecordingEnabled = True
3 No-op oplog entry storing the pre-image Reserved for forged no-op entry eventually used by tenant migrations/resharding Update oplog entry NeedsRetryImage: postImage, preImageOpTime: {TS - 2}
Update,
NeedsRetryImage=preImage,
preImageRecordingEnabled=True
3 No-op oplog entry storing the pre-image Reserved but will not be used Update Oplog entry preImageOpTime: {TS - 2}
Update,
NeedsRetryImage=preImage,
preImageRecordingEnabled=False
2 N/A Reserved for forged no-op entry eventually used by tenant migrations/resharding Update oplog entry NeedsRetryImage: preImage
Update,
NeedsRetryImage=postImage,
preImageRecordingEnabled=False
2 N/A Reserved for forged no-op entry eventually used by tenant migrations/resharding Update oplog entry NeedsRetryImage: postImage
Delete, NeedsRetryImage=preImage,
preImageRecordingEnabled = True
0 N/A No-op oplog entry storing the pre-image Delete oplog entry preImageOpTime: {TS - 1}
Delete, NeedsRetryImage=preImage,
preImageRecordingEnabled = False
2 N/A Reserved for forged no-op entry eventually used by tenant migrations/resharding Delete oplog entry NeedsRetryImage: preImage
Comment by Jason Chan [ 05/Nov/21 ]

Thinking about this some more and after talking with daniel.gottlieb, I think we also missed this case of storeImageInSideCollection=true and preImageRecordingEnabled=true for chunk migrations. 

In the correct scenario, when a chunk migration occurs for a retryable write oplog entry E that was written with preImageRecordingEnabled=true and storeImageInSideCollection=true (with needsRetryImage: postImage), we should have the following:
1. Source processes E and fetches the no-op pre-image entry from the oplog and batches it
2. Source forges the no-op post-image entry and batches it
3. Source batches E
4. Destination receives 3 oplog entries

However, currently the migration source processes a retryable write oplog, it will either forge a no-op entry from the needsRetryImage field, or fetch the preImage entry from the oplog, but not do both because we only ever (incorrectly) assume that we would only need to store one image in the buffer.

At a high level, the following changes need to be made:

  • chunk migration source must account for fetching the pre-image no-op entry even if it has forged a post-image no-op entry due to existing oplog field needsRetryImage: postImage
  • chunk migration destination can no longer assume that the correct pre-Image/postImage link is the lastResult because now an oplog entry can have optime links to two separate oplog entries that preceded it.

The final result will be a retryable write oplog entry on the destination with links to two different images in the oplog. max.hirschhorn daniel.gottlieb, I don't think we have much E2E test coverage in terms of retryability in this scenario. Do we know whether we expect customers to be running preImageRecordingEnabled=true? I see that it's a documented Realm feature here.

EDIT: It turns out Realm Sync doesn't support sharded clusters so chunk migrations not working with preImageRecordingEnabled=true may be expected.

Generated at Thu Feb 08 05:51:47 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.