[SERVER-64225] Change stream preImage writes on a primary interleave with large replica set transaction oplog entries Created: 04/Mar/22  Updated: 29/Oct/23  Resolved: 28/Mar/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.0.0-rc0

Type: Bug Priority: Critical - P2
Reporter: Daniel Gottlieb (Inactive) Assignee: Mindaugas Malinauskas
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-48771 Enforce constraints on "multi-timesta... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: QE 2022-04-04
Participants:
Linked BF Score: 120

 Description   

A "large" replica set transaction is one that writes multiple oplog entries. In practice this is used to replicate more than 16 MB of oplog data without violating the 16 MB BSON limit. Consider a multi-statement transaction that performs 2 writes (updates) that will use 2 oplog entries:

  • Update A, $set: {x: <16MB>}
  • Update B, $set: {x: <16MB>}
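
To make the split concrete, here is a minimal sketch of how operations could be grouped into several oplog entries so that no entry exceeds the size limit. It assumes a simple greedy packing by byte size; the function name and the packing policy are hypothetical and are not taken from the server code.

```python
from typing import List

BSON_LIMIT = 16 * 1024 * 1024  # 16 MB BSON document size limit


def pack_into_oplog_entries(op_sizes: List[int], limit: int = BSON_LIMIT) -> List[List[int]]:
    """Greedily group operation indexes so each group's total size fits in `limit`.

    Hypothetical illustration only: the real server may use a different policy.
    """
    entries: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for idx, size in enumerate(op_sizes):
        if size > limit:
            raise ValueError(f"operation {idx} alone exceeds the limit")
        # Start a new entry when adding this operation would overflow the current one.
        if current and current_size + size > limit:
            entries.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        entries.append(current)
    return entries


# Two updates that do not fit together end up in two separate entries.
two_large_updates = pack_into_oplog_entries([10, 10], limit=16)
```

With the toy limit of 16, the two updates of size 10 land in separate entries, mirroring the two oplog entries in the example.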

A primary makes the following calls within a single WT transaction, which allocates oplog timestamps 1 and 2:

  1. Write A -> {_id: "A", x: <16MB>}
  2. Write B -> {_id: "B", x: <16MB>}
  3. SetTimestamp 1
  4. Write Oplog(TS(1)) -> {applyOps: [{update A, set x...}], ts: 1, partialTxn: true}
  5. Write config.transactions {startOpTime: TS(1), lastWriteOpTime: TS(1), state: "inProgress", txnNum: 0}
  6. SetTimestamp 2
  7. Write Oplog(TS(2)) -> {applyOps: [{update B, set x...}], ts: 2}
  8. Write config.transactions {lastWriteOpTime: TS(2), state: "committed"}

This sequence of writes violates the multi-timestamp constraint, which states:

  • If a transaction sets multiple timestamps, then
  • it must set a timestamp before performing any writes
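
The constraint can be expressed as a small check over a transaction's call sequence. This is a sketch for illustration only: the operation encoding and the function name are assumptions, and it does not reflect how the server or WT actually enforces the invariant.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical encoding: ("write", None) or ("set_timestamp", <ts>).
Op = Tuple[str, Optional[int]]


def violates_multi_timestamp_constraint(ops: Iterable[Op]) -> bool:
    """Return True if the transaction sets multiple timestamps but
    performed a write before setting its first timestamp."""
    timestamps_set = set()
    wrote_before_first_timestamp = False
    for kind, value in ops:
        if kind == "set_timestamp":
            timestamps_set.add(value)
        elif kind == "write":
            if not timestamps_set:
                wrote_before_first_timestamp = True
        else:
            raise ValueError(f"unknown operation kind: {kind}")
    return len(timestamps_set) > 1 and wrote_before_first_timestamp


# The eight calls from the example: two data writes, then timestamp 1,
# oplog + config.transactions writes, then timestamp 2 and two more writes.
example_sequence = [
    ("write", None), ("write", None),
    ("set_timestamp", 1), ("write", None), ("write", None),
    ("set_timestamp", 2), ("write", None), ("write", None),
]
example_violates = violates_multi_timestamp_constraint(example_sequence)
```

The example sequence is flagged because writes A and B happen before `SetTimestamp 1` and the transaction later sets a second timestamp.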

That violation was considered benign at the time and the invariant was relaxed. As far as I know, that sequence is still safe.

However, when the example's updates touch a collection that is recording pre-images for change streams, the above transaction performs some new writes that are not assigned proper timestamps. Specifically, after each oplog entry, preImages are written for the documents touched in that (potentially partial) oplog entry.

In the example, we make two preImage writes that get assigned the following timestamps:

  • PreImage(A) @ TS(1)
  • PreImage(B) @ TS(2)

The transaction technically commits at 2, and all of the data writes (both A and B) only become visible/persistent when reading at 2. However, when reading at TS(1), the preImage write for A is already visible in the pre-images table. That alone isn't sufficient to describe a data corruption scenario.
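
The visibility difference can be illustrated with a simplified model of timestamped reads. This is a sketch based only on the description in this ticket, not WT's API: it assumes writes made before any timestamp was set take the transaction's commit timestamp, and that a read at timestamp T sees every write with a timestamp at or below T. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Write:
    key: str
    ts: Optional[int]  # None: no timestamp had been set when the write happened


def finalize(writes: List[Write], commit_ts: int) -> List[Write]:
    """Assumption: untimestamped writes take the transaction's commit timestamp."""
    return [Write(w.key, commit_ts if w.ts is None else w.ts) for w in writes]


def visible_at(writes: List[Write], read_ts: int) -> List[str]:
    """Keys visible to a reader at `read_ts` in this simplified model."""
    return sorted(w.key for w in writes if w.ts is not None and w.ts <= read_ts)


committed = finalize(
    [
        Write("A", None),
        Write("B", None),
        Write("preImage(A)", 1),
        Write("preImage(B)", 2),
    ],
    commit_ts=2,
)
seen_at_1 = visible_at(committed, 1)
seen_at_2 = visible_at(committed, 2)
```

In this model a reader at TS(1) sees preImage(A) but neither A nor B, while a reader at TS(2) sees all four writes.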

Instead, consider the case where TS(1) replicates to secondaries, but TS(2) does not. If an election splits the two oplog entries such that the new primary only has TS(1) in its history, the transaction is treated as having been rolled back. The preImage related to the first oplog entry must therefore not be persisted. But WT's rollback_to_stable on the old primary does not discard it (as per its contract), because it was assigned TS(1).
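
The rollback scenario can be sketched with a toy model in which rollback-to-stable keeps exactly the writes whose timestamp is at or below the stable timestamp. The function below is a hypothetical stand-in, not WT's rollback_to_stable API. The `after_fix` mapping models the approach named in the fix commit's message (pre-images written at the transaction commit timestamp); the actual implementation is not shown in this ticket.

```python
from typing import Dict


def rollback_to_stable(write_timestamps: Dict[str, int], stable_ts: int) -> Dict[str, int]:
    """Toy model: keep only writes with a timestamp at or below `stable_ts`."""
    return {key: ts for key, ts in write_timestamps.items() if ts <= stable_ts}


# Timestamps as described in this ticket: data writes land at the commit
# timestamp 2, while preImage(A) was assigned TS(1).
before_fix = {"A": 2, "B": 2, "preImage(A)": 1, "preImage(B)": 2}

# Assumed effect of the fix: every pre-image is written at the commit timestamp.
after_fix = {"A": 2, "B": 2, "preImage(A)": 2, "preImage(B)": 2}

# The old primary rolls back to TS(1) because TS(2) never replicated.
survivors_before = rollback_to_stable(before_fix, stable_ts=1)
survivors_after = rollback_to_stable(after_fix, stable_ts=1)
```

Before the fix, preImage(A) survives the rollback even though the transaction that produced it is gone; with all pre-images at the commit timestamp, nothing from the rolled-back transaction remains.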



 Comments   
Comment by Githook User [ 25/Mar/22 ]

Author: Mindaugas Malinauskas <mindaugas.malinauskas@mongodb.com>

Message: SERVER-64225 Write change stream pre-images at transaction commit timestamp
Branch: master
https://github.com/mongodb/mongo/commit/ad16667902611f88a442392e1607902fe130d4bc

Comment by Matthew Russotto [ 07/Mar/22 ]

cc mindaugas.malinauskas
