  1. Core Server
  2. SERVER-64225

Change stream preImage writes on a primary interleave with large replica set transaction oplog entries



    • Bug
    • Status: Closed
    • Critical - P2
    • Resolution: Fixed
    • None
    • 6.0.0-rc0
    • None
    • None
    • Fully Compatible
    • ALL
    • QE 2022-04-04
    • 120


      A "large" replica set transaction is one that writes multiple oplog entries. In practice this is used to replicate more than 16 MB of oplog data without violating the 16 MB BSON limit. Consider a multi-statement transaction that performs 2 writes (updates) that will use 2 oplog entries:

      • Update A, $set: {x: <16MB>}
      • Update B, $set: {x: <16MB>}
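As a rough illustration of why this transaction needs two oplog entries, the sketch below greedily packs operations into applyOps entries under a size cap. The function name, the (name, size) tuples and the 15 MiB sizes are assumptions made for illustration; real entries also carry per-entry overhead that is ignored here, and this is not the server's actual batching code.

```python
MAX_ENTRY_BYTES = 16 * 1024 * 1024  # the 16 MB BSON document limit


def split_into_oplog_entries(ops, max_bytes=MAX_ENTRY_BYTES):
    """Pack (name, size_in_bytes) operations into applyOps-style entries.

    Hypothetical model: every entry except the last is marked partialTxn,
    and timestamps are simply 1, 2, ... in entry order.
    """
    batches, current, current_size = [], [], 0
    for name, size in ops:
        # Start a new entry when adding this op would exceed the cap.
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)

    entries = []
    for i, batch in enumerate(batches):
        entry = {"applyOps": batch, "ts": i + 1}
        if i < len(batches) - 1:
            entry["partialTxn"] = True
        entries.append(entry)
    return entries


# Two updates of 15 MiB each (assumed sizes) cannot share one entry.
FIFTEEN_MIB = 15 * 1024 * 1024
entries = split_into_oplog_entries([("update A", FIFTEEN_MIB), ("update B", FIFTEEN_MIB)])
```

With these assumed sizes the result has the same shape as the two oplog entries in the example: a partialTxn entry at timestamp 1 followed by a final entry at timestamp 2.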

      A primary makes the following calls within a single WT transaction, which allocates oplog timestamps 1 and 2:

      1. Write A -> {_id: "A", x: <16MB>}
      2. Write B -> {_id: "B", x: <16MB>}
      3. SetTimestamp 1
      4. Write Oplog(TS(1)) -> {applyOps: [{update A, set x...}], ts: 1, partialTxn: true}
      5. Write config.transactions {startOpTime: TS(1), lastWriteOpTime: TS(1), state: "inProgress", txnNum: 0}
      6. SetTimestamp 2
      7. Write Oplog(TS(2)) -> {applyOps: [{update B, set x...}], ts: 2}
      8. Write config.transactions {lastWriteOpTime: TS(2), state: "committed"}

      This sequence violates the multi-timestamp constraint (writes 1 and 2 happen before the first SetTimestamp), which states:

      • If a transaction sets multiple timestamps, then it must set a timestamp before performing any writes
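The constraint can be expressed as a small check over the call sequence. This is a minimal sketch with a hypothetical encoding of the eight calls as (kind, argument) tuples; it is not how WiredTiger or the server enforces the invariant.

```python
def violates_multi_timestamp_constraint(calls):
    """Return True if a multi-timestamp transaction writes before timestamping.

    calls is a list of ("write", target) or ("set_timestamp", ts) tuples
    (a hypothetical encoding used only for this sketch).
    """
    timestamps = {arg for kind, arg in calls if kind == "set_timestamp"}
    if len(timestamps) < 2:
        return False  # the constraint only applies when multiple timestamps are set
    for kind, _ in calls:
        if kind == "set_timestamp":
            return False  # a timestamp was set before any write
        if kind == "write":
            return True  # a write happened before the first timestamp
    return False


# The eight calls from the example, in order.
observed_calls = [
    ("write", "A"),
    ("write", "B"),
    ("set_timestamp", 1),
    ("write", "oplog@1"),
    ("write", "config.transactions"),
    ("set_timestamp", 2),
    ("write", "oplog@2"),
    ("write", "config.transactions"),
]

# The same calls with SetTimestamp 1 moved to the front.
compliant_calls = [observed_calls[2]] + observed_calls[:2] + observed_calls[3:]

observed_violates = violates_multi_timestamp_constraint(observed_calls)
compliant_violates = violates_multi_timestamp_constraint(compliant_calls)
```

Here observed_violates is True and compliant_violates is False, which matches the description of the example sequence as a violation.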

      That violation was considered benign at the time and the invariant was relaxed. As far as I know, that sequence is still safe.

      However, when the example's updates touch a collection that is recording pre-images for change streams, the above transaction contains some new writes that are not assigned proper timestamps. Specifically, after each oplog entry, preImages are written for the documents touched in that (potentially partial) oplog entry.

      In the example, we make two preImage writes that get assigned the following timestamps:

      • PreImage(A) @ TS(1)
      • PreImage(B) @ TS(2)

      The transaction technically commits at TS(2), and all of the data writes (both A and B) only become visible/persist when reading at TS(2). However, when reading at TS(1), the preImage write for A already shows in the preimages table. That alone isn't sufficient to describe a data corruption scenario.
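The visibility mismatch can be modeled with a toy timestamped table. The class below is a hypothetical stand-in for a storage-engine table, not the WiredTiger API; the keys, values and table names are illustrative.

```python
class TimestampedTable:
    """Toy model of a table whose writes become visible at a given timestamp."""

    def __init__(self):
        self._writes = []  # list of (ts, key, value)

    def put(self, key, value, ts):
        self._writes.append((ts, key, value))

    def read_at(self, read_ts):
        """Return the key/value view containing only writes with ts <= read_ts."""
        snapshot = {}
        for ts, key, value in sorted(self._writes, key=lambda w: w[0]):
            if ts <= read_ts:
                snapshot[key] = value
        return snapshot


collection = TimestampedTable()
preimages = TimestampedTable()

# Data writes only become visible at the commit timestamp, TS(2).
collection.put("A", "new A", ts=2)
collection.put("B", "new B", ts=2)

# PreImage writes are stamped per oplog entry: TS(1) and TS(2).
preimages.put("A", "old A", ts=1)
preimages.put("B", "old B", ts=2)

data_at_1 = collection.read_at(1)
preimages_at_1 = preimages.read_at(1)
```

In this model, a read at TS(1) sees no data writes but does see the pre-image for A, while a read at TS(2) sees everything.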

      Instead, consider the case where TS(1) replicates to secondaries, but TS(2) does not. If an election splits the two oplog entries such that the new primary only has TS(1) in its history, the transaction is treated as having been rolled back. The PreImage written for the first oplog entry must therefore not persist. However, WT's rollback_to_stable on the old primary does not discard it (as per its contract), because it was assigned TS(1).
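The rollback outcome can be sketched as follows, assuming the old primary's stable timestamp is TS(1) at the time of rollback and modeling rollback_to_stable as simply dropping writes newer than the stable timestamp. The function and record layout are hypothetical, not the WiredTiger API.

```python
def rollback_to_stable(writes, stable_ts):
    """Toy model: keep only writes whose timestamp is <= the stable timestamp."""
    return [w for w in writes if w["ts"] <= stable_ts]


# Writes made by the example transaction on the old primary.
writes = [
    {"table": "collection", "key": "A", "ts": 2},
    {"table": "collection", "key": "B", "ts": 2},
    {"table": "preimages", "key": "A", "ts": 1},
    {"table": "preimages", "key": "B", "ts": 2},
]

# Assumption for this sketch: the stable timestamp is TS(1).
survivors = rollback_to_stable(writes, stable_ts=1)
leaked = [(w["table"], w["key"]) for w in survivors]
```

Under these assumptions, the only surviving write is the pre-image for A, even though the transaction that produced it is treated as rolled back.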


              mindaugas.malinauskas@mongodb.com Mindaugas Malinauskas
              daniel.gottlieb@mongodb.com Daniel Gottlieb