Type: Bug
Resolution: Fixed
Priority: Critical - P2
Fix Version/s: 6.0.0-rc0
Affects Version/s: None
Component/s: None
Labels: None
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: QE 2022-04-04
Linked BF Score: 120
Confidence Status: None
Work Order: 0
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

A "large" replica set transaction is one that writes multiple oplog entries. In practice, this is used to replicate more than 16 MB of oplog data without violating the 16 MB BSON limit. Consider a multi-statement transaction that performs 2 writes (updates) that will use 2 oplog entries:

Update A, $set: {x: <16MB>}
Update B, $set: {x: <16MB>}

A primary makes the following calls within a single WT transaction, which allocates oplog timestamps 1 and 2:

Write A -> {_id: "A", x: <16MB>}
Write B -> {_id: "B", x: <16MB>}
SetTimestamp 1
Write Oplog(TS(1)) -> {applyOps: [{update A, set x...}], ts: 1, partialTxn: true}
Write config.transactions {startOpTime: TS(1), lastWriteOpTime: TS(1), state: "inProgress", txnNum: 0}
SetTimestamp 2
Write Oplog(TS(2)) -> {applyOps: [{update B, set x...}], ts: 2}
Write config.transactions {lastWriteOpTime: TS(2), state: "committed"}

This sequence of writes violates the multi-timestamp constraint, which states:

If a transaction sets multiple timestamps, then
it must set a timestamp before performing any writes
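
To make the constraint concrete, here is a minimal Python sketch that checks a sequence of operations against it. This is a toy model, not server or WiredTiger code: the `("write", ...)` / `("set_timestamp", ...)` tuples and the function name are hypothetical and only mirror the call sequence listed in this ticket.

```python
from typing import List, Tuple

# Each op is ("write", target) or ("set_timestamp", ts).
# These tuples are an illustrative encoding, not a real API.
Op = Tuple[str, object]


def violates_multi_timestamp_constraint(ops: List[Op]) -> bool:
    """Return True if the transaction sets more than one distinct timestamp
    and performs a write before setting its first timestamp."""
    timestamps = {arg for kind, arg in ops if kind == "set_timestamp"}
    if len(timestamps) < 2:
        # The constraint only applies to multi-timestamp transactions.
        return False
    for kind, _ in ops:
        if kind == "set_timestamp":
            return False  # a timestamp was set before any write
        if kind == "write":
            return True  # a write happened before the first timestamp
    return False


# The large transaction from the description above.
large_txn: List[Op] = [
    ("write", "A"),
    ("write", "B"),
    ("set_timestamp", 1),
    ("write", "oplog@1"),
    ("write", "config.transactions"),
    ("set_timestamp", 2),
    ("write", "oplog@2"),
    ("write", "config.transactions"),
]

print(violates_multi_timestamp_constraint(large_txn))
```

In this model the example transaction is flagged because both data writes precede `SetTimestamp 1`.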

That violation was considered benign at the time and the invariant was relaxed. As far as I know, that sequence is still safe.

However, when the example's updates touch a collection that is recording pre-images for change streams, the above transaction contains additional writes that do not get proper timestamps assigned. Specifically, after each oplog entry, preImages are written for the documents touched in that (potentially partial) oplog entry.

In the example, we make two preImage writes that get assigned the following timestamps:

PreImage(A) @ TS(1)
PreImage(B) @ TS(2)

The transaction technically commits at TS(2), and all of the data writes (both A and B) only become visible/persist when reading at TS(2). However, when reading at TS(1), the preImage write for A already shows in the preimages table. That alone isn't sufficient to describe a data corruption scenario.

Instead, consider the case where TS(1) replicates to secondaries, but TS(2) does not. If an election splits the two oplog entries such that the new primary only has TS(1) in its history, the transaction is treated as having been rolled back. The PreImage related to the first oplog entry must not be persisted. But WT's rollback_to_stable on the old primary does not discard it (as per the contract), because the write was assigned TS(1).
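
The scenario can be sketched with a small Python model. It is a simplification under stated assumptions, not WiredTiger's implementation: writes made before any timestamp is set take the final commit timestamp (consistent with the data writes only becoming visible at TS(2)), the stable timestamp on the old primary is assumed to be 1, and `rollback_to_stable` here is a hypothetical stand-in that keeps only writes at or below the stable timestamp. All key names are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Write:
    key: str
    ts: Optional[int] = None  # None until a timestamp is assigned


def run_transaction(steps: List[Tuple[str, object]], commit_ts: int) -> List[Write]:
    """Toy model: a write takes the most recently set timestamp; writes made
    before any timestamp was set take the final commit timestamp."""
    writes: List[Write] = []
    current: Optional[int] = None
    for kind, arg in steps:
        if kind == "set_timestamp":
            current = int(arg)
        else:
            writes.append(Write(str(arg), current))
    for w in writes:
        if w.ts is None:
            w.ts = commit_ts
    return writes


def rollback_to_stable(writes: List[Write], stable_ts: int) -> List[str]:
    """Hypothetical stand-in: keep only writes at or below the stable timestamp."""
    return [w.key for w in writes if w.ts is not None and w.ts <= stable_ts]


steps = [
    ("write", "data:A"),
    ("write", "data:B"),
    ("set_timestamp", 1),
    ("write", "oplog:1"),
    ("write", "preimage:A"),
    ("write", "txn_table:inProgress"),
    ("set_timestamp", 2),
    ("write", "oplog:2"),
    ("write", "preimage:B"),
    ("write", "txn_table:committed"),
]

writes = run_transaction(steps, commit_ts=2)
survivors = rollback_to_stable(writes, stable_ts=1)
print(survivors)
```

In this model, the data writes for A and B are discarded along with everything stamped TS(2), but `preimage:A` survives next to the TS(1) oplog entry, even though the transaction as a whole was rolled back.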

Is related to: SERVER-48771 Enforce constraints on "multi-timestamp" transactions (Closed)

Assignee: Mindaugas Malinauskas
Reporter: Daniel Gottlieb (Inactive)
Participants: Daniel Gottlieb, Githook User, Matthew Russotto, Mindaugas Malinauskas
Votes: 0
Watchers: 10

Created: Mar 04 2022 07:42:45 PM UTC
Updated: Oct 29 2023 09:41:35 PM UTC
Resolved: Mar 28 2022 08:14:58 AM UTC
Confidence Status Last Update: 23/Mar/22 11:47 AM
