[SERVER-55305] Retryable write may execute more than once if primary had transitioned through rollback to stable Created: 18/Mar/21  Updated: 29/Oct/23  Resolved: 05/May/21

Status: Closed
Project: Core Server
Component/s: Replication, Sharding
Affects Version/s: 4.0.0, 4.2.0, 4.4.0
Fix Version/s: 5.0.0-rc0

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Jason Chan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Problem/Incident
Related
related to SERVER-55578 Disallow atClusterTime reads on the c... Closed
is related to SERVER-54626 Retryable writes may execute more tha... Closed
is related to SERVER-55214 Resharding txn cloner can miss config... Closed
is related to SERVER-56631 Retryable write pre-fetch phase could... Closed
is related to SERVER-56796 Support atClusterTime snapshot reads ... Backlog
is related to SERVER-47844 Update _setStableTimestampForStorage ... Closed
is related to SERVER-47845 Remove obsolete code related to stori... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4, v4.2, v4.0
Sprint: Repl 2021-04-19, Repl 2021-05-03, Repl 2021-05-17
Participants:
Linked BF Score: 70

 Description   

SessionUpdateTracker::_updateSessionInfo() is used by secondary oplog application to coalesce multiple updates to the same config.transactions record into a single update for the most recent retryable write statement. The changes from 02020fa as part of SERVER-47844 made it possible for a secondary to choose as its stable_timestamp a majority-committed timestamp from within an oplog batch rather than one always on a batch boundary. The combination of these two behaviors can lead to the following sequence:

  1. During single batch of oplog application:
    1. User data write for stmtId=0 at t=10.
    2. User data write for stmtId=1 at t=11.
    3. User data write for stmtId=2 at t=12.
    4. Session txn record write at t=12 with stmtId=2 as lastWriteOpTime.
      • In particular, no session txn record write at t=10 with stmtId=0 as lastWriteOpTime or at t=11 with stmtId=1 as lastWriteOpTime, because they were coalesced by the SessionUpdateTracker.
  2. Rollback to stable timestamp t=10.
  3. No session txn record exists with stmtId=0 as lastWriteOpTime (because that config.transactions write was entirely skipped by oplog application) despite the user data write for stmtId=0 being reflected on-disk. This allows stmtId=0 to be re-executed by this node if it becomes primary.
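
The coalescing can be illustrated with a minimal sketch (plain Python, not server code; the field names lsid, txnNumber, stmtId, and lastWriteOpTime are simplified stand-ins for the real schema):

# Minimal sketch of the coalescing described above.
oplog_batch = [
    {"ts": 10, "lsid": "session-1", "txnNumber": 5, "stmtId": 0},
    {"ts": 11, "lsid": "session-1", "txnNumber": 5, "stmtId": 1},
    {"ts": 12, "lsid": "session-1", "txnNumber": 5, "stmtId": 2},
]

# SessionUpdateTracker-style coalescing: keep only the most recent update to
# each session's config.transactions record within the batch.
coalesced = {}
for entry in oplog_batch:
    coalesced[entry["lsid"]] = {
        "txnNum": entry["txnNumber"],
        "lastWriteOpTime": entry["ts"],
        "lastStmtId": entry["stmtId"],
    }

print(coalesced)
# {'session-1': {'txnNum': 5, 'lastWriteOpTime': 12, 'lastStmtId': 2}}
# If the stable timestamp is chosen mid-batch at t=10, rollback-to-stable keeps
# the user data write at t=10, but in this scenario there is no
# config.transactions write at or before t=10 for the session, so the node
# forgets that stmtId=0 already executed.
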
Impact on 4.0, 4.2, and 4.4 branches

The stable optime candidates list prevents this issue for retryable inserts, updates, and deletes applied during secondary oplog application.

However, retryable inserts on primaries also coalesce multiple updates to the same config.transactions record into a single update for the most recent retryable write statement. This happens because OpObserverImpl::onInserts() calls TransactionParticipant::onWriteOpCompletedOnPrimary() once for a batch of insert statements (aka a vectored insert); see the sketch after the bullets below.

  • Only retryable inserts are impacted.
  • A retry attempt fails with a DuplicateKey error so long as the document wasn't deleted by another client in the meantime. (The document is re-inserted otherwise.)
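
For concreteness, here is a minimal driver-side illustration (assuming a locally running mongod; pymongo and the collection name retryable_demo are only example choices): a single insert_many is sent as one vectored insert, so the primary records only the final statement's opTime in config.transactions for the whole batch.

from pymongo import MongoClient

# Assumes a local mongod; retryable writes are on by default in recent drivers.
client = MongoClient("mongodb://localhost:27017/?retryWrites=true")
coll = client.test.retryable_demo

# One insert_many is sent as a single vectored insert: one txnNumber covering
# stmtIds 0, 1, 2. On the primary, OpObserverImpl::onInserts() invokes
# onWriteOpCompletedOnPrimary() once for the whole batch, so config.transactions
# is updated only with the last statement's opTime.
coll.insert_many([{"_id": 1}, {"_id": 2}, {"_id": 3}])
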
Impact on 4.9 and master branches

The stable optime candidates list was removed and so this issue exists for retryable inserts, updates, and deletes applied during secondary oplog application. Retryable inserts on primaries continue to coalesce multiple updates to the same config.transactions record into a single update of the most recent retryable write statement.

  • All of retryable inserts, updates, and deletes are impacted.
  • A retry attempt for an update can execute more than once (e.g. double increment a counter).

This issue was discovered while reasoning through why the atClusterTime read on config.transactions to fix SERVER-54626 was insufficient (hence SERVER-55214). Shout out to daniel.gottlieb for the assist!



 Comments   
Comment by Githook User [ 04/May/21 ]

Author:

{'name': 'Jason Chan', 'email': 'jason.chan@mongodb.com', 'username': 'jasonjhchan'}

Message: SERVER-55305 Add new step to replication rollback to restore the txns table to be consistent with the stableTimestamp
Branch: master
https://github.com/mongodb/mongo/commit/320e2d28396c250dfd69640dcf865dff50ea0b55

Comment by Githook User [ 27/Apr/21 ]

Author:

{'name': 'Jason Chan', 'email': 'jason.chan@mongodb.com', 'username': 'jasonjhchan'}

Message: SERVER-55305 Do not coalesce updates across different txnNumbers for retryableWrites
Branch: master
https://github.com/mongodb/mongo/commit/70a27d12642793fed2b05da116910799ea67c4f6

Comment by Max Hirschhorn [ 27/Mar/21 ]

Reassigning this ticket to the Replication team based on the Slack discussion on what the proposed fix would look like. The idea is to add another step to the rollback-to-stable procedure to fix up the config.transactions collection. Putting the onus on rollback avoids any performance impact on the write path on primaries and secondaries. Note that the following proposal won't address how atClusterTime reads on a primary and secondary may return different results for the config.transactions collection. But it will fully address the implications of updates to a config.transactions record being coalesced into a single write on primaries and secondaries on retryable writes' exactly-once semantics. (I have also filed SERVER-55578 to disallow atClusterTime reads on the config.transactions collection outside of resharding.)

(Before rollback) A config.transactions record with lastWriteOpTime.ts > stable_timestamp may be the result of coalescing updates to that record whose user data writes were timestamped <= stable_timestamp. Such a config.transactions record needs to be fixed up because rollback-to-stable would effectively restore it to a state from before the timestamp of the most recent user data write <= stable_timestamp.

                 ST
                 v
+--------+-------|-------+
|   A1   |   A2     A3   |
+------------------------+

Let's say A1, A2, and A3 are 3 statements performed in the same retryable write, with A1 applied in its own batch and A2 and A3 applied together in another batch. At stable_timestamp, the config.transactions record's lastWriteOpTime would be A1, because secondary oplog application did only a single write to the config.transactions record, timestamped at A3. However, the user data write for A2 is reflected in the data at stable_timestamp, and therefore the config.transactions record's lastWriteOpTime must be updated to A2 to prevent the A2 statement from being re-executed if the node becomes primary.

(After rolling back the data to stable_timestamp and before truncating the oplog) The rollback-to-stable procedure would scan forward through the oplog starting from [stable_timestamp + 1] (these are the oplog entries being rolled back):

  • (a) If the oplog entry isn't for a retryable write, then skip over it. Notably, multi-statement transactions do not coalesce any writes to the config.transactions collection, so any write to the config.transactions record from an earlier retryable write statement would already have been timestamped <= stable_timestamp.
  • (b) If the oplog entry has a null prevOpTime (these are oplog entries which begin a retryable write), then skip over it.
  • (c) If the oplog entry has prevOpTime <= stable_timestamp, then attempt to look up the oplog entry at prevOpTime:
    • (c.i) If the oplog entry is found, then restore the lastWriteOpTime, lastWriteDate, and txnNum from the {t, ts}, wall, and txnNumber of the oplog entry found via prevOpTime, respectively.
    • (c.ii) If the oplog entry isn't found because the oplog was already truncated before the stable_timestamp (by the OplogCapMaintainerThread), then restore the lastWriteOpTime and txnNum from the prevOpTime and txnNumber of the oplog entry being scanned, respectively. A value for the lastWriteDate must be faked because the true value has already been truncated from the oplog. A lastWriteOpTime value that isn't present in the node's oplog would cause it, as a primary, to return IncompleteTransactionHistory if more statements were performed for that txnNum retryable write (thereby preventing multiple executions when the node can no longer answer whether the write happened at all), while still allowing oplog application of more statements for that txnNum retryable write as a secondary (the primary generating those oplog entries must itself have had enough oplog to prove that each retryable write statement executed only once).
  • (d) If the oplog entry has prevOpTime > stable_timestamp, then skip over it. These oplog entries are safe to ignore because case (c) above would have already handled updating the config.transactions record from an earlier scanned oplog entry.
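
A sketch of that scan under simplifying assumptions (oplog entries as plain dicts scanned in increasing timestamp order, the config.transactions table as an in-memory map keyed by lsid, and a caller-supplied find_oplog_entry lookup); this is illustrative only, not the actual rollback code:

def fixup_config_transactions(rolled_back_entries, txns_table, stable_ts,
                              find_oplog_entry):
    """Scan the oplog entries being rolled back (ts > stable_ts), in increasing
    ts order, following cases (a)-(d) above. txns_table maps lsid -> the
    config.transactions record; find_oplog_entry looks an entry up by opTime and
    returns None if it has already been truncated."""
    for entry in rolled_back_entries:
        if not entry.get("isRetryableWrite"):
            continue                                   # (a) not a retryable write
        prev = entry.get("prevOpTime")
        if prev is None:
            continue                                   # (b) first statement of the write
        if prev["ts"] <= stable_ts:                    # (c) earlier statement survives rollback
            found = find_oplog_entry(prev)
            if found is not None:                      # (c.i) restore from the found entry
                txns_table[entry["lsid"]] = {
                    "lastWriteOpTime": {"t": found["t"], "ts": found["ts"]},
                    "lastWriteDate": found["wall"],
                    "txnNum": found["txnNumber"],
                }
            else:                                      # (c.ii) oplog already truncated
                txns_table[entry["lsid"]] = {
                    "lastWriteOpTime": prev,           # no longer in this node's oplog
                    "lastWriteDate": None,             # true value is gone; must be faked
                    "txnNum": entry["txnNumber"],
                }
        # (d) prev["ts"] > stable_ts: skip; case (c) already handled this session
        # when the earlier entry in the chain was scanned.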

The correctness of case (b) requires also updating secondary oplog application to not coalesce multiple updates to the same config.transactions record across different txnNumbers.

                 ST
                 v
+--------+-------|-------+
|   A1   |   A2     B1   |
+------------------------+

Let's say that instead of A3 being a third statement in the same retryable write, it is B1, the single statement of the subsequent txnNumber. At stable_timestamp, the config.transactions record's lastWriteOpTime would still be A1. The config.transactions record's lastWriteOpTime must still be updated to A2 to prevent the A2 statement from being re-executed if the node becomes primary. While a well-behaved driver wouldn't knowingly resend the txnNumber for A2 again, retryable writes are also designed to protect against multiple execution of writes from a delayed message that had gotten stuck in a network switch. Moreover, replica sets are not guaranteed to retain oplog entries from before stable_timestamp, so the rollback-to-stable procedure must be able to make the config.transactions collection correct without being able to learn, from its own oplog, the true lastWriteOpTime and txnNum for the config.transactions record at stable_timestamp.

Changing SessionUpdateTracker to not coalesce multiple updates to the same config.transactions record across different txnNumbers would mean that secondary oplog application does two writes to the config.transactions record - the first timestamped at A2 and the second timestamped at B1. There is no additional work for rollback-to-stable to do because there is already a write to the config.transactions record timestamped at A2. While patch builds as part of SERVER-55214 demonstrated a 70-80% performance regression for vectored insert when coalescing updates to the same config.transactions record was turned off completely, multiple retryable writes on the same logical session are believed to be unlikely to end up in the same oplog batch, so there shouldn't be a noticeable performance impact. It may be valuable to test the SessionUpdateTracker change first in isolation to confirm this belief before starting work on the rest of the changes to the rollback-to-stable procedure.
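
A sketch of the non-coalescing-across-txnNumbers behavior, again using simplified dict-shaped oplog entries rather than the real SessionUpdateTracker interface:

def coalesce_session_updates(oplog_batch):
    """Coalesce config.transactions updates within a batch, but flush the
    pending update whenever a session's txnNumber changes, so updates from
    different txnNumbers are never merged into one write."""
    pending = {}   # lsid -> most recent update for that session's current txnNumber
    writes = []    # config.transactions writes that will actually be applied

    for entry in oplog_batch:
        lsid = entry["lsid"]
        prev = pending.get(lsid)
        if prev is not None and prev["txnNumber"] != entry["txnNumber"]:
            writes.append(prev)        # new txnNumber: do not coalesce across it
        pending[lsid] = entry          # coalesce within the same txnNumber
    writes.extend(pending.values())
    return writes

With the A1 | A2 B1 batches above, the second batch then produces two config.transactions writes (one at A2 and one at B1), so rollback-to-stable at ST already finds a write timestamped at A2 and has nothing extra to do.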
