[SERVER-55305] Retryable write may execute more than once if primary had transitioned through rollback to stable Created: 18/Mar/21 Updated: 29/Oct/23 Resolved: 05/May/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Sharding |
| Affects Version/s: | 4.0.0, 4.2.0, 4.4.0 |
| Fix Version/s: | 5.0.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Max Hirschhorn | Assignee: | Jason Chan |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.4, v4.2, v4.0 |
| Sprint: | Repl 2021-04-19, Repl 2021-05-03, Repl 2021-05-17 |
| Participants: |
| Linked BF Score: | 70 |
| Description |
|
SessionUpdateTracker::_updateSessionInfo() is used by secondary oplog application to coalesce multiple updates to the same config.transactions record into a single update of the most recent retryable write statement. The changes from 02020fa as part of
Impact on 4.0, 4.2, and 4.4 branches
The stable optime candidates list prevents this issue for retryable inserts, updates, and deletes applied during secondary oplog application. However, retryable inserts on primaries also coalesce multiple updates to the same config.transactions record into a single update of the most recent retryable write statement. This happens because OpObserverImpl::onInserts() calls TransactionParticipant::onWriteOpCompletedOnPrimary() once for a batch of insert statements (aka vectored insert).
Impact on 4.9 and master branches
The stable optime candidates list was removed, so this issue exists for retryable inserts, updates, and deletes applied during secondary oplog application. Retryable inserts on primaries continue to coalesce multiple updates to the same config.transactions record into a single update of the most recent retryable write statement.
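The effect of coalescing can be illustrated with a small model. This is only a sketch under stated assumptions: `Stmt`, `coalesce_batch`, `read_at`, and the integer timestamps are hypothetical names and values chosen for illustration, not MongoDB server code. It shows that when several statements of one retryable write are applied in the same batch, only the last statement produces a write to the modelled config.transactions record, so a read at a timestamp between those statements sees an older record.

```python
from typing import Dict, List, NamedTuple, Optional


class Stmt(NamedTuple):
    """One retryable-write statement (simplified, hypothetical layout)."""
    session: str
    txn_number: int
    stmt_id: int
    ts: int  # timestamp of the user data write


def coalesce_batch(batch: List[Stmt]) -> List[Stmt]:
    """Keep only the most recent statement per session within one batch."""
    latest: Dict[str, Stmt] = {}
    for stmt in batch:
        latest[stmt.session] = stmt  # later statements overwrite earlier ones
    return list(latest.values())


def read_at(history: List[Stmt], session: str, ts: int) -> Optional[Stmt]:
    """Return the session record visible at timestamp `ts`, if any."""
    visible = [s for s in history if s.session == session and s.ts <= ts]
    return max(visible, key=lambda s: s.ts) if visible else None


# Statement 0 is applied in its own batch; statements 1 and 2 share a batch.
batches = [
    [Stmt("s1", 7, 0, ts=10)],
    [Stmt("s1", 7, 1, ts=20), Stmt("s1", 7, 2, ts=30)],
]

# History of writes made to the modelled config.transactions record.
session_writes: List[Stmt] = []
for batch in batches:
    session_writes.extend(coalesce_batch(batch))

# At ts=20 the user data write for statement 1 exists, but the session
# record visible at that timestamp still points at statement 0.
record = read_at(session_writes, "s1", 20)
```

In this model a node restored to timestamp 20 would hold the data for statement 1 but no session-record evidence of it, which is the condition under which a retried statement could execute a second time.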
This issue was discovered while reasoning through why the atClusterTime read on config.transactions to fix |
| Comments |
| Comment by Githook User [ 04/May/21 ] |
|
Author: {'name': 'Jason Chan', 'email': 'jason.chan@mongodb.com', 'username': 'jasonjhchan'} Message: |
| Comment by Githook User [ 27/Apr/21 ] |
|
Author: {'name': 'Jason Chan', 'email': 'jason.chan@mongodb.com', 'username': 'jasonjhchan'} Message: |
| Comment by Max Hirschhorn [ 27/Mar/21 ] |
|
Reassigning this ticket to the Replication team based on the Slack discussion on what the proposed fix would look like. The idea is to add another step to the rollback-to-stable procedure to fix up the config.transactions collection. Putting the onus on rollback avoids any performance impact on the write path on primaries and secondaries.
Note that the following proposal won't address how atClusterTime reads on a primary and secondary may return different results for the config.transactions collection. But it will fully address the implications, for retryable writes' exactly-once semantics, of updates to a config.transactions record being coalesced into a single write on primaries and secondaries. (I have also filed
(Before rollback) A config.transactions record with a lastWriteOpTime.ts > stable_timestamp may have been coalesced with other updates to the config.transactions record where the user data write was timestamped <= stable_timestamp. Such a config.transactions record needs to be fixed up because rollback-to-stable would effectively restore the config.transactions record to a timestamp from before the timestamp of the most recent user data write <= stable_timestamp.
Let's say A1, A2, and A3 are three statements performed in the same retryable write, with A1 applied in its own batch and A2 and A3 applied together in another batch. If stable_timestamp falls at or after A2 but before A3, the config.transactions record's lastWriteOpTime at stable_timestamp would be A1. This is because secondary oplog application did only a single write to the config.transactions record for the second batch, timestamped at A3. However, the user data write for A2 is reflected at stable_timestamp, and therefore the config.transactions record's lastWriteOpTime must be updated to A2 to prevent the A2 statement from being re-executed if the node became primary.
(After rolling back the data to stable_timestamp and before truncating the oplog) The rollback-to-stable procedure would scan forward through the oplog starting from [stable_timestamp + 1] (these are the oplog entries being rolled back):
The correctness of case (b) requires also updating secondary oplog application to not coalesce multiple updates to the same config.transactions record across different txnNumbers.
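A minimal sketch of that requirement, assuming a simplified model of oplog application: updates within a batch are coalesced per (session, txnNumber) pair rather than per session. `SessionUpdate` and `coalesce_per_txn` are hypothetical names for illustration and do not reflect the actual SessionUpdateTracker interface.

```python
from typing import Dict, List, NamedTuple, Tuple


class SessionUpdate(NamedTuple):
    """A pending config.transactions update (simplified, hypothetical)."""
    session: str
    txn_number: int
    stmt_id: int
    ts: int


def coalesce_per_txn(batch: List[SessionUpdate]) -> List[SessionUpdate]:
    """Coalesce updates within a batch, but never across txnNumbers."""
    latest: Dict[Tuple[str, int], SessionUpdate] = {}
    for upd in batch:
        # Only an update with the same session AND txnNumber is replaced.
        latest[(upd.session, upd.txn_number)] = upd
    # Emit in timestamp order so the older txnNumber is written first.
    return sorted(latest.values(), key=lambda u: u.ts)


# Two statements from consecutive txnNumbers applied in the same batch.
mixed_batch = [
    SessionUpdate("s1", 7, 1, ts=20),
    SessionUpdate("s1", 8, 0, ts=30),
]
writes = coalesce_per_txn(mixed_batch)

# Two statements from the same txnNumber are still coalesced into one write.
same_txn_batch = [
    SessionUpdate("s1", 7, 1, ts=20),
    SessionUpdate("s1", 7, 2, ts=30),
]
coalesced = coalesce_per_txn(same_txn_batch)
```

With the per-txnNumber key, the mixed batch yields two timestamped writes to the session record, so a restore to a timestamp between them still finds a record for the older txnNumber.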
Suppose that, instead of A3 being a third statement in the same retryable write, the entry were actually B1, the single statement of a retryable write in the subsequent txnNumber. At stable_timestamp, the config.transactions record's lastWriteOpTime would still be A1. The config.transactions record's lastWriteOpTime must still be updated to A2 to prevent the A2 statement from being re-executed if the node became primary. While a well-behaved driver wouldn't knowingly resend the txnNumber for A2 again, retryable writes are also designed to protect against multiple execution of writes from a delayed message that had gotten stuck in a network switch.
Moreover, replica sets are not guaranteed to retain oplog entries from before stable_timestamp, so the rollback-to-stable procedure must be able to make the config.transactions collection correct without learning the true lastWriteOpTime and txnNum for the config.transactions record at stable_timestamp from its oplog entries.
Changing SessionUpdateTracker to not coalesce multiple updates to the same config.transactions record across different txnNumbers would mean that secondary oplog application did two writes to the config.transactions record: the first timestamped at A2 and the second timestamped at B1. There is no additional work for rollback-to-stable to do because there is already a write to the config.transactions record timestamped at A2. While patch builds as part of |