[SERVER-33126] Replication commit point can include uncommitted storage transactions Created: 05/Feb/18  Updated: 29/Oct/23  Resolved: 12/Feb/18

Status: Closed
Project: Core Server
Component/s: Replication, Storage
Affects Version/s: None
Fix Version/s: 3.7.2

Type: Bug
Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive)
Assignee: Daniel Gottlieb (Inactive)
Resolution: Fixed
Votes: 0
Labels: rollback-functional
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-29213 Have KVWiredTigerEngine implement Sto... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2018-02-26
Participants:

 Description   

When a replica set has only one voting member, its commit point more or less tracks the primary's last applied optime.

However, this algorithm does not consider in-flight operations that may already have been assigned an earlier optime. In these cases, replication can advance the local "last applied" optime to T and only then learn of an operation that completes at T-1. In the constrained scenario of one voting node, this means replication can advance the commit point to T before T-1 commits in the storage layer. Put another way, replication's commit point does not respect oplog visibility.
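The race described above can be illustrated with a small toy model. This is only a sketch: the class `OplogVisibilitySketch`, its methods, and the integer optimes 8, 9 (T-1) and 10 (T) are hypothetical and do not correspond to server code. It compares a commit point that follows "last applied" with one that stops at the last optime below every in-flight operation.

```python
class OplogVisibilitySketch:
    """Toy model (hypothetical): optimes are reserved first, committed later."""

    def __init__(self):
        self.in_flight = set()   # optimes assigned but not yet committed in storage
        self.committed = set()   # optimes whose storage transaction has committed

    def reserve(self, optime):
        self.in_flight.add(optime)

    def commit(self, optime):
        self.in_flight.remove(optime)
        self.committed.add(optime)

    def last_applied(self):
        # Highest committed optime, ignoring any earlier in-flight operations.
        return max(self.committed, default=0)

    def visible_point(self):
        # Highest committed optime that is below every in-flight optime.
        floor = min(self.in_flight, default=None)
        eligible = [t for t in self.committed if floor is None or t < floor]
        return max(eligible, default=0)


oplog = OplogVisibilitySketch()
oplog.reserve(8)
oplog.commit(8)
oplog.reserve(9)    # T-1, still in flight
oplog.reserve(10)   # T
oplog.commit(10)    # T commits before T-1

naive_commit_point = oplog.last_applied()     # follows "last applied"
visible_commit_point = oplog.visible_point()  # respects oplog visibility
```

In this model the naive commit point reaches T while T-1 is still uncommitted, whereas the visibility-aware value stays at the last optime with no holes beneath it until T-1 commits.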

When there are multiple voting nodes, the replicating nodes only learn of operation T after T-1 has become visible, which prevents the commit point from advancing past T-1 while that operation is still in flight.
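As a rough illustration of why the number of voters matters, the sketch below computes a commit point as the highest optime that a majority of voting members have applied. The function name `commit_point` and the sample optimes are assumptions made for this example, not server code; the secondaries are assumed to have replicated only up to the last visible optime, 8, while the primary has already applied 10.

```python
def commit_point(last_applied_by_voter):
    """Highest optime applied by a majority of voters (hypothetical helper)."""
    ordered = sorted(last_applied_by_voter, reverse=True)
    majority = len(ordered) // 2 + 1
    # The majority-th highest value is the newest optime a majority has reached.
    return ordered[majority - 1]


# One voter: the commit point follows the primary's last applied optime.
single_voter = commit_point([10])

# Three voters: secondaries have only seen the visible prefix of the oplog,
# so the majority value cannot pass the in-flight operation.
three_voters = commit_point([10, 8, 8])
```

With a single voter nothing bounds the result except the primary's own last applied optime, which is the case this ticket is about.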

It's unclear whether this premature setting of the commit timestamp breaks any assumptions within replication. At a low level, receiving an earlier commit timestamp is documented as a possible result of heartbeats arriving out of order. However, storage is sensitive to storage transactions committing at a time before the oldest_timestamp (the replica set commit point), and in turn intentionally lags the oldest_timestamp.

If replication considers this correct behavior, this ticket would need to become a storage ticket, such that consumers of setStableTimestamp check the input against the oplog read timestamp before propagating the [stable timestamp/oldest timestamp] value to WiredTiger.
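The storage-side check proposed above might look roughly like the following sketch. `StorageEngineSketch` and its methods are hypothetical stand-ins, not the real KVStorageEngine or WiredTiger API, and the rule that the stable timestamp never moves backwards is an added assumption of this example.

```python
class StorageEngineSketch:
    """Hypothetical stand-in for a consumer of setStableTimestamp."""

    def __init__(self):
        self.oplog_read_timestamp = 0  # newest optime with no holes below it
        self.stable_timestamp = 0      # value that would be handed to storage

    def set_oplog_read_timestamp(self, ts):
        self.oplog_read_timestamp = ts

    def set_stable_timestamp(self, requested):
        # Clamp the request so it cannot race ahead of oplog visibility.
        clamped = min(requested, self.oplog_read_timestamp)
        # Assumption for this sketch: never move the stable timestamp backwards.
        self.stable_timestamp = max(self.stable_timestamp, clamped)
        return self.stable_timestamp


engine = StorageEngineSketch()
engine.set_oplog_read_timestamp(8)          # T-1 = 9 is still in flight
held_back = engine.set_stable_timestamp(10)  # replication asks for T = 10

engine.set_oplog_read_timestamp(10)         # T-1 has committed
caught_up = engine.set_stable_timestamp(10)
```

Here the first request is held back to the oplog read timestamp, and the same request succeeds once the earlier operation has committed.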



 Comments   
Comment by Githook User [ 12/Feb/18 ]

Author: Daniel Gottlieb (dgottlieb) <daniel.gottlieb@mongodb.com>

Message: SERVER-33126: Ensure the stable timestamp does not race ahead of the oplog read timestamp.

When a replica set has only one voting member (the primary), replication can communicate
a commit timestamp that is ahead of other transactions being concurrently committed.
Branch: master
https://github.com/mongodb/mongo/commit/c62f2d9f8c5934f44541bc0d5adbb475df17e98e
