Core Server / SERVER-33126

Replication commit point can include uncommitted storage transactions

    • Fully Compatible
    • ALL
    • Repl 2018-02-26

      When a replica set has only one voting member, its commit point more or less tracks the primary's last applied optime.

      However, this algorithm does not consider in-flight operations, which may already have been assigned an earlier optime. In these cases, replication can advance the local "last applied" optime to T and only then learn of an operation that completes at T-1. In the constrained scenario of one voting node, this means replication can advance the commit point to T before T-1 commits in the storage layer. In other words, replication's commit point does not respect oplog visibility.
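
      The interleaving described above can be sketched as a toy model. This is only an illustration, not MongoDB code: optimes are modeled as plain integers, and `all_committed`, `in_flight`, `last_applied`, and `commit_point` are hypothetical names chosen for this example.

```python
def all_committed(in_flight, last_applied):
    """Hypothetical oplog visibility point.

    Returns the newest optime with no uncommitted operation at or before it.
    Assumes integer optimes, so "just before optime t" is t - 1.
    """
    return min(in_flight) - 1 if in_flight else last_applied


# Two writers have been assigned optimes 9 and 10; neither has committed yet.
in_flight = {9, 10}
last_applied = 8

# The writer holding optime 10 (T) finishes first, so last applied moves to T.
in_flight.discard(10)
last_applied = max(last_applied, 10)

# With a single voting member, the commit point simply follows last applied.
commit_point = last_applied

# Optime 9 (T-1) is still uncommitted in storage, so visibility lags behind.
visible = all_committed(in_flight, last_applied)
commit_point_is_ahead = commit_point > visible
```

      In this model the commit point reaches 10 while the visibility point is still 8, which is the gap the ticket describes. Once the writer holding optime 9 commits, the two values agree again.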

      When there are multiple voting nodes, the replicating nodes only learn of operation T after T-1 has become visible, which prevents the commit point from advancing past concurrent in-flight operations.

      It's unclear whether this premature setting of the commit timestamp breaks any assumptions within replication. At a low level, earlier commit timestamps are documented as a result of heartbeats arriving out of order. However, storage is sensitive to storage transactions committing at a time before the oldest_timestamp (the replica set commit point), and in turn intentionally lags the oldest_timestamp.

      If replication considers this correct behavior, this ticket would need to become a storage ticket, so that consumers of setStableTimestamp check the input against the oplog read timestamp before propagating the [stable timestamp/oldest timestamp] value to WiredTiger.
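
      A minimal sketch of the storage-side alternative, assuming integer timestamps. `clamp_stable_timestamp` is a hypothetical helper invented for this illustration; it is not the actual setStableTimestamp implementation and does not call WiredTiger.

```python
def clamp_stable_timestamp(requested, oplog_read_timestamp):
    """Return a timestamp that does not run ahead of oplog visibility.

    requested: the value replication passes in (hypothetical parameter).
    oplog_read_timestamp: newest optime with no earlier uncommitted
        operation (hypothetical parameter).
    """
    # Never propagate a value newer than what is visible in the oplog.
    return min(requested, oplog_read_timestamp)


# Replication asks for T = 10 while T-1 = 9 is still in flight (visible up to 8).
held_back = clamp_stable_timestamp(10, 8)

# Once T-1 commits, the requested value passes through unchanged.
passed_through = clamp_stable_timestamp(10, 10)
```

      The design choice here is that the storage layer, rather than replication, takes responsibility for holding the value back until earlier storage transactions have committed.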

            daniel.gottlieb@mongodb.com Daniel Gottlieb (Inactive)