  1. Core Server
  2. SERVER-33359

Have RTT storage engines manage rolling back incomplete index builds.

    • Fully Compatible
    • Repl 2018-02-26

      In an RTT world, nodes must retain enough history to undo writes back to the replica set commit point. However, replication does not always communicate enough information to reliably know what the state should be at the commit point.

      Specifically, the view of a database while an index build is occurring is not sufficient to know whether the index creation had committed or was rolled back. The lifetime of an index build on a primary:

      1. Start an index build at time `S`. This writes an entry to the local catalog with a `ready: false` flag. `S` is not known to any other node.
      2. Build the index.
      3. Complete the index build at time `F`. This atomically commits the index, setting `ready: true` and writing an oplog entry with time `F`.

      An index build on a secondary:

      1. Observe an oplog entry to create an index with time `F`. Add the `ready: false` entry to the catalog.
      2. Build the index.
      3. Complete the index build and set the index to `ready: true`. A foreground index build finishes at time `F`. A background index build finishes at time `BF`.
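
      The two lifetimes above can be sketched as functions of a timestamp. This is a minimal illustration, not server code: integer timestamps, the function names, and the assumption that `BF` is later than `F` are all hypothetical. It shows that a mid-build snapshot looks the same on both node types, even though the oplog entry at `F` exists in only one of them.

```python
from typing import Optional


def primary_ready_at(t: int, s: int, f: int) -> Optional[bool]:
    """`ready` flag a primary's catalog shows at time t, or None if no entry exists."""
    if t < s:
        return None   # build has not started, no catalog entry yet
    return t >= f     # entry flips to ready: true atomically at F


def secondary_background_ready_at(t: int, f: int, bf: int) -> Optional[bool]:
    """Same view for a background build on a secondary, which starts at F."""
    if t < f:
        return None   # oplog entry not applied yet, no catalog entry
    return t >= bf    # background build completes at BF (assumed later than F)


# Hypothetical timestamps: S=3 and F=8 on the primary, BF=12 on the secondary.
primary_mid_build = primary_ready_at(5, s=3, f=8)                    # before the oplog entry
secondary_mid_build = secondary_background_ready_at(10, f=8, bf=12)  # after the oplog entry
```

      Both values are `False` (entry present, `ready: false`), yet the first snapshot predates the oplog entry and the second follows it.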

      When deciding whether an index build should be completed, storage only knows about the existence of the catalog entry and the value of its `ready` flag. In both cases, if the entry does not exist, the index should be removed. If the entry does exist with `ready: true`, it should remain.

      However, the two cases diverge when the entry exists with `ready: false`. A primary should roll the index back, because this state represents a time before it wrote out the oplog entry. A secondary must keep the index given the same inputs, because the oplog entry was written before the index build started. Losing the index would be a bug.
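
      The required outcomes described above can be tabulated to make the collision explicit. This is an illustrative sketch only: the role labels, the `"keep"`/`"remove"` strings, and the function name are assumptions introduced here, not server identifiers.

```python
from collections import defaultdict

# (role, entry_exists, ready) -> outcome required after rolling back, per the text above.
REQUIRED = {
    ("primary", False, None): "remove",
    ("secondary", False, None): "remove",
    ("primary", True, True): "keep",
    ("secondary", True, True): "keep",
    ("primary", True, False): "remove",  # state predates the oplog entry
    ("secondary", True, False): "keep",  # oplog entry was already written
}


def outcomes_by_storage_view():
    """Group the required outcomes by what storage alone can observe."""
    seen = defaultdict(set)
    for (_role, exists, ready), outcome in REQUIRED.items():
        seen[(exists, ready)].add(outcome)
    return dict(seen)


# Storage views that map to more than one required outcome.
ambiguous = [view for view, outs in outcomes_by_storage_view().items() if len(outs) > 1]
```

      Only the view `(True, False)`, an existing entry with `ready: false`, ends up in `ambiguous`.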

      This ticket is for index builds to communicate enough information to the storage engine to disambiguate the decision. The storage engine may choose to persist this information as it sees fit. It's only expected to be of value for RTT storage engines.

      Specifically, what will be communicated is whether the index to be built is a background index build being started on a secondary. Foreground index builds on a secondary will not show as "in progress" following a call to RTT: either the index is not in the catalog, or the entry exists and the index is usable.
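
      One way a storage engine might use the communicated bit is sketched here. The class, the field name `is_background_secondary`, and the function name are hypothetical; the ticket does not specify how the information is represented or persisted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexCatalogEntry:
    ready: bool
    # Hypothetical field: set when a background build is started on a secondary.
    is_background_secondary: bool = False


def keep_index_after_rtt(entry: Optional[IndexCatalogEntry]) -> bool:
    """Return True if the index should survive the rollback, False if it should be removed."""
    if entry is None:
        return False  # no catalog entry: remove the index
    if entry.ready:
        return True   # committed index: keep it
    # In-progress build: keep it only when the oplog entry preceded the build.
    return entry.is_background_secondary
```

      With the extra bit, an in-progress entry from a primary is rolled back, while an in-progress background build on a secondary is retained.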

            daniel.gottlieb@mongodb.com Daniel Gottlieb (Inactive)