Investigate changes in SERVER-118737: Implement replicated table drops


    • Type: Investigation
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Tools and Replicator

      Original Downstream Change Summary

      This adds a new oplog command "dropIdent" which replicates the second phase of the two-phase drop.

      Description of Linked Ticket

      When a collection or index is dropped, we immediately remove the entry from the MDB catalog but enqueue the underlying table to be dropped later by KVDropPendingIdentReaper. This is required to support PIT reads on a dropped collection, which are allowed until the collection leaves the history window. Only once the drop timestamp has become less than the oldest_timestamp, and we have taken a checkpoint that does not include the collection, does the reaper actually call drop().

      This process means that tables are dropped at different times on different nodes, and a standby which installs a checkpoint may find either that a table it has already dropped is unexpectedly present, or that a table it has not yet dropped is missing. To fix this, we need to replicate the second phase of table drops in addition to the first.

      The new oplog entry for this should look something like:

       {
        "namespace": "2025-04-03T165859194.$cmd",
        "uuid": {
          "uuid": {
            "$uuid": "cce1a5ad-6fba-490d-834b-25059e75a7e9"
          }
        },
        "opTime": {
          "ts": {
            "$timestamp": {
              "t": 1743699539,
              "i": 1
            }
          },
          "t": 1
        },
        "o": {
          "dropTable": "collection-foo",
        },
      }
      

      A new function `void OpObserver::onTableDrop(OperationContext* opCtx, StringData ident, Timestamp ts)` should be added to OpObserver, and the OpObserverImpl implementation should write the new oplog entry.
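The new hook might look like the following self-contained sketch. `OperationContext`, `Timestamp`, and the oplog-writing body here are minimal stand-ins for illustration, not the real MongoDB types:

```cpp
#include <cassert>
#include <string>

// Minimal stand-ins for MongoDB types; the real OperationContext and
// Timestamp are far richer.
struct OperationContext {};
struct Timestamp {
    unsigned long long secs = 0;
    unsigned inc = 0;
};

// The proposed observer hook: fired when the reaper completes the second
// phase of a two-phase drop.
class OpObserver {
public:
    virtual ~OpObserver() = default;
    virtual void onTableDrop(OperationContext* opCtx,
                             const std::string& ident,
                             Timestamp ts) = 0;
};

// Illustrative implementation: records the "dropTable" payload it would
// write. The real OpObserverImpl would append a proper oplog entry.
class OpObserverImpl : public OpObserver {
public:
    std::string lastEntry;
    void onTableDrop(OperationContext*, const std::string& ident,
                     Timestamp) override {
        lastEntry = "{\"o\": {\"dropTable\": \"" + ident + "\"}}";
    }
};
```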

      KVDropPendingIdentReaper::_tryToDrop() should replace the bare call to drop() with the following sequence: begin a WriteUnitOfWork (WUOW), reserve a timestamp, call drop() with that timestamp, and, if the drop succeeds, call onTableDrop() and commit the WUOW. If the drop fails, the WUOW should be rolled back.
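The control flow above can be sketched as follows; `WriteUnitOfWork`, `dropIdent`, and `tryToDropReplicated` here are simplified stand-ins for the real reaper and storage-layer machinery:

```cpp
#include <cassert>
#include <string>

// Stand-in timestamp type.
struct Timestamp {
    unsigned long long secs = 0;
    unsigned inc = 0;
};

// Stand-in for the real RAII WriteUnitOfWork: side effects roll back
// unless commit() is called before destruction.
struct WriteUnitOfWork {
    bool committed = false;
    void commit() { committed = true; }
    ~WriteUnitOfWork() { /* roll back side effects when !committed */ }
};

// Hypothetical drop routine; returns false to simulate a failed drop.
bool dropIdent(const std::string& ident, Timestamp, bool simulateFailure) {
    return !simulateFailure;
}

// Sketch of the proposed _tryToDrop flow: open a WUOW, reserve a
// timestamp, attempt the drop, and only log the replicated "dropTable"
// entry and commit when the drop succeeds.
bool tryToDropReplicated(const std::string& ident,
                         bool simulateFailure,
                         bool& oplogEntryWritten) {
    WriteUnitOfWork wuow;
    Timestamp ts{1743699539, 1};  // stand-in for reserving a commit timestamp
    if (!dropIdent(ident, ts, simulateFailure)) {
        return false;  // WUOW destructor rolls back; no oplog entry written
    }
    oplogEntryWritten = true;  // stand-in for OpObserver::onTableDrop()
    wuow.commit();
    return true;
}
```

The key property is that the oplog entry and the drop commit or roll back together, so a failed drop never leaves a replicated "dropTable" entry behind.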

      On secondaries, applying the oplog entry will call StorageInterface::dropTable(ident, ts), which eventually calls into KVDropPendingIdentReaper. The requested table must exist, be registered as a drop-pending ident, and be eligible to be dropped. If any of these conditions does not hold, an error Status is returned, oplog application fails, and the secondary is killed.
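The precondition checks on secondaries can be sketched as below; `NodeState`, `applyDropTable`, and the pinned-snapshot check are illustrative assumptions, since the real checks consult the storage engine and the reaper's registry:

```cpp
#include <cassert>
#include <set>
#include <string>

// Stand-in for MongoDB's Status: OK or an error with a reason.
struct Status {
    bool ok;
    std::string reason;
    static Status OK() { return {true, ""}; }
    static Status error(std::string r) { return {false, std::move(r)}; }
};

// Illustrative per-node state used by the checks below.
struct NodeState {
    std::set<std::string> existingTables;
    std::set<std::string> dropPendingIdents;
    std::set<std::string> pinnedBySnapshot;  // e.g. held by a long-running read
};

// Sketch of applying a "dropTable" entry on a secondary: the table must
// exist, be registered as drop-pending, and be droppable. Any failure
// surfaces as an error Status, which makes oplog application fail and
// kills the secondary.
Status applyDropTable(NodeState& node, const std::string& ident) {
    if (!node.existingTables.count(ident))
        return Status::error("table does not exist");
    if (!node.dropPendingIdents.count(ident))
        return Status::error("ident is not registered as drop-pending");
    if (node.pinnedBySnapshot.count(ident))
        return Status::error("table cannot be dropped yet");
    node.existingTables.erase(ident);
    node.dropPendingIdents.erase(ident);
    return Status::OK();
}
```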

      The expected reason for this to fail is a long-running read which holds a snapshot for longer than it is allowed to. The current mitigation is to use a longer history window on primaries than on standbys (30m vs 5m), so that this only happens when something has gone significantly wrong rather than routinely with a lagging standby. The disagg PIT reads project is working on a better long-term solution to this problem.

      On standbys, the 24-hour delay on reaping tables should be replaced with never reaping tables except via oplog application.

      Replicated drops should only be used for disagg. StorageEngineImpl::promoteToLeader() and StorageEngineImpl::demoteFromLeader() are probably a good place to switch between the various modes. Initially KVDropPendingIdentReaper should be in ASC mode, and then switch to disagg-primary or disagg-secondary mode when those functions are called. Note that demoteFromLeader() is called on startup in disagg.
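The mode switching described above might be structured as in this sketch; the enum names and `ReaperModeController` are hypothetical, not taken from the actual codebase:

```cpp
#include <cassert>

// Hypothetical reaper modes.
enum class ReaperMode {
    kASC,              // classic behavior: reap locally once out of history
    kDisaggPrimary,    // reap and replicate the second-phase drop
    kDisaggSecondary,  // never reap locally; only apply replicated drops
};

// Sketch of how promoteToLeader()/demoteFromLeader() might flip the mode.
struct ReaperModeController {
    ReaperMode mode = ReaperMode::kASC;  // startup default
    void promoteToLeader() { mode = ReaperMode::kDisaggPrimary; }
    // demoteFromLeader() is also called on startup in disagg, so a disagg
    // node begins life in secondary mode.
    void demoteFromLeader() { mode = ReaperMode::kDisaggSecondary; }
};
```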

            Assignee:
            Michael McClimon
            Reporter:
            Backlog - Core Eng Program Management Team
            Votes:
            0
            Watchers:
            2
