Disagg OplogProvider crashes on reseek when oplog truncation removes its last read position

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Replication
    • ALL
    • 200

      Bug

      The disagg OplogProvider ships oplog entries to the SLS log server. After a yield or on step-up it rebuilds its cursor and seeks back to its last read position via seekExact(lastRecordIdRead), asserting the record is found (oplog_provider.cpp#L296).

      Oplog truncation runs independently of the provider's read position. If truncation removes the entry at lastRecordIdRead, seekExact returns none, the assertion fails, and the node crashes.
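
The failure mode can be reproduced in miniature. The following is a minimal sketch, not the server code: `ToyOplog`, `truncateThrough`, `canResume`, and `resumeAsserting` are hypothetical names, and a `std::map` stands in for the oplog record store. It only assumes what the description states, namely that the reseek is an exact-match lookup whose result is asserted.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for a record id in the oplog.
using RecordId = std::int64_t;

// Toy oplog: an ordered map from record id to entry payload.
class ToyOplog {
public:
    void append(RecordId id, std::string entry) {
        _records.emplace(id, std::move(entry));
    }

    // Removes every record with id <= upTo. Like the truncation described
    // above, it does not consult any reader's position.
    void truncateThrough(RecordId upTo) {
        _records.erase(_records.begin(), _records.upper_bound(upTo));
    }

    // Models an exact-match seek: returns the entry only if that id still exists.
    std::optional<std::string> seekExact(RecordId id) const {
        auto it = _records.find(id);
        if (it == _records.end()) {
            return std::nullopt;
        }
        return it->second;
    }

private:
    std::map<RecordId, std::string> _records;
};

// Non-fatal check: would a reseek to lastRecordIdRead find the record?
bool canResume(const ToyOplog& oplog, RecordId lastRecordIdRead) {
    return oplog.seekExact(lastRecordIdRead).has_value();
}

// Models the behavior reported in this ticket: the reseek asserts that the
// record exists, so a truncated resume point aborts the process.
std::string resumeAsserting(const ToyOplog& oplog, RecordId lastRecordIdRead) {
    auto record = oplog.seekExact(lastRecordIdRead);
    assert(record.has_value());
    return *record;
}
```

With records 1, 2, 3 appended and a last read position of 2, `canResume` is true until `truncateThrough(2)` runs; after that, `resumeAsserting(oplog, 2)` would abort, which corresponds to the crash described here.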

      Fix

      The likely direction is to prevent truncation from advancing past the resume point, similar to SERVER-128312 (which bounded truncation via computeTruncationBound()).
      A few things to work out:

      • What the resume point should anchor on. The provider's local read cursor seems insufficient: on step-up the resume LSN comes from the remote log server (log_server_manager.cpp#L1302-L1309) and is not checked against the local oplog floor. The bound may therefore need to anchor on the SLS last-written LSN and hold on every electable node.
      • Whether a pin alone is enough, or whether step-up also needs to handle a missing resume point gracefully (refuse step-up or resync) rather than crash-looping. This probably depends on whether such a bound can be guaranteed on a stepping-up node.
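
To make the two open questions concrete, here is a rough sketch of both halves under stated assumptions. `clampTruncationToPin`, `checkResumePoint`, and `StepUpDecision` are hypothetical names invented for illustration. They are not the API from SERVER-128312, and the sketch assumes that record ids are ordered integers and that the resume point itself must survive truncation.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for a record id / LSN in the oplog.
using RecordId = std::int64_t;

// Pin half: given the highest id truncation would like to remove (inclusive)
// and an optional pinned resume point, return the highest id truncation may
// actually remove. The pinned record itself is kept so an exact-match reseek
// can still find it.
RecordId clampTruncationToPin(RecordId desiredTruncateThrough,
                              std::optional<RecordId> resumePin) {
    if (!resumePin) {
        return desiredTruncateThrough;  // No reader to protect.
    }
    return std::min(desiredTruncateThrough, *resumePin - 1);
}

// Graceful-handling half: possible outcomes when a stepping-up node compares
// the resume point it was given with its own oplog floor.
enum class StepUpDecision {
    kResume,  // Resume point is still present locally.
    kRefuse,  // Resume point was truncated; refuse step-up (or resync).
};

// Checks the resume point against the oldest record still in the local oplog
// instead of asserting that the reseek succeeds.
StepUpDecision checkResumePoint(RecordId resumePoint, RecordId oldestLocalRecord) {
    return resumePoint >= oldestLocalRecord ? StepUpDecision::kResume
                                            : StepUpDecision::kRefuse;
}
```

If the pin can be guaranteed on every electable node, `checkResumePoint` should never return `kRefuse` and serves only as a defensive check. If it cannot, the refuse path is what keeps a node from crash-looping on step-up.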

            Assignee:
            Unassigned
            Reporter:
            Shin Yee Tan