  1. Core Server
  2. SERVER-89948

Refactor oplog application for observability

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 8.1.0-rc0, 8.0.0-rc8
    • Affects Version/s: None
    • Component/s: None
    • None
    • Replication
    • Fully Compatible
    • v8.0
    • Repl 2024-05-13, Repl 2024-05-27, Repl 2024-06-10
    • 200

      The oplog applier has several pain points when it comes to debuggability and observability:

      1. If an error happens in applyOplogEntryOrGroupedInsertsCommon, the error status it returns may be converted into Status::OK() when the node is in initial sync or another recovery mode. The caller of the function nevertheless logs that the operation was applied, even though the apply was unsuccessful. We should expose the conversion or otherwise avoid logging "Applied op" when the apply was unsuccessful.
      2. try/catch blocks exist at multiple layers, for instance in applyOplogEntryOrGroupedInsertsCommon and again in applyOplogBatchCommon. We should unify the way we catch errors.
      3. Related to point 2, functions in the oplog applier code path today can either return an error status or throw one. As a result, it is unclear where exactly an error came from. We should either always throw, or catch the error at the lowest level and return the status all the way up.
      4. We catch errors and convert them to Status::OK() if the node is not in secondary oplog application mode. We should log a message whenever this happens so that it is clear which failures occurred, since silently ignored failures could hide data inconsistency.

      EDIT: After triage today, we will focus this ticket on pain points #1 and #4. SERVER-78834 may refactor a lot of the code that affects #2 and #3.
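
      A minimal sketch of one way pain points #1 and #4 could be addressed, written as standalone C++ rather than against the real server code. Everything in it is a hypothetical stand-in: ApplyMode, Status, ApplyResult, finishApply, applyAndLog, the logLines buffer, and the wording of the "Ignoring oplog application error" message are all assumptions for illustration, not the server's actual types, functions, or log lines. The idea is that the conversion to OK is recorded in the result, a message is logged when it happens, and "Applied op" is logged only when the apply really succeeded.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; not the server's real types.
enum class ApplyMode { kSecondary, kInitialSync, kRecovering };

struct Status {
    bool ok;
    std::string reason;
    static Status OK() { return Status{true, ""}; }
};

// Carries the status the caller acts on, plus whether a failure was
// converted to OK, so the conversion is no longer invisible (pain point #1).
struct ApplyResult {
    Status status;
    bool errorSuppressed;
    std::string originalError;
};

// Log lines are collected in memory so the behavior can be inspected.
std::vector<std::string> logLines;

// Decide what to report for a raw apply status in the given mode.
ApplyResult finishApply(const Status& raw, ApplyMode mode, const std::string& opDesc) {
    if (raw.ok) {
        return ApplyResult{raw, false, ""};
    }
    if (mode == ApplyMode::kSecondary) {
        // In secondary mode the error is propagated unchanged.
        return ApplyResult{raw, false, raw.reason};
    }
    // Outside secondary mode the error is ignored, but it is logged
    // first so the failure stays visible (pain point #4).
    logLines.push_back("Ignoring oplog application error: " + opDesc + ": " + raw.reason);
    return ApplyResult{Status::OK(), true, raw.reason};
}

// Caller side: log "Applied op" only for a genuine success.
ApplyResult applyAndLog(const Status& raw, ApplyMode mode, const std::string& opDesc) {
    ApplyResult result = finishApply(raw, mode, opDesc);
    if (result.status.ok && !result.errorSuppressed) {
        logLines.push_back("Applied op: " + opDesc);
    }
    return result;
}
```

      With this shape, a failed apply during initial sync still yields an OK status to the caller, but it produces an "Ignoring oplog application error" line instead of an "Applied op" line, and the original error remains available in the result.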

            Assignee:
            wenbin.zhu@mongodb.com Wenbin Zhu
            Reporter:
            xuerui.fa@mongodb.com Xuerui Fa
            Votes:
            0
            Watchers:
            9
