Determine loggability for ChangeStreamHistoryLost errors

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Query Execution

      We have multiple places in the code base from which we throw ChangeStreamHistoryLost exceptions.

      These places treat the occurrence of a ChangeStreamHistoryLost exception differently w.r.t. logging the event:

      • v1 change streams only: when a change stream is opened and a shard was removed after the change stream's open time, a ChangeStreamHistoryLost exception is thrown and nothing is logged.
      • when a transaction in a change stream is unwound and its timestamp is no longer part of the oplog, nothing is logged.
      • when a change stream is opened with a resume token or start time that has already fallen off the oplog, an error log message is emitted.

      It is questionable and inconsistent that we log the ChangeStreamHistoryLost exception in one of these places but not in the others.

      Another question is whether it should be logged at all and, if so, whether log severity "error" is appropriate. "error" is probably too high a severity, because falling off the oplog not only happens in practice but can also be easily forced by opening a change stream at a very early start time, e.g. Timestamp(1, 1). In this case the logs could be flooded with such errors, which provide no value to users.

      Using log severity "error" is especially worrying if errors are used for alerting, since that severity should be reserved for severe problems.

      Falling off the oplog can be considered severe by some users or in some use cases. Even in that case, however, we should treat all occurrences of ChangeStreamHistoryLost identically and use the same logging behavior.
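      One way to get identical behavior is to route every throw site through a single helper that owns the logging decision. The following is a minimal sketch of that idea, modelled in Python for brevity; it is not server code. The names HistoryLostSite, throw_history_lost and HISTORY_LOST_LOG_LEVEL are hypothetical and exist only for illustration, and the choice of INFO as the default severity is an assumption, not a decision made in this ticket.

```python
import logging
from enum import Enum

logger = logging.getLogger("change_stream_sketch")


class HistoryLostSite(Enum):
    """The three throw sites listed in this ticket (names are hypothetical)."""
    SHARD_REMOVED = "shard removed after change stream open time"
    TRANSACTION_UNWIND = "transaction timestamp no longer in oplog"
    RESUME_POINT = "resume token or start time fell off the oplog"


class ChangeStreamHistoryLost(Exception):
    """Stand-in for the server's ChangeStreamHistoryLost error."""

    def __init__(self, site):
        super().__init__(site.value)
        self.site = site


# Single knob for the severity question raised above.
# INFO is an assumption for this sketch, not an agreed value.
HISTORY_LOST_LOG_LEVEL = logging.INFO


def throw_history_lost(site):
    """Log once at the shared severity, then raise.

    Every throw site calls this helper, so either all sites log
    or none do, and all of them use the same severity.
    """
    logger.log(HISTORY_LOST_LOG_LEVEL, "ChangeStreamHistoryLost: %s", site.value)
    raise ChangeStreamHistoryLost(site)
```

      With this shape, changing the severity or turning the log message off entirely is a one-line change that applies to all three sites at once.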

       

      romans.kasperovics@mongodb.com also suggested a way to distinguish between the following two error cases in which ChangeStreamHistoryLost exceptions are caught and handled by the ChangeStreamCheckResumabilityStage:

      1. the change stream is opened with a resume token / start time that has already fallen off the oplog. Opening the change stream immediately fails. This is probably an expected case, and may not deserve logging.
      2. the change stream was already opened successfully and has potentially already returned results to the consumer. However, it is later processed too slowly, so that it eventually falls off the oplog. This is likely an unexpected case and potentially justifies logging.

      Currently we have no way to tell these two cases apart in our code, but it would be good to add such a capability so that we have the option to handle them differently, e.g. with different logging behavior.
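      A possible shape for such a capability is a flag that records whether the resumability check has ever succeeded for this stream. The sketch below models this in Python and is not server code. ResumabilityTracker, HistoryLostPhase and should_log are hypothetical names, timestamps are simplified to plain integers, and the logging policy in should_log is only one of the options discussed above.

```python
from enum import Enum


class HistoryLostPhase(Enum):
    ON_OPEN = "on_open"        # case 1: resume point already gone when opening
    AFTER_OPEN = "after_open"  # case 2: stream fell behind after a successful open


class ChangeStreamHistoryLost(Exception):
    """Stand-in for the server error, extended with the phase it occurred in."""

    def __init__(self, phase):
        super().__init__("resume point no longer in oplog (%s)" % phase.value)
        self.phase = phase


class ResumabilityTracker:
    """Remembers whether the resume point was ever found in the oplog."""

    def __init__(self, resume_ts):
        self.resume_ts = resume_ts          # integer stand-in for a Timestamp
        self.opened_successfully = False

    def check(self, oldest_oplog_ts):
        """Raise if the resume point is older than the oldest oplog entry."""
        if self.resume_ts < oldest_oplog_ts:
            phase = (HistoryLostPhase.AFTER_OPEN
                     if self.opened_successfully
                     else HistoryLostPhase.ON_OPEN)
            raise ChangeStreamHistoryLost(phase)
        # The resume point is still covered by the oplog.
        self.opened_successfully = True


def should_log(phase):
    """Example policy: only the unexpected case 2 is logged."""
    return phase is HistoryLostPhase.AFTER_OPEN
```

      A stream opened at a timestamp older than the oplog fails its first check with ON_OPEN, while a stream that passed at least one check and later falls behind fails with AFTER_OPEN, so the handler can pick a different logging behavior for each.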

            Assignee:
            Unassigned
            Reporter:
            Jan Steemann
            Votes:
            0
            Watchers:
            1
