Investigate changes in SERVER-123147: Add serverStatus metrics for pre-images sampling


    • Type: Investigation
    • Resolution: Won't Do
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Tools and Replicator

      Original Downstream Change Summary

      This change adds the following cumulative, process-wide serverStatus metrics for change streams pre-images collection sampling:

      • changeStreamPreImages.markerCreation.totalPass: total number of pre-images sampling passes. On aggregated storage clusters, pre-images sampling is expected to happen shortly after node start, on all node types of a cluster (primary and secondaries), and the pre-images are typically not resampled during the lifetime of a node. On disaggregated storage clusters, pre-images sampling happens every time a node steps up as primary, so it can happen multiple times during the lifetime of a node.
      • changeStreamPreImages.markerCreation.scannedInternalCollections: total number of user collection UUID ranges sampled in the system pre-images collection. When a sampling pass occurs, this metric typically increases by the number of user collections that have change streams pre-images enabled and contain active pre-images.
      • changeStreamPreImages.markerCreation.timeElapsedMillis: cumulative number of milliseconds spent on pre-images sampling.
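
      As a rough illustration of how these counters might be read, the sketch below pulls the three values out of a serverStatus result that has already been fetched and converted to a Python dict. The ticket does not say where the changeStreamPreImages section sits within the serverStatus output, so the top-level placement used here is an assumption, and the helper name read_marker_creation_metrics is hypothetical.

```python
from typing import Any, Mapping

# Metric names as listed in this ticket.
METRIC_NAMES = ("totalPass", "scannedInternalCollections", "timeElapsedMillis")


def read_marker_creation_metrics(server_status: Mapping[str, Any]) -> dict[str, int]:
    """Extract the pre-images sampling counters from a serverStatus document.

    Assumption: the section is reachable as
    server_status["changeStreamPreImages"]["markerCreation"]; adjust the path
    if the server nests it elsewhere. Missing counters are reported as 0,
    which is what a node without these metrics would look like here.
    """
    section = server_status.get("changeStreamPreImages", {}).get("markerCreation", {})
    return {name: int(section.get(name, 0)) for name in METRIC_NAMES}
```

      The function takes a plain mapping rather than a live connection, so it can be applied to output captured by any driver or diagnostic tool.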

      These metrics will become available with v9.0. They can be used to assess how many sampling passes have run over the documents in the change streams pre-images collection, and how much time the sampling takes. This is useful information because the sampling can compete for the same resources (CPU, IOPS) as user workloads.
      On aggregated storage clusters, change streams pre-images collection sampling is only supposed to happen at node startup. Overall, sampling should not have a large impact on performance as it happens infrequently.
      On disaggregated storage clusters, however, sampling of the documents in the change streams pre-images collection happens every time a node steps up as primary. Thus it can happen multiple times during the lifetime of a node, and it is conceivable that the sampling slightly degrades the performance of other workloads shortly after the step-up completes.
      The metrics added by this SERVER ticket provide insight into the number of sampling passes and the amount of time the sampling took. They can be used to check whether sampling passes overlap with degradations in user workloads.
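
      Because the counters are cumulative, one way to check for such an overlap is to compare two snapshots taken around the window of interest, for example before and after a step-up. The sketch below is a minimal illustration under that assumption; the function name sampling_delta and the derived avgMillisPerPass field are hypothetical and not part of the server output.

```python
from typing import Mapping


def sampling_delta(before: Mapping[str, int], after: Mapping[str, int]) -> dict[str, float]:
    """Return how much pre-images sampling happened between two snapshots.

    Both arguments hold the three markerCreation counters from this ticket.
    The counters are cumulative and process-wide, so a negative difference
    most likely means the snapshots come from different process lifetimes.
    """
    passes = after["totalPass"] - before["totalPass"]
    ranges = after["scannedInternalCollections"] - before["scannedInternalCollections"]
    millis = after["timeElapsedMillis"] - before["timeElapsedMillis"]
    if passes < 0 or ranges < 0 or millis < 0:
        raise ValueError("counters went backwards; snapshots are not comparable")
    return {
        "passes": passes,
        "uuidRanges": ranges,
        "millis": millis,
        # Average cost of one pass in the window; 0.0 when no pass ran.
        "avgMillisPerPass": millis / passes if passes else 0.0,
    }
```

      A window in which passes is non-zero and that also shows degraded user workload latency is a candidate for the overlap described above; a window with passes equal to 0 rules sampling out as the cause.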

      If sampling is found to be too expensive, the number of sample points taken can be reduced by setting the server parameter changeStreamPreImagesSamplePointsPerUUID, which was introduced in SERVER-122854.

      Description of Linked Ticket

      Adds the following cumulative, process-wide serverStatus metrics for change streams pre-images collection sampling:

      • changeStreamPreImages.markerCreation.totalPass: total number of pre-images sampling passes. On aggregated storage clusters (ASC), pre-images sampling is expected to happen shortly after node start, on all node types of a cluster (primary and secondaries), and the pre-images are typically not resampled during the lifetime of a node. On disaggregated storage clusters (DSC), pre-images sampling happens every time a node steps up as primary, so it can happen multiple times during the lifetime of a node.
      • changeStreamPreImages.markerCreation.scannedInternalCollections: total number of user collection UUID ranges sampled in the system pre-images collection. When a sampling pass occurs, this metric typically increases by the number of user collections that have change streams pre-images enabled and contain active pre-images.
      • changeStreamPreImages.markerCreation.timeElapsedMillis: cumulative number of milliseconds spent on pre-images sampling.

            Assignee:
            Unassigned
            Reporter:
            Backlog - Core Eng Program Management Team
