Create dashboard to provide visibility into whether validations are working as intended and what impact they have


    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Cluster Scalability

      Description

      Validation enablement has the following goals (as defined in: Scope: Enable Resharding Validation):

      • Impact of validation on read/write latency is less than 5%.
      • No increase in the risk of critical section timeout.
      • No significant increase in cloning time.

      As part of SPM-4144, we ran performance tests to verify that those goals are met, and we added best-effort safeguards so that post-cloning validation and final collection verification are skipped if they would materially extend cloning duration or increase the risk of critical section timeout.

      However, we still need fleetwide production observability to verify that these assumptions hold in practice and to detect regressions early enough to disable validations before they become customer-visible. 

      Mitigation

      Dashboards

      A fleetwide aggregated metrics dashboard with the following charts:

      Chart | Why it helps
      Validation success rate by collection size bucket | Monitor if validation code is working fleetwide.
      Average donor clone count duration relative to cloning phase duration | Monitor if donor clone counts take a large percent of cloning duration (we do not expect it to).
      Critical section timeout percentage on clusters with and without validations enabled | Monitor if validations result in increased critical section timeout.
      P90, P99 and average read, insert, update, and delete latency on clusters with and without validations enabled | Monitor if validations result in increased CRUD latency.
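
      As a rough illustration of how the first three charts could be computed, the sketch below aggregates per-operation records in Python. The record type, its field names, and the size bucket boundaries are all hypothetical; the actual FTDC metric names from SERVER-129151 and the dashboard query language may differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ReshardingOp:
    # Hypothetical record of one resharding operation; real FTDC metric
    # names may differ.
    collection_bytes: int
    validation_succeeded: bool
    donor_clone_count_secs: float
    cloning_phase_secs: float
    validation_enabled: bool
    critical_section_timed_out: bool


# Hypothetical size buckets: (exclusive upper bound in bytes, label).
BUCKETS = [(10**9, "<1GB"), (10**11, "1GB-100GB"), (float("inf"), ">=100GB")]


def size_bucket(collection_bytes):
    """Return the label of the first bucket whose upper bound exceeds the size."""
    for upper, label in BUCKETS:
        if collection_bytes < upper:
            return label
    return BUCKETS[-1][1]


def success_rate_by_bucket(ops):
    """Fraction of validation-enabled operations whose validation succeeded, per bucket."""
    totals, succeeded = defaultdict(int), defaultdict(int)
    for op in ops:
        if not op.validation_enabled:
            continue
        bucket = size_bucket(op.collection_bytes)
        totals[bucket] += 1
        succeeded[bucket] += int(op.validation_succeeded)
    return {bucket: succeeded[bucket] / totals[bucket] for bucket in totals}


def avg_clone_count_fraction(ops):
    """Average of (donor clone count duration / cloning phase duration)."""
    fractions = [
        op.donor_clone_count_secs / op.cloning_phase_secs
        for op in ops
        if op.validation_enabled and op.cloning_phase_secs > 0
    ]
    return sum(fractions) / len(fractions) if fractions else None


def timeout_pct(ops, enabled):
    """Percentage of operations that hit a critical section timeout in one group."""
    group = [op for op in ops if op.validation_enabled == enabled]
    if not group:
        return None
    return 100.0 * sum(op.critical_section_timed_out for op in group) / len(group)
```

      Calling timeout_pct(ops, True) and timeout_pct(ops, False) on the same set of records gives the two series for the critical section timeout chart.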

      The FTDC metrics needed to build these charts should already have been added as part of SERVER-129151.

      We should also look for a reliable way to monitor latency impact, if possible. The initial proposal was a chart showing:

      P90, P99, and average read, insert, update, and delete latency on clusters with and without validations enabled

      However, Randolph raised a concern: "Not sure if this is a good metric since a latency of one cluster is not always comparable with another cluster. This means that the number can already be noisy even without considering resharding validation"
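
      One possible way to reduce the cross-cluster noise described in that concern, offered only as an assumption and not something decided in this ticket, is to compare each cluster against its own baseline and chart the relative change. The Python sketch below flags clusters whose P99 latency rose by more than the 5% goal stated above. The function names, the input shape, and the nearest-rank percentile method are illustrative choices.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list, for integer p in 1..100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def relative_change(baseline, current):
    """Fractional change from baseline to current (0.05 means +5%)."""
    return (current - baseline) / baseline


def latency_regressions(per_cluster, threshold=0.05, p=99):
    """Return {cluster_id: relative change} for clusters over the threshold.

    per_cluster maps a hypothetical cluster id to a pair
    (baseline_latencies, validation_latencies), both measured on the same
    cluster, so each cluster is only ever compared with itself.
    """
    flagged = {}
    for cluster_id, (baseline, with_validation) in per_cluster.items():
        change = relative_change(
            percentile(baseline, p), percentile(with_validation, p)
        )
        if change > threshold:
            flagged[cluster_id] = change
    return flagged
```

      Workload shifts on a single cluster can still move its latency between the two measurement windows, so this comparison reduces the noise but does not remove it.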


      If possible, create a document that shows an example dashboard with these charts to align the team.

            Assignee:
            Unassigned
            Reporter:
            Wenqin Ye