  1. Core Server
  2. SERVER-98673

Track approximate cardinality of time-series metaField

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Storage Execution

      We often have trouble determining whether a cluster is sized appropriately for a time-series workload, and the answer usually comes down to the cardinality of the workload. Prospective customers sometimes struggle to identify their workload cardinality, and even sophisticated users can have trouble noticing when something goes wrong and their model no longer matches reality.

      Adding explicit tracking (even approximate) for the cardinality of the metaField would help both us and our customers to diagnose sizing and modeling issues much more quickly.

      We would likely not need to persist these estimates anywhere, since the main goal is to track the working set rather than the full historical collection data. Compact cardinality estimators such as HyperLogLog and its variants could do this efficiently.
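To illustrate why an estimator of this kind is cheap, here is a minimal in-memory HyperLogLog sketch in Python. It is not the proposed server implementation: the class name, the choice of BLAKE2b as the hash, the precision of 12, and the treatment of a metaField value as an opaque byte string are all assumptions made for the example. With 4096 one-byte registers the sketch uses about 4 KiB, and the standard error of HyperLogLog is roughly 1.04 / sqrt(4096), or about 1.6%.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog; names and parameters are hypothetical."""

    def __init__(self, precision=12):
        self.p = precision              # number of hash bits used as register index
        self.m = 1 << precision         # number of registers
        self.registers = bytearray(self.m)

    def add(self, value: bytes) -> None:
        # Assumption: the metaField value has already been serialized to bytes.
        h = int.from_bytes(hashlib.blake2b(value, digest_size=8).digest(), "big")
        width = 64 - self.p
        idx = h >> width                        # top p bits select a register
        rest = h & ((1 << width) - 1)           # remaining bits feed the rank
        rank = width - rest.bit_length() + 1    # position of the leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)        # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(50000):
        key = f"sensor-{i}".encode()
        hll.add(key)
        hll.add(key)  # repeated values do not change the estimate
    print(round(hll.estimate()))
```

Because the registers live only in memory and are rebuilt from incoming writes, nothing in this sketch needs to be persisted, which matches the working-set goal described above.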

      We will likely want to track server-global cardinality at a minimum, as this matters most for determining performance and sizing, and can offer limited insight into modeling issues. The global numbers can be reported via serverStatus and FTDC. If we determine that the memory and performance overhead of tracking this at a per-collection level is acceptable, then doing so and reporting via collStats should give finer-grained detail for workloads that use multiple collections.
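One possible way to relate the per-collection and global numbers is sketched below in Python. It relies on the fact that HyperLogLog sketches of the same precision can be merged by taking the register-wise maximum, so a global estimate can be derived from per-collection sketches at reporting time. Everything here is hypothetical: `MetaCardinalityTracker`, `observe`, `report`, and the shape of the returned dictionary are illustration only and do not reflect actual serverStatus or collStats output. The sketch also assumes that the same metaField value in two different collections counts as two separate series, which is why the namespace is mixed into the hashed key.

```python
import hashlib
import math

P = 12           # precision (assumed); 4096 one-byte registers per sketch
M = 1 << P


def _update(registers: bytearray, key: bytes) -> None:
    h = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")
    width = 64 - P
    idx = h >> width
    rest = h & ((1 << width) - 1)
    rank = width - rest.bit_length() + 1
    if rank > registers[idx]:
        registers[idx] = rank


def _estimate(registers: bytearray) -> float:
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)   # linear counting for small ranges
    return raw


class MetaCardinalityTracker:
    """Hypothetical tracker: one sketch per collection, merged on report."""

    def __init__(self):
        self.per_collection = {}

    def observe(self, namespace: str, meta: bytes) -> None:
        regs = self.per_collection.setdefault(namespace, bytearray(M))
        # Assumption: series identity is (namespace, metaField value).
        _update(regs, namespace.encode() + b"\x00" + meta)

    def report(self) -> dict:
        merged = bytearray(M)
        for regs in self.per_collection.values():
            # Union of two sketches is the register-wise maximum.
            merged = bytearray(map(max, merged, regs))
        return {
            "global": round(_estimate(merged)),
            "collections": {
                ns: round(_estimate(regs))
                for ns, regs in self.per_collection.items()
            },
        }


if __name__ == "__main__":
    tracker = MetaCardinalityTracker()
    for i in range(30000):
        tracker.observe("db.a", f"s{i}".encode())
    for i in range(40000):
        tracker.observe("db.b", f"s{i}".encode())
    print(tracker.report())
```

In this arrangement the per-collection cost is one fixed-size register array per tracked collection, so the memory overhead mentioned above grows linearly with the number of collections rather than with the number of distinct metaField values.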

            Assignee:
            Unassigned
            Reporter:
            dan.larkin-york@mongodb.com Dan Larkin-York
            Votes:
            0
            Watchers:
            7

              Created:
              Updated: