- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Storage Execution
We often run into issues determining whether a cluster is sized appropriately for a time-series workload, and this usually comes down to the cardinality of the workload. Prospective customers sometimes struggle to identify their workload's cardinality, and even sophisticated users can have trouble noticing when something goes wrong and their model no longer matches reality.
Adding explicit tracking (even approximate) of metaField cardinality would help both us and our customers diagnose sizing and modeling issues much more quickly.
We would likely not need to persist these estimations anywhere, since the main goal is tracking the working set rather than the full historical collection data. Concise cardinality estimators such as HyperLogLog and its variants could do this efficiently.
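To make the trade-off concrete, here is a minimal HyperLogLog sketch in Python. This is purely illustrative (not MongoDB's implementation): it shows how a few kilobytes of fixed-size state (2^p one-byte registers) can estimate the number of distinct metaField values with a few percent error, which is all the sizing discussion above needs.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog cardinality sketch (illustrative only)."""

    def __init__(self, p=12):
        self.p = p                    # 2**p registers; p=12 -> ~1.6% std error
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # Hash to 64 bits (SHA-1 here for simplicity; a real implementation
        # would use a fast non-cryptographic hash).
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64-p bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        # Small-range correction: fall back to linear counting while
        # many registers are still empty.
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw
```

With p=12 the sketch occupies 4096 registers regardless of how many distinct metaField values it has seen, which is why no persistence would be needed: the structure can simply be rebuilt in memory as the working set is observed.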
At a minimum we will likely want to track server-global cardinality, as this is the most important input for performance and sizing, and it can offer limited insight into modeling issues. The global numbers can be reported via serverStatus and FTDC. If the memory and performance overhead of per-collection tracking proves acceptable, reporting those numbers via collStats would give finer-grained detail for workloads that span multiple collections.
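The two-level tracking described above could be structured as sketched below. This is a shape illustration only: the field names (`metaFieldCardinalityEstimate` and the `timeseries` section) are hypothetical, not actual serverStatus or collStats output, and a plain set stands in for the HyperLogLog sketch to keep the example short.

```python
from collections import defaultdict

class CardinalityTracker:
    """Illustrative global + per-collection metaField cardinality tracking.

    A real implementation would hold one HLL sketch globally and one per
    collection; a set is used here as a stand-in estimator.
    """

    def __init__(self):
        self._global = set()
        self._per_collection = defaultdict(set)

    def observe(self, namespace, meta_field_value):
        key = repr(meta_field_value)          # canonicalize the metaField value
        self._global.add(key)                 # feeds the server-global estimate
        self._per_collection[namespace].add(key)

    def server_status(self):
        # Hypothetical serverStatus section (also suitable for FTDC capture).
        return {"timeseries": {"metaFieldCardinalityEstimate": len(self._global)}}

    def coll_stats(self, namespace):
        # Hypothetical collStats field for per-collection detail.
        return {"metaFieldCardinalityEstimate": len(self._per_collection[namespace])}
```

One design point this makes visible: the global estimate is not the sum of the per-collection estimates, since the same metaField value appearing in two collections counts once globally, which is exactly why per-collection reporting adds diagnostic value for multi-collection workloads.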