-
Type: Improvement
-
Resolution: Unresolved
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: None
-
Storage Execution
When diagnosing unexpected behavior with certain workloads, we sometimes find that additional metrics on the time-series write path would help. The following have been identified as likely useful:
- Per-stripe gauges of the current open, archived, and idle buckets.
- Finer-grained counters for the reasons a bucket reopening failed, e.g. era mismatch, hash collision, or malformed bucket.
- For every retry loop, one counter that ticks on each execution of the loop and one counter that ticks only on the first execution, so that comparing the two gives the average number of retries. Bucket reopening and _id generation are key examples.
- A counter for the number of times we remove a cleared bucket from the catalog.
- "Direct write" counters, particularly the number of bucket-level operations (insert, update, delete) caused both by direct writes to system.buckets and by measurement-level updates and deletes. Separate the metrics as much as is reasonable for maximum visibility.
- A gauge for the current era span (the difference between the oldest and newest era with tracked buckets in the state registry).
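To make the retry-loop and era-span items concrete, here is a minimal sketch in Python rather than the server's own code. All names (`RetryLoopStats`, `run_with_retries`, `era_span`) are hypothetical and do not correspond to existing server identifiers. It only illustrates how the two retry counters combine into an average, and how the era span could be derived from the set of eras that still have tracked buckets.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class RetryLoopStats:
    """Hypothetical pair of counters for a single retry loop."""

    operations: int = 0  # ticks only on the first execution of the loop
    attempts: int = 0    # ticks on every execution of the loop

    def average_retries(self) -> float:
        """Average number of executions beyond the first, per operation."""
        if self.operations == 0:
            return 0.0
        return (self.attempts - self.operations) / self.operations


def run_with_retries(try_once: Callable[[], bool],
                     stats: RetryLoopStats,
                     max_attempts: int = 10) -> bool:
    """Run try_once until it succeeds or max_attempts is reached."""
    stats.operations += 1
    for _ in range(max_attempts):
        stats.attempts += 1
        if try_once():
            return True
    return False


def era_span(tracked_eras: Iterable[int]) -> int:
    """Difference between the newest and oldest era with tracked buckets."""
    eras = list(tracked_eras)
    return max(eras) - min(eras) if eras else 0
```

With both counters exposed, a reader of the metrics can compute `(attempts - operations) / operations` without the server having to track per-operation retry counts. Whether the gauge should report 0 or be omitted when no eras are tracked is a design choice this sketch does not settle.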