[SERVER-65093] Count operation failures in serverStatus broken down by error code Created: 30/Mar/22  Updated: 30/Jan/24

Status: Open
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: David Storch Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 0
Labels: query-product-scope-1, query-product-urgency-2, query-product-value-2
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-73524 Report a histogram of error codes rat... Open
is related to SERVER-67699 Add tracking for when change stream e... Closed
Assigned Teams:
Query Execution
Participants:

 Description   

I'm filing this based on my understanding of a proposal from joe.sack. We currently count how many times each command runs in total, and of these runs how many fail. However, there is no indication in serverStatus of the cause of failure. There is interest from query product management around knowing what the most frequent errors are. For instance, this came up in the context of tracking how many find or aggregate commands fail because their memory budget was exhausted and spilling to disk was disabled. I could also imagine interest in tracking transient or retryable errors which can require client-side retries. Or I could imagine having a bug which causes correct queries to fail spuriously with some internal unnamed error code, and having this data would allow us to assess the prevalence of the issue in Atlas.

I can imagine two different ways of displaying this data. Option one would be to report counts of error codes across all commands:

errorCodes: {
    DuplicateKey: 123,
    Location4567800: 456,
    ...
}

Alternatively, we could take up more space and present a more granular view where this data is presented on a per-command basis:

MongoDB Enterprise > db.serverStatus().metrics.commands
...
	"find" : {
		"failed" : 7,
		"errorCodes" : {
			"DuplicateKey" : 3,
			"Location4567800" : 4
		},
		"total" : NumberLong(100)
	},
...
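
To make the two shapes concrete, here is a minimal Python sketch of a recorder that could back either view. It is an illustration only, not server code: the class `CommandErrorCounters` and its method names are hypothetical, and the error code names are just the ones used in the examples above.

```python
from collections import Counter, defaultdict


class CommandErrorCounters:
    """Hypothetical in-memory model of the proposed counters (not server code)."""

    def __init__(self):
        self.total = Counter()                   # command name -> runs
        self.failed = Counter()                  # command name -> failed runs
        self.error_codes = defaultdict(Counter)  # command name -> code name -> count

    def record(self, command, error_code=None):
        """Record one command run; error_code is None on success."""
        self.total[command] += 1
        if error_code is not None:
            self.failed[command] += 1
            self.error_codes[command][error_code] += 1

    def per_command_view(self):
        """Option two: error code counts nested under each command."""
        return {
            cmd: {
                "failed": self.failed[cmd],
                "errorCodes": dict(self.error_codes[cmd]),
                "total": self.total[cmd],
            }
            for cmd in self.total
        }

    def global_view(self):
        """Option one: error code counts summed across all commands."""
        summed = Counter()
        for codes in self.error_codes.values():
            summed.update(codes)
        return dict(summed)


counters = CommandErrorCounters()
counters.record("find")                          # successful run
counters.record("find", "DuplicateKey")
counters.record("find", "Location4567800")
counters.record("aggregate", "Location4567800")
```

One point the sketch makes visible: the global view can always be derived from the per-command view by summing, but not the other way around, so the choice is mostly about how much space the extra granularity is worth.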



 Comments   
Comment by Bruce Lucas (Inactive) [ 31/Mar/22 ]

A couple of considerations from an FTDC perspective, particularly for the second proposal:

  • How many new metrics does this introduce? I suspect it's not enormous as the number of different commands that fail times the number of ways they fail is probably not huge, but still we should confirm that.
  • Do we emit all possible error codes with a count of 0 in each sample, or do we emit only the error codes with a count > 0? The former will be a lot more metrics, but the latter will introduce schema changes that hinder FTDC compression. Again I suspect the number of schema changes will not be large if the number of metrics is not large, but we should confirm.

Alternatively, if this is only needed in Atlas pings and not in FTDC, then the issue can be avoided entirely. We should think about how much diagnostic value this would have for FTDC.
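
The trade-off in the second bullet can be sketched in a few lines of Python. This is a toy model, not FTDC itself: it only assumes that a "schema change" means the set of emitted metric names differs from the previous sample. The helper names, the list `KNOWN_CODES`, and the sample history are all made up for illustration.

```python
def sparse_sample(counts):
    """Emit only error codes with a count > 0."""
    return {code: n for code, n in counts.items() if n > 0}


def dense_sample(counts, known_codes):
    """Emit every known error code, using 0 for codes not seen yet."""
    return {code: counts.get(code, 0) for code in known_codes}


def schema_changes(samples):
    """Count samples whose set of metric names differs from the previous one."""
    changes = 0
    previous_keys = None
    for sample in samples:
        keys = frozenset(sample)
        if previous_keys is not None and keys != previous_keys:
            changes += 1
        previous_keys = keys
    return changes


# Illustrative set of codes and cumulative counts at four successive samples.
KNOWN_CODES = ["DuplicateKey", "Location4567800", "MaxTimeMSExpired"]
history = [
    {},
    {"DuplicateKey": 1},
    {"DuplicateKey": 1},
    {"DuplicateKey": 2, "Location4567800": 1},
]

sparse = [sparse_sample(c) for c in history]
dense = [dense_sample(c, KNOWN_CODES) for c in history]

print("sparse schema changes:", schema_changes(sparse))
print("dense schema changes:", schema_changes(dense))
print("dense metrics per sample:", len(dense[0]))
```

In this model the sparse form changes schema each time a new error code first appears, while the dense form never changes schema but pays for every known code in every sample.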

Comment by Joe Sack (Inactive) [ 30/Mar/22 ]

Thank you for articulating this, David. This is aligned with what I was
raising: essentially a heat map by error type that can help us prioritize
specific pain points and also measure the impact of new features (baseline
vs. post-feature rollout).
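
As a rough illustration of the baseline vs. post-rollout comparison, the Python sketch below turns two snapshots of the proposed per-command counters into per-error-code failure rates and reports the change. The function names are hypothetical and the numbers are invented for the example; the input shape assumes the per-command layout proposed in the description.

```python
def failure_rates(command_metrics):
    """Return each error code's share of total runs for one command."""
    total = command_metrics["total"]
    if total == 0:
        return {}
    return {code: n / total for code, n in command_metrics["errorCodes"].items()}


def rate_change(baseline, post_rollout):
    """Per-code difference in failure rate (post-rollout minus baseline)."""
    codes = set(baseline) | set(post_rollout)
    return {
        code: post_rollout.get(code, 0.0) - baseline.get(code, 0.0)
        for code in sorted(codes)
    }


# Invented counter snapshots for the "find" command.
baseline_find = {
    "failed": 7,
    "errorCodes": {"DuplicateKey": 3, "Location4567800": 4},
    "total": 100,
}
post_rollout_find = {
    "failed": 8,
    "errorCodes": {"DuplicateKey": 6, "Location4567800": 2},
    "total": 200,
}

change = rate_change(failure_rates(baseline_find), failure_rates(post_rollout_find))
for code, delta in change.items():
    print(f"{code}: {delta:+.3f}")
```

Comparing rates rather than raw counts keeps the comparison meaningful when the two windows cover different amounts of traffic.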

Generated at Thu Feb 08 06:01:50 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.