[SERVER-73524] Report a histogram of error codes rather than just an error counter in serverStatus Created: 01/Feb/23  Updated: 16/Jan/24

Status: Open
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: Charlie Swanson
Assignee: Backlog - Service Architecture
Resolution: Unresolved
Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-73561 Consider exception origin when counti... Closed
related to SERVER-78458 Allow statuses/assertions to have bot... Closed
is related to SERVER-65093 Count operation failures in serverSta... Open
Assigned Teams:
Service Arch
Participants:

 Description   

The idea is to replace something like

errors: 5 

with something more like

errors: {
    26: 1,  /* NamespaceNotFound */
    50: 2,  /* MaxTimeMSExpired */
    59401: 2,  /* anonymous error code */
}    

We could do that both for this top level "asserts" section:

"asserts" : {
        "regular" : 0,
        "warning" : 0,
        "msg" : 0,
        "user" : 0,
        "tripwire" : 0,
        "rollovers" : 0
},

But also for the "commands" section, or anywhere else we accumulate errors:

                "commands" : {
                        "buildInfo" : {
                                "failed" : NumberLong(0),
                                "total" : NumberLong(3)
                        },
                        "createIndexes" : {
                                "failed" : NumberLong(0),
                                "total" : NumberLong(3)
                        },
                        "find" : {
                                "failed" : NumberLong(0),
                                "total" : NumberLong(21)
                        },

This will help us gain more insight into what kinds of things are going wrong, for example if we start to see many more occurrences of a particular error code after upgrading to a new version.
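
The upgrade scenario above could be checked by diffing two snapshots of the histogram. A minimal Python sketch, assuming each snapshot is the proposed "errors" document parsed into a dict of error code to cumulative count; the function name error_code_deltas and the "before" values are hypothetical and not part of any MongoDB API:

```python
def error_code_deltas(before, after):
    """Return {code: increase} for every error code whose count grew.

    Both arguments are hypothetical snapshots of the proposed "errors"
    histogram, mapping an error code to its cumulative count.
    """
    deltas = {}
    for code, count in after.items():
        # A code missing from the earlier snapshot had not occurred yet.
        increase = count - before.get(code, 0)
        if increase > 0:
            deltas[code] = increase
    return deltas


# "After" counts come from the example in this ticket; "before" is made up.
before_upgrade = {26: 1, 50: 1}
after_upgrade = {26: 1, 50: 2, 59401: 2}
print(error_code_deltas(before_upgrade, after_upgrade))
```

With a single "errors: 5" counter the same comparison would only show that the total grew, not which code was responsible.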



 Comments   
Comment by Shameek Ray [ 28/Feb/23 ]

Thanks louis.williams@mongodb.com. I'm not sure this needs to be a high-priority addition to our backlog; perhaps medium priority in the quick wins mix. I defer to jason.chan@mongodb.com / blake.oler@mongodb.com for further thoughts.

Comment by Louis Williams [ 09/Feb/23 ]

shameek.ray@mongodb.com, I'm not sure, really. We plan on adding specific counters for specific index build errors, which is a trivial amount of work. Index builds are also very special in the code, and I'm not even sure if this proposal would account for "internal" operations like index builds, or if it would cover only user operations.

Comment by Shameek Ray [ 08/Feb/23 ]

louis.williams@mongodb.com - does the current Graceful Handling of Index Builds project add any new metrics as described in this ticket? If not, would such a new histogram of error codes be even more useful upon the completion of Graceful Handling of Index Builds?

blake.oler@mongodb.com / jason.chan@mongodb.com - this seems like a good ticket to include in the quick win mix.

Comment by Eric Sedor [ 02/Feb/23 ]

After discussing with Bruce, we'd add: please only add itemized errors when they occur. While this does result in a variable-size document, the number of errors we have and the number of unique errors on a given deployment are both low enough that we are not concerned about schema changes affecting retention.
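
One way to picture "only add itemized errors when they occur" is an accumulator that creates an entry the first time a code is seen. This is a rough Python sketch, not the server implementation; the class ErrorHistogram and its methods are hypothetical names:

```python
from collections import Counter


class ErrorHistogram:
    """Hypothetical accumulator that only reports codes that have occurred."""

    def __init__(self):
        self._counts = Counter()

    def record(self, code):
        # An entry is created the first time a code is seen.
        self._counts[code] += 1

    def to_document(self):
        # Keys are emitted as strings, since document field names are strings.
        return {str(code): n for code, n in sorted(self._counts.items())}


hist = ErrorHistogram()
for code in (26, 50, 50, 59401, 59401):
    hist.record(code)
print(hist.to_document())
```

A deployment that has seen no errors reports an empty document, so the section grows only with the number of distinct codes actually hit.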

Comment by Bruce Lucas (Inactive) [ 02/Feb/23 ]

This looks useful, but from a diagnostic perspective I wonder if it might be better to keep the existing overall counters and add a new section alongside them (e.g. errorTypes, failureTypes, assertTypes) with a document like you describe above containing the counts broken out by type.

Also, for usability why not use the readable names as the key, e.g. "NamespaceNotFound", instead of the numbers?
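
To make the two suggestions above concrete, here is a hedged Python sketch of a per-command section that keeps "failed" and "total" as they are and adds a "failureTypes" breakdown keyed by name. The function command_metrics is hypothetical, the name table holds only the two codes named in this ticket, and falling back to the numeric code for unnamed errors is an assumption:

```python
from collections import Counter

# Names taken from the ticket's example; a real server has many more codes.
ERROR_NAMES = {26: "NamespaceNotFound", 50: "MaxTimeMSExpired"}


def command_metrics(total, failure_codes):
    """Build a hypothetical per-command section with a named breakdown."""
    by_name = Counter(
        # Assumption: codes without a known name fall back to the number.
        ERROR_NAMES.get(code, str(code)) for code in failure_codes
    )
    return {
        "failed": len(failure_codes),   # existing overall counter, unchanged
        "total": total,
        "failureTypes": dict(by_name),  # new section alongside it
    }


print(command_metrics(21, [26, 50, 50, 59401]))
```

Because the existing fields keep their names and meaning, anything that already reads "failed" and "total" would be unaffected by the new section.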

Comment by Charlie Swanson [ 01/Feb/23 ]

I don't know whose backlog to put this on, guessed service architecture to start.

Generated at Thu Feb 08 06:24:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.