  1. Core Server
  2. SERVER-65093

Count operation failures in serverStatus broken down by error code

    • Query Execution

      I'm filing this based on my understanding of a proposal from joe.sack. We currently count how many times each command runs in total and how many of those runs fail. However, serverStatus gives no indication of the cause of failure. Query product management is interested in knowing which errors are most frequent. For instance, this came up in the context of tracking how many find or aggregate commands fail because their memory budget was exhausted and spilling to disk was disabled. I could also imagine interest in tracking transient or retryable errors, which can require client-side retries. Or we could have a bug that causes correct queries to fail spuriously with some unnamed internal error code, and this data would let us assess the prevalence of the issue in Atlas.

      I can imagine two different ways of displaying this data. Option one would be to report counts of error codes across all commands:

      errorCodes: {
          DuplicateKey: 123,
          Location4567800: 456,
          ...
      }
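
To illustrate the "most frequent errors" use case, here is a minimal Python sketch that ranks the entries of such a document. It assumes the `errorCodes` section has already been fetched as a plain mapping of error name to count; that section is only proposed in this ticket and does not exist in serverStatus today, and `top_error_codes` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple


def top_error_codes(error_codes: Dict[str, int], n: int = 5) -> List[Tuple[str, int]]:
    """Return the n most frequent error codes, highest count first."""
    # Sort by descending count; break ties by name so the order is deterministic.
    ranked = sorted(error_codes.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


# Sample values copied from the proposed (hypothetical) errorCodes section.
sample = {"DuplicateKey": 123, "Location4567800": 456}
print(top_error_codes(sample))
```

With the sample counts from the ticket, `Location4567800` ranks ahead of `DuplicateKey`.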

      Alternatively, we could take up more space and present a more granular view, with this data broken down on a per-command basis:

      MongoDB Enterprise > db.serverStatus().metrics.commands
      ...
      	"find" : {
      		"failed" : 7,
      		"errorCodes" : {
      			"DuplicateKey" : 3,
      			"Location4567800" : 4
      		},
      		"total" : NumberLong(100)
      	},
      ...
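
The bookkeeping behind the per-command view could be sketched as follows. This is a Python illustration only, not the server's implementation (the real counters would live in the C++ server); the class `CommandErrorCounters` and its method names are hypothetical. It keeps one counter of error codes per command and shows that the coarser across-all-commands view can be derived from the same data.

```python
from collections import Counter
from typing import Dict, Optional


class CommandErrorCounters:
    """Hypothetical per-command counters with failures broken down by error code."""

    def __init__(self) -> None:
        self._total: Counter = Counter()          # command name -> number of runs
        self._failed: Counter = Counter()         # command name -> number of failed runs
        self._by_code: Dict[str, Counter] = {}    # command name -> error code -> count

    def record(self, command: str, error_code: Optional[str] = None) -> None:
        """Record one run of `command`; pass `error_code` only if the run failed."""
        self._total[command] += 1
        if error_code is not None:
            self._failed[command] += 1
            self._by_code.setdefault(command, Counter())[error_code] += 1

    def per_command(self) -> Dict[str, dict]:
        """Granular view: failed, errorCodes, and total for each command."""
        return {
            cmd: {
                "failed": self._failed[cmd],
                "errorCodes": dict(self._by_code.get(cmd, {})),
                "total": total,
            }
            for cmd, total in self._total.items()
        }

    def across_commands(self) -> Dict[str, int]:
        """Coarse view: error-code counts summed over all commands."""
        combined: Counter = Counter()
        for codes in self._by_code.values():
            combined.update(codes)
        return dict(combined)


# Reproduce the "find" numbers from the sample output: 100 runs, 7 failures.
counters = CommandErrorCounters()
for _ in range(93):
    counters.record("find")
for _ in range(3):
    counters.record("find", "DuplicateKey")
for _ in range(4):
    counters.record("find", "Location4567800")
print(counters.per_command()["find"])
```

One consequence of this layout is that the per-command option subsumes the first option at the cost of extra space per command, which is the trade-off described above.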
      

            Assignee: backlog-query-execution (Backlog - Query Execution)
            Reporter: david.storch@mongodb.com (David Storch)
            Votes: 0
            Watchers: 12