[SERVER-51668] Report total CPU time spent by operations in serverStatus Created: 15/Oct/20  Updated: 29/Oct/23  Resolved: 09/Nov/20

Status: Closed
Project: Core Server
Component/s: Diagnostics
Affects Version/s: None
Fix Version/s: 4.9.0-alpha0

Type: Task
Priority: Major - P3
Reporter: Eric Milkie
Assignee: Louis Williams
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Problem/Incident
Backwards Compatibility: Fully Compatible
Sprint: Execution Team 2020-11-16
Participants:
Linked BF Score: 50

 Description   

Report the globally-aggregated CPU time spent by user operations to serverStatus:

{
  ...
  resourceConsumption: { cpuNanos: <int> }
}

The "resourceConsumption" section is not included by default, and will only be available if the aggregateOperationResourceConsumptionMetrics setParameter is true.
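
As a rough illustration of how a client might consume this field, the sketch below reads cpuNanos from a serverStatus-style document and treats a missing section as "metrics aggregation not enabled". The helper name cpu_nanos_from_status and the sample documents are hypothetical; only the resourceConsumption.cpuNanos path comes from the description above.

```python
from typing import Any, Mapping, Optional


def cpu_nanos_from_status(status: Mapping[str, Any]) -> Optional[int]:
    """Return resourceConsumption.cpuNanos, or None if the section is absent.

    The section is only expected when the
    aggregateOperationResourceConsumptionMetrics setParameter is true,
    so callers should handle None rather than assume the key exists.
    """
    section = status.get("resourceConsumption")
    if not isinstance(section, Mapping):
        return None
    value = section.get("cpuNanos")
    return int(value) if value is not None else None


# Hypothetical sample documents, not real server output.
enabled = {"ok": 1, "resourceConsumption": {"cpuNanos": 1_500_000}}
disabled = {"ok": 1}

print(cpu_nanos_from_status(enabled))
print(cpu_nanos_from_status(disabled))
```

Against a live server, the input document would typically come from running the serverStatus command through a driver; the parsing logic stays the same.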



 Comments   
Comment by Githook User [ 09/Nov/20 ]

Author:

{'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}

Message: SERVER-51668 Report total CPU time spent by operations in serverStatus
Branch: master
https://github.com/mongodb/mongo/commit/a09a1afbe18353ac5c865a643c97029e5cba4925

Comment by Bruce Lucas (Inactive) [ 05/Nov/20 ]

OK, thanks.

Comment by Louis Williams [ 05/Nov/20 ]

The motivation for doing this at all was to collect the CPU time and potentially backport so that we can compare against previous versions.

After talking with milkie, I'm not sure if there is even a use case for globally reporting all of these metrics in serverStatus.

bruce.lucas, I'm going to revise this ticket to only report CPU time when the setParameter is enabled. I will also disable reporting this metric when FTDC invokes serverStatus. 

Comment by Bruce Lucas (Inactive) [ 05/Nov/20 ]

My sense is that these are of limited diagnostic value, so it would be best not to include them in FTDC. From that perspective the options would be:

  • Don't include in serverStatus at all
  • Include in serverStatus but with that section disabled by default (using the existing mechanism for including/excluding serverStatus sections, independent of the aggregateOperationResourceConsumptionMetrics parameter)
  • Include in serverStatus with that section enabled by default, but disabled when FTDC invokes serverStatus (there's a simple mechanism in FTDC for that).

I don't have an opinion on which of those options is preferable.
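
For reference, the existing include/exclude mechanism mentioned in the second option works by adding section names with a value of 1 or 0 to the serverStatus command document. A minimal sketch of building such a command follows. The helper build_server_status_command is hypothetical, and whether a resourceConsumption section would honor this toggle is an assumption, since the description gates it on the setParameter.

```python
from typing import Dict, Iterable


def build_server_status_command(
    include: Iterable[str] = (),
    exclude: Iterable[str] = (),
) -> Dict[str, int]:
    """Build a serverStatus command document with per-section toggles.

    The command name is inserted first because the first key of a
    command document identifies the command.
    """
    include = list(include)
    exclude = list(exclude)
    overlap = set(include) & set(exclude)
    if overlap:
        raise ValueError(f"sections both included and excluded: {sorted(overlap)}")

    command: Dict[str, int] = {"serverStatus": 1}
    for name in include:
        command[name] = 1  # request a section that is off by default
    for name in exclude:
        command[name] = 0  # suppress a section that is on by default
    return command


# Section names here are illustrative only.
cmd = build_server_status_command(include=["resourceConsumption"], exclude=["metrics"])
print(cmd)
```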

Comment by Louis Williams [ 04/Nov/20 ]

"How would a globally aggregated cpuMillis differ from the CPU metrics we already collect - is it just that the latter includes things like eviction server threads and checkpoints not included in cpuMillis? I presume cpuMillis does include eviction done by application threads, which I think can often be the bulk of eviction. Any other differences?"

bruce.lucas, the globally aggregated CPU time would only account for user operations and only a specific set of commands. I'm not sure how useful this will be for debugging.

"Since serverStatus already has a mechanism for enabling or disabling specific sections when you issue the serverStatus command, do we really want a separate setParameter for this?"

The setParameter already exists to support global metrics aggregation, so we will only report this information in serverStatus if we are collecting it in the first place.

Comment by Bruce Lucas (Inactive) [ 16/Oct/20 ]

How would a globally aggregated cpuMillis differ from the CPU metrics we already collect - is it just that the latter includes things like eviction server threads and checkpoints not included in cpuMillis?  I presume cpuMillis does include eviction done by application threads, which I think can often be the bulk of eviction. Any other differences?

Since serverStatus already has a mechanism for enabling or disabling specific sections when you issue the serverStatus command, do we really want a separate setParameter for this?

FTDC includes everything that is in serverStatus by default, and has provision to enable or disable specific sections, so it is easy to add an entire section like this (or would be if this just used the normal serverStatus mechanism for including or excluding sections). Adding only specific metrics from a section would, I think, require reworking, and is probably not worth it.

 

Comment by Louis Williams [ 16/Oct/20 ]

bruce.lucas, aside from cpuMillis, I don't think these would be that useful in FTDC. The metrics are derived from other, related metrics we already collect, but I don't think they would give us much more information that would aid in debugging anything.

cpuMillis will aggregate all of the CPU time taken by all user operations, which could be useful for debugging some failures. How hard is this to add to FTDC?
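
To make the idea of aggregating per-operation CPU time concrete, here is a small sketch of a global counter that sums thread CPU time over instrumented operations. This is only an illustration and not the server's implementation; the class GlobalCpuCounter and its method names are hypothetical. It also shows why such a total excludes work done on threads that are never instrumented, such as background threads.

```python
import threading
import time
from contextlib import contextmanager
from typing import Callable, Iterator


class GlobalCpuCounter:
    """Sum the CPU time consumed by instrumented operations."""

    def __init__(self, clock: Callable[[], int] = time.thread_time_ns) -> None:
        # The default clock reports CPU time of the calling thread in
        # nanoseconds; it is injectable so the counter can be tested.
        self._clock = clock
        self._lock = threading.Lock()
        self._total_nanos = 0

    @contextmanager
    def operation(self) -> Iterator[None]:
        """Measure one operation on the current thread and add it to the total."""
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            with self._lock:
                self._total_nanos += elapsed

    @property
    def cpu_nanos(self) -> int:
        with self._lock:
            return self._total_nanos

    @property
    def cpu_millis(self) -> int:
        # Integer milliseconds, truncating any sub-millisecond remainder.
        return self.cpu_nanos // 1_000_000


counter = GlobalCpuCounter()
with counter.operation():
    sum(i * i for i in range(100_000))  # stand-in for a user operation
print(counter.cpu_nanos)
```

Only code wrapped in operation() contributes to the total, which mirrors the point that a globally aggregated figure covers user operations rather than all CPU used by the process.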

Comment by Bruce Lucas (Inactive) [ 16/Oct/20 ]

Thanks louis.williams. Do you have an opinion on whether it would be useful to include these in ftdc? My general sense is that since they are global they replicate metrics already collected elsewhere, mostly by WT, but I'm not sure of that. Possible exceptions might be idxEntriesRead and keysSorted. I'll also query other members of my team.

Comment by Bruce Lucas (Inactive) [ 15/Oct/20 ]

I can't figure out from that description what the proposal for serverStatus is. Can we enumerate (at some point) the specific fields we're proposing to add to serverStatus? Also whether these are included by default or enabled by a parameter, and whether we will include them in ftdc.

Comment by Eric Milkie [ 15/Oct/20 ]

All of the metrics in the project will be added, but aggregated globally rather than by-database.

Comment by Bruce Lucas (Inactive) [ 15/Oct/20 ]

Can we post on this ticket a mini-design i.e. what metrics we're planning on adding?
