[SERVER-54229] Resharding metrics output from the currentOp command should be reset when a new operation is started
Created: 03/Feb/21  Updated: 29/Oct/23  Resolved: 23/Mar/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 5.0.0-rc0

Type: New Feature
Priority: Major - P3
Reporter: Lamont Nelson
Assignee: Kshitij Gupta
Resolution: Fixed
Votes: 0
Labels: PM-234-M3, PM-234-T-autocommits
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-54483 resharding metrics for serverStatus s... Closed
Backwards Compatibility: Fully Compatible
Sprint: Sharding 2021-03-08, Sharding 2021-03-22, Sharding 2021-04-05
Participants:
Story Points: 2

 Description   

The following metrics should be reset when a new resharding operation is started:

    totalOperationTimeElapsedMillis: int64
    remainingOperationTimeEstimatedMillis: int64
 
    approxDocumentsToCopy: int64
    documentsCopied: int64
    approxBytesToCopy: int64
    bytesCopied: int64
    totalCopyTimeElapsedMillis: int64
 
    oplogEntriesFetched: int64
    oplogEntriesApplied: int64
    totalApplyTimeElapsedMillis: int64
 
    countWritesDuringCriticalSection: int64
    totalCriticalSectionTimeElapsedMillis: int64
 
    // Note that 0 corresponds to kUnused for these enum values and therefore
    // won’t be ambiguous when reset.
    coordinatorState: int32
    donorState: int32
    recipientState: int32
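
The reset described above can be sketched roughly as follows. This is only an illustration in Python, not the server's C++ implementation: the class name `ReshardingCurrentOpMetrics`, the method `on_new_operation`, and the constant `K_UNUSED` are hypothetical, and resetting every counter to 0 is an assumption (the ticket only states that 0 maps to kUnused for the state enums).

```python
from dataclasses import dataclass, fields

# Per the ticket, 0 corresponds to kUnused for the state enums.
K_UNUSED = 0


@dataclass
class ReshardingCurrentOpMetrics:
    """Hypothetical holder for the per-operation metrics listed in the ticket."""

    totalOperationTimeElapsedMillis: int = 0
    remainingOperationTimeEstimatedMillis: int = 0

    approxDocumentsToCopy: int = 0
    documentsCopied: int = 0
    approxBytesToCopy: int = 0
    bytesCopied: int = 0
    totalCopyTimeElapsedMillis: int = 0

    oplogEntriesFetched: int = 0
    oplogEntriesApplied: int = 0
    totalApplyTimeElapsedMillis: int = 0

    countWritesDuringCriticalSection: int = 0
    totalCriticalSectionTimeElapsedMillis: int = 0

    coordinatorState: int = K_UNUSED
    donorState: int = K_UNUSED
    recipientState: int = K_UNUSED

    def on_new_operation(self) -> None:
        """Restore every field to its default (assumed to be 0) when a new operation starts."""
        for f in fields(self):
            setattr(self, f.name, f.default)
```

Because the state enums reset to kUnused rather than to a value that also denotes a live state, a reader of the output can tell a freshly reset record from one belonging to an operation in progress.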



 Comments   
Comment by Githook User [ 23/Mar/21 ]

Author: Kshitij Gupta <kshitij.gupta@mongodb.com> (kshitijng)

Message: SERVER-54229: Resharding metrics output from the currentOp command
should be reset when a new operation is started.

SERVER-54483: Resharding metrics for serverStatus should report
cumulative totals for the process.
Branch: master
https://github.com/mongodb/mongo/commit/f447e57dd2c5dbb39feef9cfe071ff1cc1de54d5

Comment by Lamont Nelson [ 16/Feb/21 ]

Based on our conversation last week, I created SERVER-54483 to report process-lifetime totals in the serverStatus command. I'm modifying the description of this ticket to reset the counts in the currentOp command output.
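
The split between the two tickets could look something like the sketch below: one set of counters that is reset per operation (for currentOp) and one that accumulates for the life of the process (for serverStatus). The class `ReshardingCounters` and its methods are hypothetical names used for illustration only, and only two of the counters are shown.

```python
class ReshardingCounters:
    """Hypothetical pairing of per-operation and process-lifetime counters."""

    def __init__(self) -> None:
        # Reported by currentOp; reset when a new operation starts.
        self.current_op = {"documentsCopied": 0, "bytesCopied": 0}
        # Reported by serverStatus; never reset while the process runs.
        self.cumulative = {"documentsCopied": 0, "bytesCopied": 0}

    def on_new_operation(self) -> None:
        """Zero only the per-operation counters."""
        for name in self.current_op:
            self.current_op[name] = 0

    def record_copy(self, docs: int, nbytes: int) -> None:
        """Account for copied documents in both views."""
        for view in (self.current_op, self.cumulative):
            view["documentsCopied"] += docs
            view["bytesCopied"] += nbytes


counters = ReshardingCounters()
counters.record_copy(docs=100, nbytes=2048)   # first operation
counters.on_new_operation()                   # second operation begins
counters.record_copy(docs=40, nbytes=512)
```

After the second operation's copy, `counters.current_op["documentsCopied"]` is 40 while `counters.cumulative["documentsCopied"]` is 140, so downstream tooling that expects monotonic counters can read the cumulative view.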

Comment by Bruce Lucas (Inactive) [ 08/Feb/21 ]

Normally we take the derivative of cumulative counters for display and report it in units like "documents/s", "bytes/s", etc. This makes it easy to see whether an operation is active and to correlate its performance with other metrics such as CPU, disk, and WT operations when doing performance analysis. When a cumulative counter like that is reset, the result is an artificial, large negative spike that is misleading and interferes with things like taking averages.
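
To make the spike concrete, here is a small sketch of the derivative computation applied to made-up samples. The function name `rates`, the one-second sampling interval, and the sample values are all assumptions for illustration; they are not taken from FTDC or any real tool.

```python
from statistics import mean


def rates(samples, interval_s=1.0):
    """Per-interval rate of change between consecutive counter samples."""
    return [(b - a) / interval_s for a, b in zip(samples, samples[1:])]


# A counter that only ever grows: 100 documents copied per second.
cumulative = [0, 100, 200, 300, 400, 500]

# The same workload, but the counter is reset to 0 partway through.
with_reset = [0, 100, 200, 0, 100, 200]

print(rates(cumulative))        # [100.0, 100.0, 100.0, 100.0, 100.0]
print(rates(with_reset))        # [100.0, 100.0, -200.0, 100.0, 100.0]
print(mean(rates(with_reset)))  # 40.0
```

The underlying copy rate never changed, yet the reset produces a -200 sample and drags the average from 100 down to 40.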

Understood about not having per-collection metrics (and that's good), but I don't see the connection between that and resetting the metrics. What are you interested in seeing that wouldn't be possible if the metrics aren't reset?

The lastCommittedTransaction metric isn't an example of what I was asking about, as it's not a cumulative metric. We actually recently eliminated it from FTDC (SERVER-53609, for reasons described there but not relevant to this question) and were OK with that because "last x" type metrics generally have limited diagnostic value: they may get overwritten by the next event before they're even seen for an event of interest, and the last of some particular observable is generally no more interesting than all the previous ones.

I'd be reluctant to inflate FTDC with both a "cumulative" and a "last op" version of the same counters. If the most recent occurrence truly is of some special interest, the log file might be a better place to get that information. Logging this information on completion of a resharding operation would ensure that it is available for all operations, including the most recent as of any particular time.

Comment by Max Hirschhorn [ 03/Feb/21 ]

Hi bruce.lucas, we're having resharding report metrics for the actively running operation in serverStatus as part of the "shardingStatistics.resharding" section (see SERVER-52773). This is because we want to capture these metrics in FTDC. The idea behind resetting them when a new resharding operation is started is to avoid reporting per-collection metrics in serverStatus because that leads to expensive schema changes in FTDC.

SERVER-52730 will enforce that there can be at most one resharding operation active in the whole cluster. That restriction is why we can treat the "global" metrics as being for a single collection and single resharding operation.

I believe there's some prior art with lastCommittedTransaction for not tracking global measurements in serverStatus and FTDC.

Do you feel it would be more clear to split "shardingStatistics.resharding" into "shardingStatistics.resharding" and "shardingStatistics.resharding.lastOp" (or some similar name)?

Comment by Bruce Lucas (Inactive) [ 03/Feb/21 ]

Generally serverStatus metrics that represent cumulative counters or cumulative elapsed time are expected by downstream tooling to be cumulative since the server started, so I'm not sure I would expect those to be reset. Are there any other instances of cumulative server counters that get reset, for comparison?
