[SERVER-54960] ReshardingMetrics time smearing Created: 04/Mar/21  Updated: 29/Oct/23  Resolved: 15/Jun/21

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: None
Fix Version/s: 5.0.0-rc2, 5.1.0-rc0

Type: Bug Priority: Minor - P4
Reporter: Billy Donahue Assignee: Billy Donahue
Resolution: Fixed Votes: 0
Labels: PM-234-M3, PM-234-T-autocommits
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0
Participants:
Story Points: 1

 Description   

ReshardingMetrics::OperationMetrics::TimeInterval is an "active" object, each with its own clockSource, and this means they are essentially independent unlinked stopwatches.

This is redundant, but it also presents the architectural problem that ReshardingMetrics now needs a ServiceContext*, which is a much heavier dependency than necessary. All ReshardingMetrics really needs is one ClockSource*. The timeIntervals do not need their own copy of this ClockSource*.

Since each TimeInterval is calling clockSource->now() internally, the metrics are susceptible to time smearing errors as the intervals are individually started and stopped with incoherent timing. While this doesn't have practical implications, it would perhaps make testing more difficult. We wouldn't for example be able to assert that the time spent in each phase add up to the total operation time. We would need to estimate some error terms or have flaky tests.

A small change that would solve these problems would be to keep the ClockSource* as a concern only of the ReshardingMetrics object, and when we lock its _mutex, generate a single now() call, so that everything the ReshardingMetrics records will be based on the instant at which its mutex was locked. I think this will simplify things. The TimeInterval methods start, end, and tryEnd, and duration would all accept a Date_t now instead of making their own clockSource->now() calls internally.



 Comments   
Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 16/Jun/21 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-54960 ReshardingMetrics time smearing (and refactor)
Branch: SERVER-34632
https://github.com/mongodb/mongo/commit/431106074ec36920816a002296f10d44fdf9d971

Comment by Githook User [ 15/Jun/21 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-54960 ReshardingMetrics time smearing (and refactor)

(cherry picked from commit 431106074ec36920816a002296f10d44fdf9d971)
Branch: v5.0
https://github.com/mongodb/mongo/commit/dd1c824fbddb22636bb9e67f641cfd1662828eee

Comment by Githook User [ 14/Jun/21 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-54960 ReshardingMetrics time smearing (and refactor)
Branch: master
https://github.com/mongodb/mongo/commit/431106074ec36920816a002296f10d44fdf9d971

Comment by Billy Donahue [ 12/Jun/21 ]

Code Review
https://mongodbcr.appspot.com/791250005/

Comment by Billy Donahue [ 11/Jun/21 ]

I'd like to take this one. There are other synchronization bugs in ReshardingMetrics.
I feel the implementation has the wrong shape overall and needs an (internal) refactor.

ReshardingMetrics::getOperationRemainingTime()

appears to be unsynchronized. i feel like I filed a ticket about that too but I can't find it.
I'll throw that in while I'm at it.

Generated at Thu Feb 08 05:35:02 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.