Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 7.1.0-rc0, 7.0.7
Affects Version/s: None
Component/s: None
Labels:
None

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Backport Requested:

v7.0
Sprint:
Service Arch 2023-05-01, Service Arch 2023-05-15
Linked BF Score:
22
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

OperationCPUTimerTest::TestReset has a step where it starts a timer, busy waits for 1ms, and then checks to see if the timer observed 1ms of elapsed CPU time. This has proven to be a flaky assertion, with the elapsed time sometimes being less than 1ms. Most failures see an elapsed CPU time in the 800,000ns range, but some go as low as 150,000ns.

My understanding of why this might happen is the following:

The busy waiting is implemented by creating an atomic boolean and continually reading it in a loop, while having a separate thread use std::current_thread::sleep_for to sleep for 1ms before setting the atomic.
Under the hood, sleep_for uses nanosleep, which is documented as using Linux's CLOCK_MONOTONIC.
The timer being tested calculates elapsed time by subtracting the two measurements taken during start() and stop(). Both of these measurements are made using Linux's CLOCK_THREAD_CPUTIME, which is based on the number of CPU cycles observed.
Differences in the ways these two clocks operate might lead to inconsistencies in the amount of time spent sleeping and the amount of time perceived as being used by the CPU.

In particular, I have a few ideas about what these differences might be:

MONOTONIC_CLOCK always moves forward and isn't subject to regular changes of the system clock, but it can be adjusted by ntpd. If ntpd were slewing the clock forward, it could make the thread wake up sooner than expected, causing the observation of CPU time to stop earlier than planned.
The virtualization used for the Evergreen hosts could affect the measurement of a thread's CPU time. While the test thread is "busy waiting", the hypervisor could steal some CPU time away for other hosts running on the same hardware, decreasing the number of cycles that CLOCK_THREAD_CPUTIME observes as being performed by the thread. Meanwhile, the thread using CLOCK_MONOTONIC still sleeps for the normal 1ms. When checking how much CPU time was used, the timer will then under-report elapsed time due to the number of cycles being less than was expected given its idea of the virtual CPU frequency.
- See here for an overview of some of the issues with time keeping and virtualization: https://www.kernel.org/doc/Documentation/virtual/kvm/timekeeping.txt

I can't reproduce this test failure, so I can't be sure that this diagnosis is correct, but the implementation of PosixTimer and the test are simple enough that it seems unlikely that these failures are due to a bug in the server. In light of that, a solution to this would be to make the test accept a wider range of observed CPU time, including time less than the 1ms that was meant to be busy-waited.

related to

SERVER-67990 Modify TestTimerElapsedAfterMultipleDetachAttach to run multiple times

Closed

SERVER-70091 OperationCPUTimerTest::TestReset increase busyWait time

Closed

Assignee:: Patrick Freed
Reporter:: Patrick Freed
Participants:: Githook User, Patrick Freed
Votes:: 0 Vote for this issue
Watchers:: 2 Start watching this issue

Created:: Apr 28 2023 10:32:23 PM UTC
Updated:: Feb 22 2024 03:24:10 PM UTC
Resolved:: May 01 2023 05:09:39 PM UTC

Details

Description

Attachments

Issue Links

Forms

Activity

People

Dates