[SERVER-54586] Failpoints are no longer free when not enabled Created: 17/Feb/21  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Internal Code
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Mathias Stearn Assignee: Backlog - Service Architecture
Resolution: Unresolved Votes: 2
Labels: sa-remove-fv-backlog-22
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Service Arch
Participants:
Story Points: 4

 Description   

One of the defining purposes of failpoints is that they should be free when not enabled, in order to make it reasonable to liberally use them even in fairly hot codepaths.

There are at least two issues adding unnecessary costs to disabled failpoints:

  1. There is an extra memory read to invariant that the fail point is ready, even in production builds.
  2. The fast-path check is no longer inlined, so it always calls the out-of-line function. And because the out-of-line function contains the whole code path rather than just the fast check, it needs to do all of the expensive prolog and epilog, including stack-smash protection.

In addition to solving these, we should evaluate what it would take to give failpoint a constexpr constructor (so that it is created during const-init rather than dynamic-init), and ideally a trivial destructor, so that it never gets destroyed.



 Comments   
Comment by Billy Donahue [ 27/Apr/21 ]

Just a note that the commit
https://github.com/mongodb/mongo/commit/e0ce358909fc6a71ed24470685216d2a91855050
was misattributed to Ian Boros. It was from me and there was some kind of git amend snafu on my end during development. Apologies.

Sending the ticket back to "Open" status rather than closing it fixed, just because I think there's more work here than just writing a benchmark. that was just to baseline the problem.

Comment by Githook User [ 27/Apr/21 ]

Author:

{'name': 'Ian Boros', 'email': 'ian.boros@mongodb.com', 'username': 'puppyofkosh'}

Message: SERVER-54586 FailPoint benchmark
Branch: master
https://github.com/mongodb/mongo/commit/e0ce358909fc6a71ed24470685216d2a91855050

Comment by Billy Donahue [ 20/Apr/21 ]

Benchmark:
https://mongodbcr.appspot.com/766860020/

Doesn't really show a cost beyond that of an atomic load.

 42 MONGO_FAIL_POINT_DEFINE(failPointBench);
 43 AtomicWord<bool> atomic;
 44 
 45 void BM_AtomicLoadAndBranch(benchmark::State& state) {
 46     atomic.store(!!state.range(0));
 47     for (auto _ : state) {
 48         if (MONGO_unlikely(atomic.load())) {
 49             benchmark::DoNotOptimize(0);
 50         }
 51     }
 52 }
 53 
 54 void BM_FailPointShouldFail(benchmark::State& state) {
 55     failPointBench.setMode(state.range(0) ? FailPoint::alwaysOn : FailPoint::off);
 56     for (auto _ : state) {
 57         if (MONGO_unlikely(failPointBench.shouldFail())) {
 58             benchmark::DoNotOptimize(0);
 59         }
 60     }
 61     failPointBench.setMode(FailPoint::off);
 62 }

BM_AtomicLoadAndBranch/0_mean        0.499 ns        0.497 ns           10   (set to false: likely branch)
BM_AtomicLoadAndBranch/1_mean        0.489 ns        0.489 ns           10   (set to false: unlikely branch)
BM_FailPointShouldFail/0_mean        0.475 ns        0.475 ns           10   (set to FailPoint::off)
BM_FailPointShouldFail/1_mean         18.0 ns         18.0 ns           10   (set to FailPoint::alwaysOn)

Comment by Billy Donahue [ 19/Feb/21 ]

Notes from Slack discussion:

Regarding constexpr and early accessor checks:

Regarding inlining the fast path:

  • compilers make the inline/out-of-line decision on a function granularity. So if you want a fast check to be inlined, you need it to live in an outer function that calls the slow path. And usually put NOINLINE and COLD annotations on the slow path function.
  • the function prologue/epilog is based on everything the function does. So even if the fast check doesn't need to save any registers or do the stack smash check, if the slow path is in the same function (after inlining) it will still have to pay for the expensive setup/teardown. The solution is the same: pull out the fast path check to a fronting function.
Comment by Billy Donahue [ 19/Feb/21 ]

In SERVER-48146, I added one more relaxed-load to make sure the failpoint isn't being accessed before it is initialized, because I agree that they are most useful when they have few restrictions on their placement, so I want them to available not just in hot paths, but in early paths.

But I also added an _impl() call in front of most accessors of interior state, and it looks like that is just too much to ask for fast paths according to this ticket.

A trivial destructor was proposed above but FailPoint already has a minimal destructor, as the main guts are moved into the _impl, which is made with placement new and not destroyed in production failpoints. This which was important for SERVER-48146.

The fast path can be made faster again by making an atomic bool that lives outside the _impl(), so you don't have to call _impl() first to read the enable/disable state therein. I didn't think that FailPoints were so super-performant that we can't make a function call and come back. It's clear that the FailPoint API with all its perf-jockeying needs a benchmark if we're going to keep that level of performance up. It would be nice to put before-and-after numbers up when making a change.

The point about inlining I didn't understand, though. "And because the out-of-line function contains the whole code path rather than just the fast check, it needs to do all of the expensive prolog and epilog, including stack-smash protection." It sounds like there's been real impact observed, but I didn't understand how the length of the inlined function matters to the prolog/epilog, as long as it has an early return. That's interesting.

I tried to make a constexpr ctor in an effort to make them guaranteed init-safe but I think I couldn't figure out how to do it. I think some of the ideas I had at the time required users to access the failpoint with deref syntax and I just didn't want to go there at the time. I don't remember.

Generated at Thu Feb 08 05:33:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.