[SERVER-43119] FailPoint API can be significantly improved in C++17 Created: 01/Sep/19  Updated: 29/Oct/23  Resolved: 10/Sep/19

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: None
Fix Version/s: 4.3.1

Type: Improvement Priority: Major - P3
Reporter: Billy Donahue Assignee: Billy Donahue
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Problem/Incident
causes SERVER-43317 Update failPoint documentation Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.2
Sprint: Dev Tools 2019-09-09, Dev Tools 2019-09-23
Participants:
Linked BF Score: 20

 Description   

Our FailPoint API has several significant deficiencies.
It's hard to itemize them in a Jira ticket, but I'll try.

I've made code changes to show where I feel it should go.

https://mongodbcr.appspot.com/512010001/
https://mongodbcr.appspot.com/475530001/ (enterprise)

The FailPoint::shouldFailOpenBlock and FailPoint::shouldFailCloseBlock are very hard to explain, and the docs are (necessarily) unclear. Nobody uses these functions anyway. They use the similarly obscure variadic MONGO_FAIL_POINT_BLOCK{_IF} macros. Can make these private and expose only 4 block-related functions: execute(F), executeIf(F,Pred), scoped(), and scopedIf(Pred) for relatively infrequent situations requiring finer flow control. RetCode is complex and need not be public. Many uses of the MONGO_FAIL_POINT_BLOCK macro would be more efficiently rendered with a predicate but weren't, I think because it's hard to figure out how to use this old API.

void MONGO_FAIL_POINT_PAUSE_WHILE_SET(FailPoint& failPoint);
void MONGO_FAIL_POINT_PAUSE_WHILE_SET_OR_INTERRUPTED(FailPoint& failPoint, OperationContext*opCtx);

These are not even macros, but they used to be and were rewritten as functions without being renamed (out of laziness I guess). Anyway these easily can be member functions of FailPoint: .pauseWhileSet() and .pauseWhileSet(opCtx).

#define MONGO_FAIL_POINT(symbol) MONGO_unlikely(symbol.shouldFail())
Doesn't pull its weight and has a confusing name. Removed. We don't make a new macro for every unlikely predicate in our codebase. It's sufficient to give advice in the docs for shouldFail(), advising users to wrap it in MONGO_unlikely. I've inlined all existing uses of this macro to seed the example pool.

The most important upgrade in this ticket: Bring class ScopedFailPoint inside the FailPoint class, as FailPoint::Scoped. It can be easily understood and documented as a refcount holder. Mandatory Copy Elision (MCE) in C+17 makes this much easier to write properly. We can just return an immovable Scoped object from the scoped() and scopedIf(pred) members and it will have proper RAII refcounting semantics. We can get rid of the very weird behavior of isActive() returning true "at most once" and using the ScopedFailPoint in for loops. The for loop was just a hack to get a scoped decl for the loop variable, but you can do that with `if` statements now in C+17, so all the trickery can be deleted.

Get rid of std::tuple and std::tie in parseBSON. Easier and clearer to make a struct representing the parsed data.

We don't need a MONGO_INITIALIZER for each FailPoint. The FailPoint initialization (and possible export as ServerParameters) can all happen in one initializer. These are just bloating the mongo initializer graph. They can all get added directly to the `globalFailPointRegistry` by a static-duration registration object's initializer. There's no user-visible API change for this, as all the registered fail points are created with a MONGO_FAIL_POINT_DEFINE macro.

Don't need the MONGO_FAIL_POINT_DECLARE(name) macro. Users can write extern FailPoint name; directly where needed. It's actually easier to write. Too many macros, and this one in particular used to be completely wrong, making a definition rather than a mere extern declaration. Of course there's no room for misinterpretation when what users type out is native C++.

FailPointEnableBlock needs an explicit on its 1-arg ctor.



 Comments   
Comment by Billy Donahue [ 28/Oct/19 ]

ok let's discuss possibilities on BACKPORT-5479

Comment by Billy Donahue [ 30/Sep/19 ]

Dan, sorry I missed this comment when you posted it, and you're right. I'm trying to make the thing EASIER to use, so the last thing I'd want is for the upgrades to catch users off guard. Sorry about that. I'm going to combine all the failpoint APIs into one header, centralize and refresh its documentation there, and post a heads-up to kernel@ before committing that change (SERVER-43317).

Comment by Githook User [ 10/Sep/19 ]

Author:

{'name': 'Billy Donahue', 'username': 'BillyDonahue', 'email': 'billy.donahue@mongodb.com'}

Message: SERVER-43119 FailPoint cleanup

  • Don't use MONGO_INITIALIZER to declare each fail point.
    We only need one init task in total: freeze and iterate the registry.
Comment by Githook User [ 10/Sep/19 ]

Author:

{'name': 'Billy Donahue', 'username': 'BillyDonahue', 'email': 'billy.donahue@mongodb.com'}

Message: SERVER-43119 FailPoint cleanup
Branch: master
https://github.com/10gen/mongo-enterprise-modules/commit/c366ab7e33fdeb4365df8bb2f4b902dcb7cc8f22

Comment by Billy Donahue [ 03/Sep/19 ]

Code Reviews:

https://mongodbcr.appspot.com/512010001/
https://mongodbcr.appspot.com/475530001/ (enterprise)

Generated at Thu Feb 08 05:02:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.