[SERVER-49434] Mark all 'getAsync' calls as noexcept in network_interface_tl.cpp Created: 10/Jul/20  Updated: 29/Oct/23  Resolved: 30/Sep/20

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: None
Fix Version/s: 4.8.0

Type: Bug Priority: Major - P3
Reporter: Janna Golden Assignee: Billy Donahue
Resolution: Fixed Votes: 0
Labels: servicearch-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-49435 uassert in NetworkInterfaceTL::setTim... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Service arch 2020-09-07, Service arch 2020-10-05
Participants:
Linked BF Score: 37

 Description   

getAsync is marked noexcept, and it would be helpful to mark any calls to this in network_interface_tl.cpp as such. Some of the calls to getAsync are marked as such and some are not currently which has made it more difficult to diagnose BFs (i.e. BF-18043 and BF-17188).



 Comments   
Comment by Githook User [ 30/Sep/20 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-49434 Future::getAsync documentation
Branch: master
https://github.com/mongodb/mongo/commit/c33af563f2520e9b10bbfd6d48aec180c95299a5

Comment by Billy Donahue [ 28/Sep/20 ]

CR (docs only) https://mongodbcr.appspot.com/686540003/

Comment by Billy Donahue [ 28/Sep/20 ]

I'm trying to improve the comment on Future::getAsync. I can't tell if it needs the func to return exactly void or if there's some "shenanigans" whereby other returns are ok as long as they map to void under the UnwrappedType meta-transform. The latter is literally what the REQUIRES clause says, but intent is unclear.
A getAsync(func) where func returns a Status would work because UnwrappedType<Status> is void. At least that's how it seems to read, but that doesn't quite make sense, and doesn't match the behavior of ExecutorFuture, which insists on func returning exactly void and nothing else, according to its particular internal static_assert, which does not appear in its REQUIRES clause.
It's all pretty confusing so I want to be the last person to have to read the implementation to figure out what the real requirements on the getAsync callback are.

Comment by Billy Donahue [ 03/Sep/20 ]

Some discussion has ensued. The documentation around Future and .getAsync in particular needs to be improved. I will start there. I don't like sprinkling the C++ noexcept keyword into the lambda expressions at specific call sites as a form of documentation, because it has a real effect on the generated code and it doesn't solve the underlying cognitive problem.

 

/**
 * This ends the Future continuation chain by calling a callback on completion. Use this to
 * escape back into a callback-based API.
 *
 * For now, the callback must not fail, since there is nowhere to propagate the error to.
 * TODO decide how to handle func throwing.
 */

That's not obvious. "Must not fail" is not precise enough.

There are no usage examples, or they're undiscoverable. Even the tests for Future don't show how to actually use it. They are a long sequence of macro invocations. FUTURE_SUCCESS_TEST, FUTURE_FAIL_TEST. You don't get to actually see a Future being created and used. This framework makes it for you and you receive it as a lambda parameter. I'm finding the whole business hard to ramp up on.

Comment by Billy Donahue [ 28/Aug/20 ]

Code review to add a unit test to explore noexcept violation diagnostics.
https://mongodbcr.appspot.com/672150009/

Comment by Billy Donahue [ 28/Aug/20 ]

I'm trying to understand how adding noexcept to the lambdas in NetworkInterfaceTL might help diagnostics.

I'm looking at BF-18043 now, and I see setTimer in the stack trace, and setTimer was the function with the uassert. The reported stack did not start with the noexcept getAsync frame. It seems to start with the origin of the exception.

mongo::executor::NetworkInterfaceTL::CommandStateBase::setTimer() [clone .cold.1686]

We also get the exception itself printed out from our terminate handler, myTerminate.

NetworkInterfaceExceededTimeLimit: Remote command timed out while waiting to get a connection from the pool, took 20490ms, timeout was set to 20000ms\nActual exception type: mongo::error_details::ExceptionForImpl<(mongo::ErrorCodes::Error)202, mongo::ExceptionForCat<(mongo::ErrorCategory)1>, mongo::ExceptionForCat<(mongo::ErrorCategory)10> >\n"}

I mean the diagnostics for this noexcept violation seemed to work perfectly. There are cases we see where it doesn't work, and the exception doesn't get reported as active, so we have difficult debugs like BF-17935 (omg). But that doesn't seem to be what happened here.

What's the theory behind adding noexcept to NetworkInterfaceTL's getAsync lambda expressions as a means of improving debuggability?

My understanding of at least GCC's behavior is that there is a 2-phase exception handling. Searching up the stack for a catch handler is phase 1. Phase 2 actually does the destructors and unwinding. The search stops at any noexcept frame and phase 2 doesn't happen. The personality function just jumps straight to std::terminate. It doesn't matter if the noexcept is attached to getAsync or to the callback it's running. Interposed frames that aren't noexcept don't change the behavior as long as there's a noexcept somewhere on the stack.

Comment by Billy Donahue [ 27/Aug/20 ]

What does it mean to mark a function call as noexcept?

Generated at Thu Feb 08 05:19:50 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.