[SERVER-50434] exceptions violating noexcept can be invisible to terminate handler Created: 20/Aug/20  Updated: 02/Feb/24

Status: Backlog
Project: Core Server
Component/s: Internal Code
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Billy Donahue Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: 6.1-targeted, sa-remove-fv-backlog-22
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-70148 Document proper noexcept usage Open
Assigned Teams:
Build
Operating System: ALL
Participants:
Linked BF Score: 130
Story Points: 4

 Description   

When an exception propagates out of a noexcept function, std::terminate is called, which invokes the terminate handler. In that handler, we check std::current_exception and try to log a useful diagnostic about what exception was thrown. However, we see cases where we reach the terminate handler only to find that std::current_exception is unexpectedly empty.
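As a minimal sketch of the handler-side check described above (not the server's actual handler): the names describeActiveException, reportAndAbort, and installReportingTerminateHandler are hypothetical, and plain std::cerr stands in for the real logging system.

```cpp
#include <cstdlib>
#include <exception>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical helper: describes the exception that is currently being
// handled, or reports that none is active.
std::string describeActiveException() {
    std::exception_ptr active = std::current_exception();
    if (!active) {
        // This is the branch the ticket reports hitting unexpectedly.
        return "No exception is active";
    }
    try {
        std::rethrow_exception(active);
    } catch (const std::exception& ex) {
        return std::string("std::exception: ") + ex.what();
    } catch (...) {
        return "unknown exception type";
    }
}

// Hypothetical terminate handler: log whatever can be recovered, then abort.
[[noreturn]] void reportAndAbort() noexcept {
    std::cerr << "terminate() called. " << describeActiveException() << std::endl;
    std::abort();
}

// Installs the handler above; a process would call this once during startup.
void installReportingTerminateHandler() {
    std::set_terminate(reportAndAbort);
}
```

With a handler like this, the failure mode in this ticket would show up as the "No exception is active" branch being taken even though an exception was what triggered termination.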

( BF-17935 and others )

This has the potential to be a very big support and testing problem, as we'll just have crashes in noexcept functions with no hint as to the cause. I see a lot of effort in other tickets (e.g. SERVER-36434) toward adding noexcept explicitly to improve diagnostics, because the stack trace of a noexcept violation should better locate the problem. But in these cases noexcept will have the opposite effect and cause the exception not to be logged at all. All we got in the case of BF-17935 was a partially unwound stack trace from the noexcept function and no logged exception details.

 

Acceptance Criteria:

Write a crashing test that violates noexcept and compare it against ordinary uncaught exception propagation on all buildvariants.
Propose a solution or describe the rationale for why we can't fix it.
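A rough sketch of what the two scenarios in the first criterion could look like, outside the server's test framework. All names here (throwProbe, escapeThroughNoexcept, escapeOrdinary, runScenario) are hypothetical, and this is not the DEATH_TEST code mentioned in the comments.

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Throws an ordinary exception. Kept in its own function so that the
// noexcept function below does not contain a literal throw expression.
void throwProbe() {
    throw std::runtime_error("probe exception");
}

// Scenario "noexcept": the exception tries to escape a noexcept function,
// so std::terminate is called at the noexcept boundary.
void escapeThroughNoexcept() noexcept {
    throwProbe();
}

// Scenario "ordinary": the exception propagates normally. If no caller
// catches it, std::terminate is called for the uncaught exception.
void escapeOrdinary() {
    throwProbe();
}

// Runs the named scenario. Returns 2 only for an unknown name: the
// "noexcept" scenario always terminates, and the "ordinary" scenario
// always exits by throwing.
int runScenario(const std::string& name) {
    if (name == "noexcept") {
        escapeThroughNoexcept();
    } else if (name == "ordinary") {
        escapeOrdinary();
    }
    std::cerr << "unknown scenario: " << name << std::endl;
    return 2;
}
```

A small main that passes argv[1] to runScenario, with no try block around the call, would give one crashing binary per scenario. The comparison is then whether the terminate handler's output names the exception in both cases, for each compiler and optimization level.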



 Comments   
Comment by Steve Gross [ 08/Aug/23 ]

This will be addressed as part of the linked epic.

Comment by Jason Chan [ 27/Jul/23 ]

Assigning this to SDP for tracking purposes to make sure the next toolchain includes the patch for the gcc bug mentioned.

In the meantime, Service Arch will focus on SERVER-70148 to come up with guidelines on noexcept usage, in order to minimize how often this happens.

Comment by Alex Neben [ 27/Jul/23 ]

I am nervous to try and port this into the v4 toolchain.

This patch is a non-zero amount of work for our team since it does not apply cleanly. Our gcc is so old that it is missing cp-gimplify.cc, except.cc, and tree-eh.cc, which means additional patches would be required. Furthermore, this has ABI changes (I think), which also increases the risk. I would rather punt this to our next toolchain rollout. If this is included in gcc14, then we will get it in the next toolchain rollout. We can put this ticket in that epic if you would like to hand it off to us.

Comment by Billy Donahue [ 24/Jul/23 ]

GCC seems to have fixed the bug referenced above, slated for gcc v14.0.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97720

Perhaps toolchain could experiment with a cherry-pick of his patch.
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=2415024e0f81f8c09bf08f947c790b43de9d0bbc

Thx tyler.brock@mongodb.com for checking up on this.

Comment by Phoebe Du [ 06/Jul/23 ]

Phoebe will create a PM ticket for this work and blake.oler@mongodb.com will need to provide summary and motivation

The issue will be addressed in  https://jira.mongodb.org/browse/PM-3425

For better visibility, sending this ticket into backlog and linking this ticket to the project

 

Comment by Andrew Witten (Inactive) [ 30/Sep/22 ]

I just want to add that in BF-26507 we have run into a very similar situation as BF-17935.  We are calling the terminate handler with zero helpful log messages.  We are seeing the "no exception active" message due to the gcc bug.  

Our use of noexcept, combined with this gcc bug, is a big problem.

In most of the places where we use noexcept, we would be better off catching the exception, logging a message, and calling std::terminate.
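A minimal sketch of that catch, log, and terminate pattern, assuming a hypothetical wrapper named invokeOrTerminate and a hypothetical fatalMessage formatter, with std::cerr standing in for real logging:

```cpp
#include <exception>
#include <iostream>
#include <string>
#include <utility>

// Hypothetical formatter for the fatal log line.
std::string fatalMessage(const std::string& where, const std::string& what) {
    return where + ": fatal exception: " + what;
}

// Hypothetical wrapper: runs body, and if it throws, logs the exception
// explicitly before calling std::terminate, rather than relying on the
// terminate handler to recover the exception by itself.
template <typename F>
void invokeOrTerminate(const std::string& where, F&& body) noexcept {
    try {
        std::forward<F>(body)();
    } catch (const std::exception& ex) {
        std::cerr << fatalMessage(where, ex.what()) << std::endl;
        std::terminate();
    } catch (...) {
        std::cerr << fatalMessage(where, "unknown exception type") << std::endl;
        std::terminate();
    }
}
```

Because std::terminate is called from inside the catch block here, the exception is still the one being handled when the terminate handler runs. The trade-off is that the stack has already been unwound to the catch site, so a stack trace taken in the handler no longer points at the original throw.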

(I am assuming in all this that the reason why we are in the terminate handler is because we violated a noexcept specification.)

Comment by Billy Donahue [ 24/May/22 ]

Seems to confirm this gcc bug:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97720

(Includes a repro similar to the one we'd have written for the acceptance criteria above.)

Comment by Benjamin Caimano (Inactive) [ 28/Aug/20 ]

I think you may have simply chosen one of the few matrix entries that does it correctly, namely the Ubuntu 18.04 Debug variants which have either opt=off or CC=clang. For instance, this RHEL 6.2 patch shows "No exception is active".

Happily, I figured out my unittest failure, so I have a new run here. To my surprise, my set of DEATH_TESTs for this actually caught it in the cloud. The variants that pass util_test also give the correct failure in jscore.

Comment by Billy Donahue [ 28/Aug/20 ]

All the failed jsCore tests I'm looking at are dying with this:

https://logkeeper.mongodb.org/lobster/build/9dc7744ff3a25cda9546b551c0f56cd4/test/5f4833999041306474af8a80#bookmarks=0%2C42%2C65&l=1
[ This was taken from Ubuntu18.04 DEBUG. ]

[MongoDFixture:job1] 2020-08-27T22:28:43.187+0000 | 2020-08-27T22:28:43.187+00:00 F  CONTROL  4757800 [initandlisten] "Writing fatal message","attr":{"message":"DBException::toString(): InternalError: Violate before startup complete\nActual exception type: mongo::error_details::ExceptionForImpl<(mongo::ErrorCodes::Error)1, mongo::AssertionException>\n"}
[MongoDFixture:job1] 2020-08-27T22:28:43.673+0000 Waiting to connect to mongod on port 20250.

... Followed by an expected stacktrace.

All the jsCore failures seem to look like this. That is, they're finding and dumping the exception correctly from the terminate handler. Are there failures in the evergreen run that show a missing exception?

Maybe I misunderstood? Should I be looking at the unittest failures for the repro instead of the jsCore failures?

Comment by Benjamin Caimano (Inactive) [ 27/Aug/20 ]

Alright, I've been able to reproduce reliably in evergreen using this branch. (It also has some hacks in it to compile with C++20, just in case it was related to the language standard somehow.) Here is a test run with a good cross-section. The jscore failures are intentional (so as to show the terminate messages). The unittest failure is sadly unintentional; apparently there is some strange magic in the template-requires for comparison operators.

Generated at Thu Feb 08 05:22:39 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.