[CDRIVER-1233] Crash after Kerberos plugin cleanup runs twice Created: 16/May/16 Updated: 03/May/17 Resolved: 16/Jun/16 |
|
| Status: | Closed |
| Project: | C Driver |
| Component/s: | None |
| Affects Version/s: | 1.3.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor - P4 |
| Reporter: | David Golden | Assignee: | A. Jesse Jiryu Davis |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Mac OSX 10.11.4, Apple LLVM version 7.3.0 (clang-703.0.31), Mac Ports cyrus-sasl2 @2.1.26_6+kerberos, Mac Ports kerberos5 @1.14.2 (libkrb5.3.3) |
||
| Issue Links: |
|
||||||||||||||||||||||||
| Description |
|
Summary In some configurations, the following sequence leads to the MIT Kerberos5 library being destroyed twice during process exit:
The solution for C programs is to configure the driver with ./configure --enable-automatic-init-and-cleanup=no and ensure they call mongoc_init and mongoc_cleanup explicitly. In version 2.0, the driver will remove the automatic init and cleanup feature. Original report: I found an assertion in kerberos originating in the C driver while testing the CXX driver master branch (commit 8dc0b5b). Backtrace follows:
During libmongoc compilation, I did notice numerous warnings of the type: clang: warning: argument unused during compilation: '-pthread' UPDATE: Moved libmongoc configuration output from "environment":
There's this warning that libraries should not call sasl_done:
|
| Comments |
| Comment by A. Jesse Jiryu Davis [ 23/Dec/16 ] | |||||||||||||||||||||||||||||||||||||||
|
Thanks, I've updated the top of this ticket description with the current workaround for the sasl_client_done crash, and fixed the spelling error on the master branch. | |||||||||||||||||||||||||||||||||||||||
| Comment by Gustaf Neumann [ 23/Dec/16 ] | |||||||||||||||||||||||||||||||||||||||
|
Thanks a lot, works perfectly, configuring now with "--disable-automatic-init-and-cleanup". Many thanks for the quick help, and have a nice weekend and holidays! PS: Small nitpicking: on http://mongoc.org/libmongoc/1.4.0/mongoc_init.html, the word "primatives" should probably read as "primitives". | |||||||||||||||||||||||||||||||||||||||
| Comment by A. Jesse Jiryu Davis [ 23/Dec/16 ] | |||||||||||||||||||||||||||||||||||||||
|
Thanks Gustaf, we closed this issue, even though we still call sasl_client_done, because we can't fully fix it until C Driver 2.0. In the current C Driver, you can avoid this crash by building the driver with:
Make sure you explicitly call mongoc_init and mongoc_cleanup in your code. When we release C Driver 2.0, semantic versioning will allow us to make incompatible changes to our API, and that includes removing the automatic init and cleanup feature entirely (CDRIVER-1330), and then this crash will really be fixed. | |||||||||||||||||||||||||||||||||||||||
| Comment by Gustaf Neumann [ 23/Dec/16 ] | |||||||||||||||||||||||||||||||||||||||
|
I see still this issue with the 1.5.1 release of the c-driver (testing under Mac OS X 10.11.6 using the default C compiler). The default configuration of libbson 1.5.1 and libmongoc runs on my machine into exact the same crash as reported above during libsystem_c.dylib`exit . This happens with the sasl2/gssapi/krb5 libraries from mac ports installed (cyrus-sasl2 @2.1.26_6).
commenting out sasl_client_done() in src/mongoc/mongoc-init.c fixes the issue - as well as configuring without sasl via ./configure --disable-sasl. The summary block ends with
but the c-driver still contains this call, and the issue is closed. Did this problem somehow re-appear? If there is interest, i can provide more detailed backtraces. | |||||||||||||||||||||||||||||||||||||||
| Comment by David Golden [ 11/Jul/16 ] | |||||||||||||||||||||||||||||||||||||||
|
I did not, sorry. | |||||||||||||||||||||||||||||||||||||||
| Comment by Jeremy Mikola [ 11/Jul/16 ] | |||||||||||||||||||||||||||||||||||||||
david.golden: Did you also research sasl_done(), which libmongoc falls back to using if sasl_client_done() is not available? | |||||||||||||||||||||||||||||||||||||||
| Comment by Jeremy Mikola [ 08/Jul/16 ] | |||||||||||||||||||||||||||||||||||||||
bjori: I just came across this issue, but this may be related to some segfaults users of the legacy and new PHP drivers have been experiencing. See:
| |||||||||||||||||||||||||||||||||||||||
| Comment by Andrew Morrow (Inactive) [ 08/Jun/16 ] | |||||||||||||||||||||||||||||||||||||||
|
The C++11 driver will be removing instance::current | |||||||||||||||||||||||||||||||||||||||
| Comment by David Golden [ 07/Jun/16 ] | |||||||||||||||||||||||||||||||||||||||
|
If the C++11 driver removes instance::current(), then the instance will clean up when it goes out of scope in main, which will happen before global destruction starts. Recall that commenting out this line – lexical scope instance lifetime – triggers the crash with auto-destructors enabled. When commented out – i.e no mongocxx::instance ever created – libmongoc's auto-constructor initializes SASL, and then libmongoc's auto-destructor races with libkrb5's auto-destructor for the crash with mongocxx totally uninvolved. However in the original – with the lexical instance created – instance destruction at the end of main makes libmongoc call its clean up first, avoiding the bad race outcome. The problem with instance::current() is that it delays destruction until too late (after libkrb5 destructs). So elimination of instance::current should fix the current problem for mongocxx and shouldn't need a breaking change in 1.4.0. (I still recommend eliminating auto ctor/dtor in 2.0.0.) | |||||||||||||||||||||||||||||||||||||||
| Comment by A. Jesse Jiryu Davis [ 07/Jun/16 ] | |||||||||||||||||||||||||||||||||||||||
|
So will it be enough for C++ users, for now, to compile the C Driver with auto-destroy turned off (MONGOC_NO_AUTOMATIC_GLOBALS) and to manually cleanup the C++ driver's "instance" before main() exits? What's the recommended cleanup syntax for the C++ driver look like? Darn, I wish we'd configured the C Driver's Debian package with auto-destroy turned off from the start. I believe that, starting in 1.4.0, we should turn it off in the Debian package build, even though that's a surprising change in a minor release. | |||||||||||||||||||||||||||||||||||||||
| Comment by Andrew Morrow (Inactive) [ 07/Jun/16 ] | |||||||||||||||||||||||||||||||||||||||
|
Agreed, instance::current was a badly thought through idea and should be removed. The intention has always been (mod instance::current), that users were expected to explicitly manage the lifetime of the instance object which proxied for all global state in the driver. | |||||||||||||||||||||||||||||||||||||||
| Comment by David Golden [ 06/Jun/16 ] | |||||||||||||||||||||||||||||||||||||||
|
I've done some further research. sasl_client_init/sasl_client_done do reference counting (since this commit), so calling them as a matched set should not be a problem and we don't need to worry about screwing up other SASL users by calling sasl_client_done. The original problem is the race condition between auto-destructors. If krb5 destructs first, then things go awry. Not calling sasl_client_done is a hack that "solves" the problem at the cost of leaking resources. I think a sufficient "solution" for 1.x could be to document that auto-destructors are not recommended and to explain why. That leaves the C++11 driver in a bit of a pickle, as relying on instance::current() to initialize a static variable destroys it too late. See | |||||||||||||||||||||||||||||||||||||||
| Comment by Hannes Magnusson [ 06/Jun/16 ] | |||||||||||||||||||||||||||||||||||||||
|
There is nothing to init or cleanup in (currently) other implemented tls or crypto libraries. I also disagree on the symmetry argument. After all, again, that got us into the trouble we are in and is what we are trying to fix. | |||||||||||||||||||||||||||||||||||||||
| Comment by David Golden [ 06/Jun/16 ] | |||||||||||||||||||||||||||||||||||||||
|
Assuming we only ever have one implementation at a time, I'm very tempted to keep it agnostic:
The fact that we have this at all is a design flaw forced on us by problematic libraries. Therefore this is the "power user access hatch" and all we're saying is "if our init/cleanup of <implementation> is a problem for you, here's how you can switch it off". I also wonder whether we should design for and insist on symmetry. Either the users initializes and cleans up, or we do, but not only one side of it. I.e. only take flags for mongoc_init_with_opts, remember them, and at cleanup do only the matching operations. | |||||||||||||||||||||||||||||||||||||||
| Comment by Hannes Magnusson [ 06/Jun/16 ] | |||||||||||||||||||||||||||||||||||||||
|
As long as we don't remove flags, just ignore unknown values, and by default init/cleanup everything we use.. I guess its best we can do. Keep in mind we don't do OpenSSL init/cleanup routines on Windows/Darwin, but the MONGOC_INIT_OPENSSL still needs to be available, as noop.
Probably the best we can do | |||||||||||||||||||||||||||||||||||||||
| Comment by A. Jesse Jiryu Davis [ 06/Jun/16 ] | |||||||||||||||||||||||||||||||||||||||
|
Yeah, you've convinced me we need to pass the struct into mongoc_init_with_opts(), too, since some libraries' initialization won't be idempotent, either. I agree it's fragile. The other option is not to destroy any libraries at all and shut down with leaks. acm convinced me that this is a bad idea, because then no memory checkers will work. Is there a third option that would be less fragile, bjori? I think that giving the application full control over setup / destruction is our only option, even though it requires the user to know a lot about our implementation. | |||||||||||||||||||||||||||||||||||||||
| Comment by Hannes Magnusson [ 06/Jun/16 ] | |||||||||||||||||||||||||||||||||||||||
|
That doesn't resolve the problem of double constructing something. I think this is fragile approach though. What happens when we remove sasl? Why should the user care at all that we currently use cyrus-sasl? | |||||||||||||||||||||||||||||||||||||||
| Comment by A. Jesse Jiryu Davis [ 06/Jun/16 ] | |||||||||||||||||||||||||||||||||||||||
|
New approach. After discussing with David and Andrew, it seems we should:
| |||||||||||||||||||||||||||||||||||||||
| Comment by Hannes Magnusson [ 27/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
Could this be related to | |||||||||||||||||||||||||||||||||||||||
| Comment by David Golden [ 27/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
I also found this warning that libraries should not call sasl_done:
| |||||||||||||||||||||||||||||||||||||||
| Comment by David Golden [ 27/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
jesse, that hypothesis seems right. I ran a debugging session with a breakpoint on gss_krb5int_lib_fini and it was called twice. The krb5 auto-destructor is called during program termination. Then the mongoc auto-destructor runs and calls sasl_client_done, which unloads krb5, which calls the krb5 auto-destructor again. First backtrace:
Second backtrace (with assertion fail):
Is the libc auto destructor really necessary? If the program is terminating, what needs to be cleaned up? | |||||||||||||||||||||||||||||||||||||||
| Comment by A. Jesse Jiryu Davis [ 27/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
I can repro with the same setup: MacPorts cyrus-sasl2 @2.1.26_6+kerberos and kerberos5 @1.14.2_0, running the example-client program succeeds, but if I remove the mongoc_cleanup at the end and allow auto-destruction, it dies with the assertion on line 384. kerberos5 also auto-destructs. If you properly call mongoc_cleanup in a C program, then it runs and calls sasl_client_done, which explicitly destroys kerberos5 and disables its auto-destructor. I hypothesize that if you don't call mongoc_cleanup, it auto-destructs after kerberos5, but incorrectly re-enters kerberos5's destructor gss_krb5int_lib_fini anyway. | |||||||||||||||||||||||||||||||||||||||
| Comment by David Golden [ 27/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
More research:
Presumably, the krb5 destructor runs at program termination even when libmongoc's destructors are disabled, so it seems likely that something about the krb5 destructor being called from within libmongoc's destructor is the issue. | |||||||||||||||||||||||||||||||||||||||
| Comment by Hannes Magnusson [ 27/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
> I narrowed it down and found that commenting out this line is sufficient to get test_instance to fail the same assertion as test_driver (and the test fails, as expected). as acm said, the mongoc_init()mongoc_cleanup() routines are called when constructing/destructing the instance object, so its not surprising that commenting out those calls feel like they change something – the fact is, they are theoretically NOOP when you compile with --enable-automatic-init-and-cleanup as they are called as soon as the module is dlopen()ed and dlclose()ed. This correct fix for this is to compile with --enable-automatic-init-and-cleanup=no. The workaround... idk. I can only think of ridiculous static variable "have_i_been_called" workarounds. | |||||||||||||||||||||||||||||||||||||||
| Comment by David Golden [ 27/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
Or I'll test it tonight. With enable-automatic-init-and-cleanup=no, and calling mongoc_cleanup() twice here, the crash does not come back. | |||||||||||||||||||||||||||||||||||||||
| Comment by David Golden [ 27/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
Other weird findings (with init/cleanup enabled):
I narrowed it down and found that commenting out this line is sufficient to get test_instance to fail the same assertion as test_driver (and the test fails, as expected). jesse, I'll test your scenario tomorrow. | |||||||||||||||||||||||||||||||||||||||
| Comment by A. Jesse Jiryu Davis [ 27/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
If you call mongoc_cleanup() twice now with --enable-automatic-init-and-cleanup=no does the crash come back? | |||||||||||||||||||||||||||||||||||||||
| Comment by David Golden [ 27/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
With --enable-automatic-init-and-cleanup=no and the MacPorts cyrus-sasl2 and kerberos5, the assertion failure does not occur. | |||||||||||||||||||||||||||||||||||||||
| Comment by Hannes Magnusson [ 26/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
Please try compiling mongoc with --enable-automatic-init-and-cleanup=no (and enabling cyrus/kerberos again) | |||||||||||||||||||||||||||||||||||||||
| Comment by David Golden [ 26/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
BTW, when I disable cyrus-sasl2 and kerberos5 in MacPorts, the assertion failure goes away. | |||||||||||||||||||||||||||||||||||||||
| Comment by David Golden [ 26/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
From the config output I posted:
Re replication... This was on my MacBook Pro. Compiler is from Xcode:
Other build tools are installed via MacPorts. PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
I had previously installed the MacPorts cyrus-sasl2 library, which pulled in kerberos5:
My libmongoc configuration/installation incantation was:
Then, from the C++ master branch:
| |||||||||||||||||||||||||||||||||||||||
| Comment by Hannes Magnusson [ 26/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
It should be. But history has uncovered problems with the sasl init/cleanup routines when called multiple times. This is known to happen when, for example, two libraries (say, libmongoc and libsamba) both assume exclusive access to sasl, and use different memory managers. Plus the whole magic ctor/dtor issue. | |||||||||||||||||||||||||||||||||||||||
| Comment by A. Jesse Jiryu Davis [ 26/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
It is, of course, using whatever idempotency feature the build system detects in your compiler: https://github.com/mongodb/mongo-c-driver/blob/master/src/mongoc/mongoc-init.c#L127 | |||||||||||||||||||||||||||||||||||||||
| Comment by Andrew Morrow (Inactive) [ 26/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
As to exactly once... how can I know? The C driver may-or-may-not invoke its cleanup handler automatically, but there is no programatic way for me to know. So, I do call mongoc_init and mongoc_cleanup explicitly. I'd really hope that mongoc_cleanup is idempotent. | |||||||||||||||||||||||||||||||||||||||
| Comment by Andrew Morrow (Inactive) [ 26/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
The C++ driver calls those when you construct/destroy an 'instance' object. | |||||||||||||||||||||||||||||||||||||||
| Comment by Hannes Magnusson [ 26/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
Since acm mention it, you are calling mongoc_init() at the start of the application, and mongoc_cleanup() in the end, and exactly only once? I find it weird this is being triggered from runAllStaticTerminators | |||||||||||||||||||||||||||||||||||||||
| Comment by Hannes Magnusson [ 26/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
Do you have debug symbols for libkrb5support? I'd like to know what its asserting. We see these sort of issues in sasl cleanup routines when the cleanup happens out of order, such as allocators changes or module dependencies | |||||||||||||||||||||||||||||||||||||||
| Comment by Andrew Morrow (Inactive) [ 26/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
The C++ driver 100% does not call fork | |||||||||||||||||||||||||||||||||||||||
| Comment by David Golden [ 26/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
It happens every time. Note that tests pass and so I think the error is happening during global destruction (stack track shows _mongoc_init_dtor). I'm not deeply familiar with the tests, but grepping the test source doesn't show 'SASL' or 'fork'. acm, can you confirm? | |||||||||||||||||||||||||||||||||||||||
| Comment by Andrew Morrow (Inactive) [ 26/May/16 ] | |||||||||||||||||||||||||||||||||||||||
|
The pthread warnings are harmless (server build with clang has them too) and is a strange problem where clang (unlike gcc) only requires the -pthread argument at compile time on OS X and not at link time (because there is no libpthread on OS X), but seems to think that providing it is worth a diagnostic. The fix for this, should you choose to bother with it, is to only add the -pthread argument to your link flags if you aren't on OS X. Unfortunately, if your build system auto-propagates CFLAGS to the link stage as is more or less correct, then you will always get this warning. | |||||||||||||||||||||||||||||||||||||||
| Comment by A. Jesse Jiryu Davis [ 26/May/16 ] | |||||||||||||||||||||||||||||||||||||||
Those pthread warnings have been around in clang builds for as long as I can remember, I don't think they're related. |