[CDRIVER-3823] Crash while calling pthread_join to shut down server monitor thread Created: 11/Dec/20 Updated: 28/Oct/23 Resolved: 08/Jan/21 |
|
| Status: | Closed |
| Project: | C Driver |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 1.17.4 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kaitlin Mahar | Assignee: | Samantha Ritter (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
I will fill this in more on Monday, just opening the ticket now before I forget. I now have a minimal repro in Swift of this but have not gotten a chance to translate it to equivalent C code. The gist is that since upgrading to libmongoc 1.17.0 in the Swift driver we have been experiencing somewhat frequent failures of our transactions tests on macOS. Locally I am now finding that in particular this test fails for me about 90% of the time: Specifically, when destroying the test MongoClient (which is backed by a client pool), there seems to be a crash while calling pthread_join to shut down the server monitor threads, and the Swift program exits with signal code 13. |
| Comments |
| Comment by Githook User [ 01/Feb/21 ] | ||||||||||||||||||||||||||||||||||||||||
|
Author: {'name': 'Samantha Ritter', 'email': 'samantha.f.ritter@gmail.com', 'username': 'samantharitter'}Message: | ||||||||||||||||||||||||||||||||||||||||
| Comment by Samantha Ritter (Inactive) [ 08/Jan/21 ] | ||||||||||||||||||||||||||||||||||||||||
|
Fixed by https://github.com/mongodb/mongo-c-driver/commit/6f423a4044fd203af51cac1cac01588f34169610 | ||||||||||||||||||||||||||||||||||||||||
| Comment by Kaitlin Mahar [ 07/Jan/21 ] | ||||||||||||||||||||||||||||||||||||||||
|
Yes! Thanks for the explanation, that sounds very reasonable and explains why we only see this on macOS tests. It is surprising that this hasn't come up before. I suppose it is highly sensitive to timing. I guess some confluence of factors involving the new server monitor changes, as well as our tweaking of the heartbeat frequency (we started doing that when we vendored in 1.17, in part to alleviate test slowdown introduced by | ||||||||||||||||||||||||||||||||||||||||
| Comment by Samantha Ritter (Inactive) [ 07/Jan/21 ] | ||||||||||||||||||||||||||||||||||||||||
|
Code review: https://github.com/mongodb/mongo-c-driver/pull/716 | ||||||||||||||||||||||||||||||||||||||||
| Comment by Samantha Ritter (Inactive) [ 07/Jan/21 ] | ||||||||||||||||||||||||||||||||||||||||
|
Thank you for that clarification, kaitlin.mahar! So then we think the patch worked to fix this crash, right? That's great! I did a little more research into SIGPIPE. I wanted to understand why we haven't seen this issue before. SIGPIPE is emitted when a program tries to write to a socket whose "read" end has been closed. It's common practice for networking code to ignore SIGPIPE when send()ing, and handle the write error gracefully instead of having the signal crash their program. The C driver socket code makes an effort to ignore SIGPIPE on POSIX systems. To do this, we use the MSG_NOSIGNAL flag. However, despite being standardized POSIX, MSG_NOSIGNAL is not defined on OSX (boo). So, it seems that we were never ignoring this signal on OSX! I am surprised that this is the first time this issue has surfaced. There are a couple of different ways we can ignore SIGPIPE on OSX, one of which is in the patch you tested. We'll hash the specifics out in code review. | ||||||||||||||||||||||||||||||||||||||||
| Comment by Samantha Ritter (Inactive) [ 04/Jan/21 ] | ||||||||||||||||||||||||||||||||||||||||
|
kaitlin.mahar would you be willing to apply this patch to your libmongoc build and see if you can still reproduce the error with it? I haven't been able to reproduce this crash, but I have a theory about what's causing it, and the patch tests that theory. The patch is on this branch on my fork: https://github.com/samantharitter/mongo-c-driver/commit/32000d379ad5d8315c463f80bbf8a09dccde2f29 | ||||||||||||||||||||||||||||||||||||||||
| Comment by Kaitlin Mahar [ 16/Dec/20 ] | ||||||||||||||||||||||||||||||||||||||||
|
To provide some detail on how I am reproducing this in Swift, on macOS 10.15: 1. Start up a sharded cluster, server version 4.4.1, using this mongo-orchestration config: https://github.com/mongodb-labs/drivers-evergreen-tools/blob/master/.evergreen/orchestration/configs/sharded_clusters/basic.json 2. Create a new mongoc_client_pool_t, let's call it setupClientPool, using connection string "mongodb://localhost:27017,localhost:27018", and setting heartbeatFrequencyMS to 50 via mongoc_uri_set_option_as_int32. Note: we patch in support to configure a lower-than-standard minHeartbeatFrequencyMS ourselves for the purpose of speeding up tests: https://github.com/mongodb/mongo-swift-driver/blob/master/etc/lower-minheartbeatfrequencyms.diff, as described in https://github.com/mongodb/specifications/blob/master/source/retryable-reads/tests/README.rst#speeding-up-tests 3. using write concern majority and a client from setupClientPool, drop the collection transactions-tests.test using mongoc_collection_write_command_with_opts . return client to pool 4. using write concern majority and a client from setupClientPool, recreate the collection transactions-tests.test using mongoc_database_create_collection. return client to pool 5. create a second mongo_client_pool_t, let's call it testClientPool, also connecting to "mongodb://localhost:27017,localhost:27018", with heartbeatFrequencyMS=30000. 6. check out a client client0 from testClientPool and start a session session0 via mongoc_client_start_session 7. start a transaction on session0 via mongoc_client_session_start_transaction 8. using client0 and session0, insert a document (empty except for its driver-generated _id) into transactions-test.test via a bulk write (note that this is just an implementation detail, Swift implements all write ops via libmongoc's bulk API. you could probably do any write here) 9. get the id of the mongos that session0 is now pinned to. using mongoc_database_command_with_opts and specifying that id as server_id in the options, set this failpoint:
10. Attempt to commit the transaction on session0 via mongoc_client_session_commit_transaction . because of the failpoint, this should fail with an error with codeName "NoSuchTransaction" and the error label "TransientTransactionError". 11. End the session, via mongoc_client_session_destroy 12. Attempt to destroy testClientPool. during a call to pthread_join, the process terminates: Process 56653 exited with status = 0 (0x00000000) Terminated due to signal 13 Here is a backtrace to the pthread_join call (I just inserted a breakpoint just before where I am seeing the crash):
to confirm the hypothesis that it is happening sometime during the call to pthread_join I updated the body of the calling method to:
The last thing that is printed is "about to call pthread_join". Note line numbers may be slightly off due to slight mangling Swift does of libmongoc to get SwiftPM to build it. you can see our version here https://github.com/mongodb/mongo-swift-driver/tree/master/Sources/CLibMongoC Also, here is the debug log output throughout this:
Signal 13 apparently means broken pipe (http://people.cs.pitt.edu/~alanjawi/cs449/code/shell/UnixSignals.htm), and when I run this via the command line I get exit code 141 which apparently also has to do with broken pipes. Let me know if there is any more info you need! |