[CDRIVER-3823] Crash while calling pthread_join to shut down server monitor thread Created: 11/Dec/20  Updated: 28/Oct/23  Resolved: 08/Jan/21

Status: Closed
Project: C Driver
Component/s: None
Affects Version/s: None
Fix Version/s: 1.17.4

Type: Bug Priority: Major - P3
Reporter: Kaitlin Mahar Assignee: Samantha Ritter (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

I will fill this in more on Monday, just opening the ticket now before I forget. I now have a minimal repro in Swift of this but have not gotten a chance to translate it to equivalent C code.

The gist is that since upgrading to libmongoc 1.17.0 in the Swift driver we have been experiencing somewhat frequent failures of our transactions tests on macOS. Locally I am now finding that in particular this test fails for me about 90% of the time:
https://github.com/mongodb/specifications/blob/1aa2135d930a6230b7469c02377c33e121f9719e/source/transactions/tests/mongos-recovery-token.yml#L178

Specifically, when destroying the test MongoClient (which is backed by a client pool), there seems to be a crash while calling pthread_join to shut down the server monitor threads, and the Swift program exits with signal code 13.



 Comments   
Comment by Githook User [ 01/Feb/21 ]

Author:

{'name': 'Samantha Ritter', 'email': 'samantha.f.ritter@gmail.com', 'username': 'samantharitter'}

Message: CDRIVER-3823 ignore SIGPIPE on osx (#716)
Branch: r1.17
https://github.com/mongodb/mongo-c-driver/commit/3ebee773aa9d17e87caa63fe9fa27587b12bdc79

Comment by Samantha Ritter (Inactive) [ 08/Jan/21 ]

Fixed by https://github.com/mongodb/mongo-c-driver/commit/6f423a4044fd203af51cac1cac01588f34169610

Comment by Kaitlin Mahar [ 07/Jan/21 ]

Yes!

Thanks for the explanation, that sounds very reasonable and explains why we only see this on macOS tests. It is surprising that this hasn't come up before. I suppose it is highly sensitive to timing. I guess some confluence of factors involving the new server monitor changes, as well as our tweaking of the heartbeat frequency (we started doing that when we vendored in 1.17, in part to alleviate test slowdown introduced by CDRIVER-3801) has somehow created a situation where it is more prone to occur.

Comment by Samantha Ritter (Inactive) [ 07/Jan/21 ]

Code review: https://github.com/mongodb/mongo-c-driver/pull/716

Comment by Samantha Ritter (Inactive) [ 07/Jan/21 ]

Thank you for that clarification, kaitlin.mahar! So then we think the patch worked to fix this crash, right? That's great!

I did a little more research into SIGPIPE. I wanted to understand why we haven't seen this issue before.

SIGPIPE is emitted when a program tries to write to a socket whose "read" end has been closed. It's common practice for networking code to ignore SIGPIPE when send()ing, and handle the write error gracefully instead of having the signal crash their program.

The C driver socket code makes an effort to ignore SIGPIPE on POSIX systems. To do this, we use the MSG_NOSIGNAL flag. However, despite being standardized POSIX, MSG_NOSIGNAL is not defined on OSX (boo). So, it seems that we were never ignoring this signal on OSX! I am surprised that this is the first time this issue has surfaced.

There are a couple of different ways we can ignore SIGPIPE on OSX, one of which is in the patch you tested. We'll hash the specifics out in code review.

Comment by Samantha Ritter (Inactive) [ 04/Jan/21 ]

kaitlin.mahar would you be willing to apply this patch to your libmongoc build and see if you can still reproduce the error with it? I haven't been able to reproduce this crash, but I have a theory about what's causing it, and the patch tests that theory.

The patch is on this branch on my fork: https://github.com/samantharitter/mongo-c-driver/commit/32000d379ad5d8315c463f80bbf8a09dccde2f29
Evergreen patch, to make sure it doesn't break anything else https://evergreen.mongodb.com/version/5ff347aca4cf477c3fae6eec

Comment by Kaitlin Mahar [ 16/Dec/20 ]

To provide some detail on how I am reproducing this in Swift, on macOS 10.15:

1. Start up a sharded cluster, server version 4.4.1, using this mongo-orchestration config: https://github.com/mongodb-labs/drivers-evergreen-tools/blob/master/.evergreen/orchestration/configs/sharded_clusters/basic.json

2. Create a new mongoc_client_pool_t, let's call it setupClientPool, using connection string "mongodb://localhost:27017,localhost:27018", and setting heartbeatFrequencyMS to 50 via mongoc_uri_set_option_as_int32. Note: we patch in support to configure a lower-than-standard minHeartbeatFrequencyMS ourselves for the purpose of speeding up tests: https://github.com/mongodb/mongo-swift-driver/blob/master/etc/lower-minheartbeatfrequencyms.diff, as described in https://github.com/mongodb/specifications/blob/master/source/retryable-reads/tests/README.rst#speeding-up-tests

3. using write concern majority and a client from setupClientPool, drop the collection transactions-tests.test using mongoc_collection_write_command_with_opts . return client to pool

4. using write concern majority and a client from setupClientPool, recreate the collection transactions-tests.test using mongoc_database_create_collection. return client to pool

5. create a second mongo_client_pool_t, let's call it testClientPool, also connecting to "mongodb://localhost:27017,localhost:27018", with heartbeatFrequencyMS=30000.

6. check out a client client0 from testClientPool and start a session session0 via mongoc_client_start_session

7. start a transaction on session0 via mongoc_client_session_start_transaction

8. using client0 and session0, insert a document (empty except for its driver-generated _id) into transactions-test.test via a bulk write (note that this is just an implementation detail, Swift implements all write ops via libmongoc's bulk API. you could probably do any write here)

9. get the id of the mongos that session0 is now pinned to. using mongoc_database_command_with_opts and specifying that id as server_id in the options, set this failpoint:

{
        "configureFailPoint" : "failCommand",
        "mode" : { "times" : 7 },
        "data" : {
            "failCommands" : [ "commitTransaction", "isMaster" ],
            "closeConnection" : true
        }
}

10. Attempt to commit the transaction on session0 via mongoc_client_session_commit_transaction . because of the failpoint, this should fail with an error with codeName "NoSuchTransaction" and the error label "TransientTransactionError".

11. End the session, via mongoc_client_session_destroy

12. Attempt to destroy testClientPool. during a call to pthread_join, the process terminates: Process 56653 exited with status = 0 (0x00000000) Terminated due to signal 13

Here is a backtrace to the pthread_join call (I just inserted a breakpoint just before where I am seeing the crash):

* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x000000010002ecdc CrashRepro`_thread_join(thread=0x000070000b74d000) at common-thread.c:28:4
    frame #1: 0x000000010008f33f CrashRepro`mongoc_server_monitor_request_shutdown(server_monitor=0x0000000100f07f30) at mongoc-server-monitor.c:1202:7
    frame #2: 0x000000010008f3d5 CrashRepro`mongoc_server_monitor_wait_for_shutdown(server_monitor=0x0000000100f07f30) at mongoc-server-monitor.c:1228:8
    frame #3: 0x0000000100099fa3 CrashRepro`_mongoc_topology_background_monitoring_stop(topology=0x000000010200a400) at mongoc-topology-background-monitoring.c:318:7
    frame #4: 0x00000001000a17fb CrashRepro`mongoc_topology_destroy(topology=0x000000010200a400) at mongoc-topology.c:517:7
    frame #5: 0x0000000100038e7b CrashRepro`mongoc_client_pool_destroy(pool=0x0000000101807880) at mongoc-client-pool.c:199:4
    frame #6: 0x000000010017d691 CrashRepro`closure #1 in ConnectionPool.close(self=0x00000001018074e0) at ConnectionPool.swift:126:17
    frame #7: 0x000000010017e2a4 CrashRepro`partial apply for closure #1 in ConnectionPool.close() at <compiler-generated>:0
    frame #8: 0x000000010017d71f CrashRepro`thunk for @callee_guaranteed () -> (@error @owned Error) at <compiler-generated>:0
    frame #9: 0x000000010017e2c4 CrashRepro`partial apply for thunk for @callee_guaranteed () -> (@error @owned Error) at <compiler-generated>:0
    frame #10: 0x000000010017c1f1 CrashRepro`NSCondition.withLock<T>(body=0x000000010017e2b0 CrashRepro`partial apply forwarder for reabstraction thunk helper from @callee_guaranteed () -> (@error @owned Swift.Error) to @escaping @callee_guaranteed () -> (@out (), @error @owned Swift.Error) at <compiler-generated>, self=0x0000000101807760) at ConnectionPool.swift:34:20
    frame #11: 0x000000010017d097 CrashRepro`ConnectionPool.close(self=0x00000001018074e0) at ConnectionPool.swift:107:28
    frame #12: 0x00000001001a1dfc CrashRepro`MongoClient.syncClose(self=0x0000000101806960) at MongoClient.swift:361:33
    frame #13: 0x00000001002f4987 CrashRepro`MongoClient.deinit(self=0x0000000101805dd0) at MongoClient.swift:56:34
    frame #14: 0x00000001002f4e89 CrashRepro`MongoClient.__deallocating_deinit(self=0x0000000101805dd0) at MongoClient.swift:0
    frame #15: 0x00007fff703631f0 libswiftCore.dylib`_swift_release_dealloc + 16
    frame #16: 0x00000001000be460 CrashRepro`main at <compiler-generated>:0
    frame #17: 0x00007fff70a45cc9 libdyld.dylib`start + 1

to confirm the hypothesis that it is happening sometime during the call to pthread_join I updated the body of the calling method to:

   printf("about to call pthread_join\n");
   int ret;
   ret = pthread_join (thread, NULL);
   printf("done calling pthread_join, exit code was %d%n", ret);
   return ret;

The last thing that is printed is "about to call pthread_join".

Note line numbers may be slightly off due to slight mangling Swift does of libmongoc to get SwiftPM to build it. you can see our version here https://github.com/mongodb/mongo-swift-driver/tree/master/Sources/CLibMongoC

Also, here is the debug log output throughout this:

 

2020/12/16 18:50:34.0947: [ 6743]:    DEBUG:      monitor: [localhost:27018] command or network error occurred: Failed to read 4 bytes: socket error or timeout
2020/12/16 18:50:35.0001: [ 6743]:    DEBUG:      monitor: [localhost:27018] command or network error occurred: Failed to read 4 bytes: socket error or timeout
2020/12/16 18:50:35.0107: [ 6743]:  WARNING:       stream: Failed to buffer 4 bytes
about to call pthread_join
2020/12/16 18:50:35.0161: [ 6743]:    DEBUG:      monitor: [localhost:27018] command or network error occurred: connection closed while polling
2020/12/16 18:50:35.0162: [ 6743]:    DEBUG:      monitor: [localhost:27018] command or network error occurred: Failed to read 4 bytes: socket error or timeout
2020/12/16 18:50:35.0218: [ 6743]:    DEBUG:      monitor: [localhost:27018] command or network error occurred: Failed to read 4 bytes: socket error or timeout

 

 

Signal 13 apparently means broken pipe (http://people.cs.pitt.edu/~alanjawi/cs449/code/shell/UnixSignals.htm), and when I run this via the command line I get exit code 141 which apparently also has to do with broken pipes.

Let me know if there is any more info you need!

Generated at Wed Feb 07 21:19:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.