[CDRIVER-4666] Deadlock due to recursive lock of non-recursive mutex in topology Created: 22/Jun/23  Updated: 28/Oct/23  Resolved: 03/Aug/23

Status: Closed
Project: C Driver
Component/s: libmongoc
Affects Version/s: 1.20.0
Fix Version/s: 1.24.3

Type: Bug Priority: Unknown
Reporter: Norbert Heußer Assignee: Kevin Albertson
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by PHPC-2265 Possible deadlock when querying durin... Closed

 Description   

Summary

We observed a deadlock due to a recursive lock on a non-recursive mutex inside the mongoc-topology.c. This stuck the whole application. In the application we try to insert documents into a mongodb. The problem was discovered in an automated test intermediately start/stopping the mongodb.Backtrace further below.

Environment

mongo-c-driver 1.23.4

Build from source with: 

-DENABLE_AUTOMATIC_INIT_AND_CLEANUP=OFF -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_TESTS=Off 

mongo-cxx-driver 3.7.1

Build from source with:

-DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=/usr/local -DCMAKE_CXX_STANDARD=17 -DENABLE_TESTS=Off ..

The test system running our application is docker container based on alpine 3.18. The host running the docker containers is a server with a x86_64 CPU running a Ubuntu 22.04. 

The standalone mongodb was started as a separate docker container on the same host.

How to Reproduce

Try to insert documents into a collection of the mongodb in a loop using the mongo-cxx colletion.insert_one method. In case of an exception we sleep for some time and retry to insert the document again.

During execution of this tests we manually stopped/restarted an independent docker container with the standalone mongodb instance used by the test. Connection between application and host was performed by an docker internal bridge network.

It's important to use a linux distribution based on the libmusl (like the alpine distribution) instead of the glibc. The libmusl pthread_mutex_lock always does deadlock on a default initialized mutex in contrast to other implementations.

Additional Background

This was the backtrace we were able to extract from the stuck process:

(gdb) bt
#0  __syscall_cp_c (nr=202, u=139930834176668, v=128, w=-2147483632, x=0, y=0
    z=0) at ./arch/x86_64/syscall_arch.h:61
#1  0x00007f443082f702 in __futex4_cp (to=0x0, val=-2147483632, op=128
    addr=0x7f442faa0a9c) at src/thread/__timedwait.c:24
#2  __timedwait_cp (addr=addr@entry=0x7f442faa0a9c, val=val@entry=-2147483632
    clk=clk@entry=0, at=at@entry=0x0, priv=priv@entry=128)
    at src/thread/__timedwait.c:52
#3  0x00007f443082f7a7 in __timedwait (addr=addr@entry=0x7f442faa0a9c
    val=-2147483632, clk=clk@entry=0, at=at@entry=0x0, priv=priv@entry=128)
    at src/thread/__timedwait.c:68
#4  0x00007f44308319ea in __pthread_mutex_timedlock (m=0x7f442faa0a98, at=0x0)
    at src/thread/pthread_mutex_timedlock.c:85
#5  0x00007f442fd6fabe in mc_tpld_modify_begin (tpl=0x7f442faa09d0)
    at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology.c:2014
#6  0x00007f442fd6c4d5 in _mongoc_topology_scanner_setup_err_cb (id=1
    data=0x7f442faa09d0, error=0x7f442faaca80)
    at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology.c:133
#7  0x00007f442fd7754e in mongoc_topology_scanner_node_setup (
    node=0x7f442faac800, error=0x7f442faaca80)
    at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology-scanner.c:1135
--Type <RET> for more, q to quit, c to continue without paging--
#8  0x00007f442fd762a3 in mongoc_topology_scanner_scan (ts=0x7f442fa9c080
    id=1)
    at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology-scanner.c:566
#9  0x00007f442fd6c62f in _mongoc_topology_scanner_cb (id=1
    hello_response=0x0, rtt_msec=0, data=0x7f442faa09d0, error=0x7f442f43a0c0)
    at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology.c:192
#10 0x00007f442fd76b2a in _async_error_or_timeout (acmd=0x7f442f43a080
    duration_usec=11, default_err_msg=0x7f442fdac138 "connection error")
    at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology-scanner.c:806
#11 0x00007f442fd76c59 in _async_handler (acmd=0x7f442f43a080
    async_status=MONGOC_ASYNC_CMD_ERROR, hello_response=0x0, duration_usec=11)
    at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology-scanner.c:845
#12 0x00007f442fd0848f in mongoc_async_cmd_run (acmd=0x7f442f43a080)
    at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-async-cmd.c:129
#13 0x00007f442fd0810b in mongoc_async_run (async=0x7f442faba9e0)
    at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-async.c:160
#14 0x00007f442fd7784a in mongoc_topology_scanner_work (ts=0x7f442fa9c080)
    at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology-scann--Type <RET> for more, q to quit, c to continue without paging--
er.c:1306
#15 0x00007f442fd6da53 in mongoc_topology_scan_once (topology=0x7f442faa09d0
    obey_cooldown=true)
    at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology.c:906
#16 0x00007f442fd6daa4 in _mongoc_topology_do_blocking_scan (
    topology=0x7f442faa09d0, error=0x7f442f80b9a0)
    at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology.c:931
#17 0x00007f442fd6e40d in mongoc_topology_select_server_id (
    topology=0x7f442faa09d0, optype=MONGOC_SS_WRITE, read_prefs=0x0
    must_use_primary=0x7f442f80bc43, error=0x7f442f80bef0)
    at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology.c:1234
#18 0x00007f442fd1d3ce in _mongoc_cluster_select_server_id (cs=0x0
    topology=0x7f442faa09d0, optype=MONGOC_SS_WRITE, read_prefs=0x0
    must_use_primary=0x7f442f80bc43, error=0x7f442f80bef0)
    at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-cluster.c:2757
#19 0x00007f442fd1d4b3 in _mongoc_cluster_stream_for_optype (
    cluster=0x7f442faa0bd8, optype=MONGOC_SS_WRITE, read_prefs=0x0, cs=0x0
    reply=0x7f442f80bd80, error=0x7f442f80bef0)
    at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-cluster.c:2805--Type <RET> for more, q to quit, c to continue without paging--
 
 
#20 0x00007f442fd1d659 in mongoc_cluster_stream_for_writes (
    cluster=0x7f442faa0bd8, cs=0x0, reply=0x7f442f80bd80, error=0x7f442f80bef0)
    at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-cluster.c:2864
#21 0x00007f442fd0c6b3 in mongoc_bulk_operation_execute (bulk=0x7f443024b780
    reply=0x7f442f80bd80, error=0x7f442f80bef0)
    at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-bulk-operation.c:820
#22 0x00007f44301128a9 in mongocxx::v_noabi::bulk_write::execute (
    this=0x7f442f80c230)
    at /tmp/mongo-cxx-driver-r3.7.1/src/mongocxx/bulk_write.cpp:183
#23 0x00007f443012c418 in mongocxx::v_noabi::collection::_insert_one (
    this=0x7f442f80c630, session=0x0, document=..., options=...)
    at /tmp/mongo-cxx-driver-r3.7.1/src/mongocxx/collection.cpp:546
#24 0x00007f443012c744 in mongocxx::v_noabi::collection::insert_one (
    this=0x7f442f80c630, document=..., options=...)
    at /tmp/mongo-cxx-driver-r3.7.1/src/mongocxx/collection.cpp:557 

The interesting parts are #9 together with #6. In function _mongoc_topology_scanner_cb (the one from #9) in line 170  a topology modification is started with mc_tpld_modify_begin. This will be commited in line 200. But in the error handling the code will initiate a new mongoc_topology_scanner_scan from line 191. This will lead to the #6 code fragment, which will try to start a new topology modification as well. 

To ensure both code fragments work on the same data structure I additionally checked the topology pointers in both locations. And they are working on the same object.

(gdb) up 6
#6  0x00007f442fd6c4d5 in _mongoc_topology_scanner_setup_err_cb (id=1
    data=0x7f442faa09d0, error=0x7f442faaca80)
    at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology.c:133
133	/tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology.c: No such file or directory.
(gdb) p topology
$8 = (mongoc_topology_t *) 0x7f442faa09d0
(gdb) up 3
#9  0x00007f442fd6c62f in _mongoc_topology_scanner_cb (id=1
    hello_response=0x0, rtt_msec=0, data=0x7f442faa09d0, error=0x7f442f43a0c0)
    at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology.c:192
192	in /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology.c
(gdb) p topology
$9 = (mongoc_topology_t * const) 0x7f442faa09d0 



 Comments   
Comment by Kevin Albertson [ 03/Aug/23 ]

Thank you for the detailed report norbert.heusser@cedalo.com! This is fixed in development, and scheduled for the 1.24.3 patch release planned next week.

Comment by Githook User [ 03/Aug/23 ]

Author:

{'name': 'Kevin Albertson', 'email': 'kevin.albertson@mongodb.com', 'username': 'kevinAlbs'}

Message: CDRIVER-4666 remove locks during single-threaded scanning (#1367)

  • assert single_threaded
  • remove locks during single-threaded scanning

Fixes a possible recursive mutex lock in some error scenarios. Refer: CDRIVER-4666
Branch: r1.24
https://github.com/mongodb/mongo-c-driver/commit/368214594a59782436d03ec7bd26e4fa440b58a1

Comment by Githook User [ 03/Aug/23 ]

Author:

{'name': 'Kevin Albertson', 'email': 'kevin.albertson@mongodb.com', 'username': 'kevinAlbs'}

Message: CDRIVER-4666 remove locks during single-threaded scanning (#1367)

  • assert single_threaded
  • remove locks during single-threaded scanning

Fixes a possible recursive mutex lock in some error scenarios. Refer: CDRIVER-4666
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/29e0dc038b9264c368810a653efe308df9d90712

Comment by Norbert Heußer [ 04/Jul/23 ]

Do you need any additional information or anything else ?

Comment by PM Bot [ 22/Jun/23 ]

Hi norbert.heusser@cedalo.com, thank you for reporting this issue! The team will look into it and get back to you soon.

Generated at Wed Feb 07 21:21:35 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.