[CDRIVER-3674] _mongoc_handshake_build_doc_with_application core dumps with strlen call Created: 15/May/20  Updated: 27/Oct/23  Resolved: 26/May/20

Status: Closed
Project: C Driver
Component/s: bsd, libmongoc
Affects Version/s: 1.14.0, 1.15.0, 1.16.2, 1.17.0-beta
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Sergey Baranov
Assignee: Kevin Albertson
Resolution: Works as Designed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OpenBSD 6.6, driver 1.16.2



 Description   

Hi,

I migrated to 1.16.2 (with MongoDB 3.2) from the very old 1.0.2 release (which I used with MongoDB 2.6). I have been using the driver for years with a simple setup, so I completed the migration with no changes to my mongoc snippets at all. The build is compiled from source with cmake / gcc, with no additional cmake options.

As soon as I ran the new build, my application started dumping core.

It happens when I call mongoc_collection_remove() or mongoc_collection_insert(), but not every time. The same query may or may not crash; it crashes about once in ten runs.

 

gdb trace here

#0 strlen () at /usr/src/lib/libc/arch/amd64/string/strlen.S:125
#1 0x000001ccf0f4db22 in _mongoc_handshake_build_doc_with_application () from /usr/local/lib/libmongoc-1.0.so.0.0
#2 0x000001ccf0f816d1 in _build_ismaster_with_handshake () from /usr/local/lib/libmongoc-1.0.so.0.0
#3 0x000001ccf0f815af in _mongoc_topology_scanner_get_ismaster () from /usr/local/lib/libmongoc-1.0.so.0.0
#4 0x000001ccf0f82c08 in _begin_ismaster_cmd () from /usr/local/lib/libmongoc-1.0.so.0.0
#5 0x000001ccf0f82a7d in mongoc_topology_scanner_node_setup_tcp () from /usr/local/lib/libmongoc-1.0.so.0.0
#6 0x000001ccf0f82203 in mongoc_topology_scanner_node_setup () from /usr/local/lib/libmongoc-1.0.so.0.0
#7 0x000001ccf0f8336b in mongoc_topology_scanner_start () from /usr/local/lib/libmongoc-1.0.so.0.0
#8 0x000001ccf0f7b2dc in mongoc_topology_scan_once () from /usr/local/lib/libmongoc-1.0.so.0.0
#9 0x000001ccf0f7b244 in _mongoc_topology_do_blocking_scan () from /usr/local/lib/libmongoc-1.0.so.0.0
#10 0x000001ccf0f7b88c in mongoc_topology_select_server_id () from /usr/local/lib/libmongoc-1.0.so.0.0
#11 0x000001ccf0f290c0 in _mongoc_cluster_select_server_id () from /usr/local/lib/libmongoc-1.0.so.0.0
#12 0x000001ccf0f24f14 in _mongoc_cluster_stream_for_optype () from /usr/local/lib/libmongoc-1.0.so.0.0
#13 0x000001ccf0f25029 in mongoc_cluster_stream_for_writes () from /usr/local/lib/libmongoc-1.0.so.0.0
#14 0x000001ccf0f2e2ad in _mongoc_collection_write_command_execute () from /usr/local/lib/libmongoc-1.0.so.0.0
#15 0x000001ccf0f30785 in mongoc_collection_remove () from /usr/local/lib/libmongoc-1.0.so.0.0

 

gcc -v
Reading specs from /usr/lib/gcc-lib/amd64-unknown-openbsd6.6/4.2.1/specs
Target: amd64-unknown-openbsd6.6
Configured with: OpenBSD/amd64 system compiler
Thread model: posix
gcc version 4.2.1 20070719

CVS source for this strlen:

https://cvsweb.openbsd.org/cgi-bin/cvsweb/~checkout~/src/lib/libc/arch/amd64/string/strlen.S?rev=1.8&content-type=text/plain

 

PS: I'm new here, so I may be missing something in the rules. Feel free to ask me for more details.



 Comments   
Comment by Sergey Baranov [ 26/May/20 ]

Thank you very much for the investigation and the focused support.

Comment by Kevin Albertson [ 26/May/20 ]

Great, glad to hear it is working now!

Comment by Sergey Baranov [ 26/May/20 ]

I think you are right!

I did a code review and noticed that a contributor had committed a new function containing mongoc_init/mongoc_cleanup calls. It is called a few steps before my mongoc routine, which contains these calls too. I knew that init and cleanup must each be called once, when the application starts and when it terminates, but since the new function does not actually do anything with the database yet (it is for future use), I could not catch it with my debugging. It seems it was sometimes leaving mongoc in an incorrect state.

I moved the mongoc_init/mongoc_cleanup calls out of those local functions and got 100+ application cycles with zero segfaults.

So I would ask you to close the issue.

Comment by Kevin Albertson [ 26/May/20 ]

Thanks for the response.

"And it seems I'm using a somewhat old mongoc API (exactly the same one I was using in mongoc 1.0.2). As far as I know it is not deprecated, but I thought about rewriting it with the newest API. Do you think that would change anything in the mongoc call chain? That does not seem to be the case."

That's true. mongoc_collection_insert is documented as being superseded by mongoc_collection_insert_one and mongoc_collection_insert_many. But mongoc_collection_insert is a light wrapper around mongoc_collection_insert_one. Though it's probably best to change those calls to the newer API, that should not change behavior.

One other guess... is it possible mongoc_cleanup is called before the application terminates? The snippets do not include those calls, but perhaps it is elsewhere. mongoc_cleanup cleans up all global state. If it was called (from any thread) that would invalidate libmongoc's global state (including the handshake).
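The lifecycle rule described above (initialize once at startup, clean up once at exit, never inside helper functions) can be sketched as a small guard. This is a minimal sketch, not part of libmongoc: every name in it (driver_lifecycle_start, driver_hook, the fake_* stand-ins) is hypothetical, and the init/cleanup functions are passed in as parameters so the pattern can be shown without linking the driver. It is not thread-safe and assumes it is called from the main thread before any other threads start.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical hook type: any void f(void) init/cleanup pair fits. */
typedef void (*driver_hook) (void);

static bool g_driver_ready = false;
static driver_hook g_driver_cleanup = NULL;

/* Stand-ins for the real init/cleanup functions, for illustration only. */
int fake_init_calls = 0;
int fake_cleanup_calls = 0;
void fake_init (void) { fake_init_calls++; }
void fake_cleanup (void) { fake_cleanup_calls++; }

/* Runs once at process exit, never from a helper function. */
static void
driver_lifecycle_at_exit (void)
{
   if (g_driver_ready && g_driver_cleanup != NULL) {
      g_driver_cleanup ();
      g_driver_ready = false;
   }
}

/* Call from main() before any other driver function.
 * Returns true only for the call that actually ran driver_init.
 * Later calls are no-ops, so a helper that calls this by mistake
 * cannot tear down state that the rest of the program still uses. */
bool
driver_lifecycle_start (driver_hook driver_init, driver_hook driver_cleanup)
{
   assert (driver_init != NULL && driver_cleanup != NULL);
   if (g_driver_ready) {
      return false; /* already initialized: do nothing */
   }
   driver_init ();
   g_driver_cleanup = driver_cleanup;
   g_driver_ready = true;
   atexit (driver_lifecycle_at_exit); /* cleanup happens only at exit */
   return true;
}
```

In a real application the call would presumably be driver_lifecycle_start (mongoc_init, mongoc_cleanup) at the top of main(), with no other code calling mongoc_cleanup directly.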

Comment by Sergey Baranov [ 26/May/20 ]

At first glance, simple looped snippets do not segfault for me...
I'm not sure how that relates to the issue, but it is a good signal to investigate my original process a bit deeper.

Kevin, let's put this issue on hold. I hope I will catch something new if the root cause turns out to be in my surrounding code.

Comment by Sergey Baranov [ 26/May/20 ]

I do use multiple threads, but the threads never work with mongoc (neither mongoc itself nor the data I use with mongoc); all mongoc operations are in the parent thread only. I also thought there might be some kind of race condition when the system runs multiple processes (multiple parent threads working with mongoc, one of which segfaults), but a single process segfaulted too.

I don't know; I should probably try writing a simple program that loops my snippet to check whether it crashes.
And it seems I'm using a somewhat old mongoc API (exactly the same one I was using in mongoc 1.0.2). As far as I know it is not deprecated, but I thought about rewriting it with the newest API. Do you think that would change anything in the mongoc call chain? That does not seem to be the case.

No, I do not call `mongoc_handshake_data_append`.

Comment by Kevin Albertson [ 25/May/20 ]

Thank you for the snippets. Those seem reasonable to me. Modifying them slightly and running them on my end did not reproduce the same segfault. So I may need more information to diagnose.

Given that this does not reproduce consistently, is your application creating multiple threads? Looking through mongoc-handshake.c, I do see a minor data race if multiple single-threaded mongoc_client_t were to be running in separate threads (as opposed to multiple mongoc_client_t obtained from a mongoc_client_pool_t). Sure enough, running an example that creates 100 threads with single-threaded mongoc_client_t objects produces a warning from ThreadSanitizer:

WARNING: ThreadSanitizer: data race (pid=31379)
 Write of size 1 at 0x000104572128 by thread T1:
 #0 _mongoc_handshake_freeze mongoc-handshake.c:565 (libmongoc-1.0.0.dylib:x86_64+0x6560b)
 #1 _mongoc_topology_do_blocking_scan mongoc-topology.c:643 (libmongoc-1.0.0.dylib:x86_64+0xa3c68)
 #2 mongoc_topology_select_server_id mongoc-topology.c:879 (libmongoc-1.0.0.dylib:x86_64+0xa46fa)
 #3 _mongoc_cluster_select_server_id mongoc-cluster.c:2236 (libmongoc-1.0.0.dylib:x86_64+0x301ae)
 #4 _mongoc_cluster_stream_for_optype mongoc-cluster.c:2282 (libmongoc-1.0.0.dylib:x86_64+0x2a5ac)
 #5 mongoc_cluster_stream_for_writes mongoc-cluster.c:2368 (libmongoc-1.0.0.dylib:x86_64+0x2a6e7)
 #6 _mongoc_collection_write_command_execute_idl mongoc-collection.c:94 (libmongoc-1.0.0.dylib:x86_64+0x383e6)
 #7 mongoc_collection_insert_one mongoc-collection.c:1639 (libmongoc-1.0.0.dylib:x86_64+0x381c7)
 #8 threadfn example-client.c:16 (example-client:x86_64+0x100003ab4)
 
Previous write of size 1 at 0x000104572128 by thread T2:
 #0 _mongoc_handshake_freeze mongoc-handshake.c:565 (libmongoc-1.0.0.dylib:x86_64+0x6560b)
 #1 _mongoc_topology_do_blocking_scan mongoc-topology.c:643 (libmongoc-1.0.0.dylib:x86_64+0xa3c68)
 #2 mongoc_topology_select_server_id mongoc-topology.c:879 (libmongoc-1.0.0.dylib:x86_64+0xa46fa)
 #3 _mongoc_cluster_select_server_id mongoc-cluster.c:2236 (libmongoc-1.0.0.dylib:x86_64+0x301ae)
 #4 _mongoc_cluster_stream_for_optype mongoc-cluster.c:2282 (libmongoc-1.0.0.dylib:x86_64+0x2a5ac)
 #5 mongoc_cluster_stream_for_writes mongoc-cluster.c:2368 (libmongoc-1.0.0.dylib:x86_64+0x2a6e7)
 #6 _mongoc_collection_write_command_execute_idl mongoc-collection.c:94 (libmongoc-1.0.0.dylib:x86_64+0x383e6)
 #7 mongoc_collection_insert_one mongoc-collection.c:1639 (libmongoc-1.0.0.dylib:x86_64+0x381c7)
 #8 threadfn example-client.c:16 (example-client:x86_64+0x100003ab4)

That is due to a boolean being written by both threads. That seems worthwhile to fix in its own right, so I filed a separate ticket: CDRIVER-3685. But I don't see how that connects to the crash in strlen you are observing.
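To illustrate the kind of race ThreadSanitizer reports here, the following is a minimal sketch of a shared "frozen" flag made safe with C11 atomics. It is a toy model with hypothetical names (handshake_model_freeze, g_frozen), not the code in mongoc-handshake.c and not necessarily how CDRIVER-3685 was or will be addressed.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a global flag that several threads set. */
static atomic_bool g_frozen = false;

/* Marks the shared state as frozen.
 * Returns true only for the one caller that performed the transition. */
bool
handshake_model_freeze (void)
{
   /* atomic_exchange stores true and returns the previous value as a
    * single indivisible operation, so concurrent callers do not race. */
   return !atomic_exchange (&g_frozen, true);
}

/* Reads the flag with an atomic load. */
bool
handshake_model_is_frozen (void)
{
   return atomic_load (&g_frozen);
}
```

With a plain bool, two threads writing the same value is still a data race in the C11 memory model, which is why the sanitizer flags it even though both writes store the same byte.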

Is it possible to include a compilable example? Including the cmake/make commands and their output may additionally help diagnose.

Additionally, I suspect the answer is "no", but is any part of your code calling `mongoc_handshake_data_append`?

Comment by Sergey Baranov [ 21/May/20 ]

And, as I said, these two may work fine most of the time.

Comment by Sergey Baranov [ 21/May/20 ]

Hi Kevin!

Sure.

For example, these two snippets segfault:

mongoc_client_t *mongoclient = NULL;
mongoc_collection_t *mongocollection = NULL;
const char *collection_name = "collection", *db_name = "db";
bson_oid_t oid;
bson_error_t error;
bson_t *doc = NULL, *query = NULL, *opts = NULL;
bson_iter_t iter;
const bson_t *doc_current = NULL;
mongoc_cursor_t *cursor = NULL;

mongoc_init();
mongoclient = mongoc_client_new(mongohost);
mongocollection = mongoc_client_get_collection(mongoclient, db_name, collection_name);

if(mongoclient) {
    snprintf(cidstr, sizeof(cidstr), "{\"key1\":\"val1\",\"key2\":\"val2\",\"key3\":\"val3\",\"data1\":\"data\"}");
    doc = bson_new();

    if(!bson_init_from_json(doc, cidstr, -1, &error)) {
        printf("BSON init error: %s\n", error.message);
    }

    bson_oid_init(&oid, NULL);
    BSON_APPEND_OID(doc, "_id", &oid);

    if(options&F_INTERNAL) {
        sstr = bson_as_json (doc, &len);
        printf("bson_as_json %s\n", sstr);
        bson_free (sstr);
    }

    //this ends with core dump
    if(!mongoc_collection_insert(mongocollection, MONGOC_INSERT_NONE, doc, NULL, &error)) {
        printf("%s\n", error.message);
    }

    bson_destroy(doc);
}

#ONE MORE SNIPPET
if(mongoclient) {
    query = bson_new();
    BSON_APPEND_UTF8(query, "id", external_p->id);
    opts = BCON_NEW ("limit", BCON_INT64 (0), "skip", BCON_INT64 (0));

    cursor = mongoc_collection_find_with_opts(mongocollection, query, opts, NULL);

    //this ends with core dump
    while (mongoc_cursor_next(cursor, &doc_current)) {
        if(bson_iter_init(&iter, doc_current) && bson_iter_find(&iter, "data")) {
            external_p->data->push(bson_iter_as_int64(&iter));
        }
    }

    mongoc_cursor_destroy(cursor);
    bson_destroy(opts);
    bson_destroy(query);
}

Comment by Kevin Albertson [ 21/May/20 ]

Hello asuwish.def@gmail.com , apologies for the delayed response.

The handshake pointed to in the included stack trace is initialized globally upon calling mongoc_init, which must be called at the start of the application (see http://mongoc.org/libmongoc/current/mongoc_init.html). Is it possible there is a missing call to mongoc_init?

Though, that is just speculation. If that does not resolve the issue, can you include any relevant snippet of your application code?

"mongoc_cursor_next() hits a segmentation fault with the same gdb trace as above.
mongoc_collection_find_with_opts() always works with no issues."

That is not unexpected based on the stack trace. The cursor returned by mongoc_collection_find_with_opts is lazy. It will not send the find command until the first call to mongoc_cursor_next.
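The lazy behavior described above can be pictured with a toy model. This is a minimal sketch using hypothetical types and names (lazy_cursor_t, lazy_cursor_find, lazy_cursor_next) and an in-memory array in place of a server; it does not reproduce libmongoc's implementation. It only shows why creating the cursor succeeds while the first "next" is where deferred work, and any failure in it, shows up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a lazy cursor; hypothetical, not libmongoc's types. */
typedef struct {
   const int *results; /* stand-in for documents held by the server */
   size_t count;
   size_t pos;
   bool sent;          /* has the simulated "find" been sent yet? */
   int commands_sent;  /* counts simulated round trips */
} lazy_cursor_t;

/* Creating the cursor only records the query; nothing is sent. */
lazy_cursor_t
lazy_cursor_find (const int *results, size_t count)
{
   lazy_cursor_t c = {results, count, 0, false, 0};
   return c;
}

/* The first call performs the deferred work. In this model that is just
 * a counter increment; in a real driver it is the point where the
 * command is actually sent. Later calls only advance through results. */
bool
lazy_cursor_next (lazy_cursor_t *c, int *out)
{
   assert (c != NULL && out != NULL);
   if (!c->sent) {
      c->sent = true;
      c->commands_sent++;
   }
   if (c->pos >= c->count) {
      return false; /* exhausted */
   }
   *out = c->results[c->pos++];
   return true;
}
```

In this model, commands_sent stays at 0 after lazy_cursor_find and becomes 1 on the first lazy_cursor_next, which mirrors why mongoc_collection_find_with_opts never crashed in the report while mongoc_cursor_next did.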

Comment by Sergey Baranov [ 20/May/20 ]

Is anyone here?

Just to update:
mongoc_cursor_next() hits a segmentation fault with the same gdb trace as above.
mongoc_collection_find_with_opts() always works with no issues.

Comment by Sergey Baranov [ 15/May/20 ]

The gdb trace for the insert is similar:
#0 strlen () at /usr/src/lib/libc/arch/amd64/string/strlen.S:125
#1 0x0000013159d9d8c4 in _mongoc_handshake_build_doc_with_application () from /usr/local/lib/libmongoc-1.0.so.0.0
#2 0x0000013159dd16d1 in _build_ismaster_with_handshake () from /usr/local/lib/libmongoc-1.0.so.0.0
#3 0x0000013159dd15af in _mongoc_topology_scanner_get_ismaster () from /usr/local/lib/libmongoc-1.0.so.0.0
#4 0x0000013159dd2c08 in _begin_ismaster_cmd () from /usr/local/lib/libmongoc-1.0.so.0.0
#5 0x0000013159dd2a7d in mongoc_topology_scanner_node_setup_tcp () from /usr/local/lib/libmongoc-1.0.so.0.0
#6 0x0000013159dd2203 in mongoc_topology_scanner_node_setup () from /usr/local/lib/libmongoc-1.0.so.0.0
#7 0x0000013159dd336b in mongoc_topology_scanner_start () from /usr/local/lib/libmongoc-1.0.so.0.0
#8 0x0000013159dcb2dc in mongoc_topology_scan_once () from /usr/local/lib/libmongoc-1.0.so.0.0
#9 0x0000013159dcb244 in _mongoc_topology_do_blocking_scan () from /usr/local/lib/libmongoc-1.0.so.0.0
#10 0x0000013159dcb88c in mongoc_topology_select_server_id () from /usr/local/lib/libmongoc-1.0.so.0.0
#11 0x0000013159d790c0 in _mongoc_cluster_select_server_id () from /usr/local/lib/libmongoc-1.0.so.0.0
#12 0x0000013159d74f14 in _mongoc_cluster_stream_for_optype () from /usr/local/lib/libmongoc-1.0.so.0.0
#13 0x0000013159d75029 in mongoc_cluster_stream_for_writes () from /usr/local/lib/libmongoc-1.0.so.0.0
#14 0x0000013159d7e878 in _mongoc_collection_write_command_execute_idl () from /usr/local/lib/libmongoc-1.0.so.0.0
#15 0x0000013159d7e72f in mongoc_collection_insert_one () from /usr/local/lib/libmongoc-1.0.so.0.0
#16 0x0000013159d7e496 in mongoc_collection_insert () from /usr/local/lib/libmongoc-1.0.so.0.0

Comment by Sergey Baranov [ 15/May/20 ]

Insert query (I replaced the real data):

{ "key1" : "val1", "key2" : int_val1, "key3" : "val3", "key4" : { "key1_1" : intval1_1, "key2_2" : "val2_2", "key3_3" : "val3_3", "key4_4" : int_val4_4, "key5_5" : "val5_5", "key6_6" : [ "val7_1", "val7_2", int_val7_3, "val7_4", "val7_5", "val7_6", "val7_7" ] }, "_id" : { "$oid" : "5ebe90e36359dc3d01255953" } }

The remove query is simple:

{ "key1" : "val1" }

 

Comment by Sergey Baranov [ 15/May/20 ]

I also tried 1.15.0, 1.14.0, and 1.17.0-beta, with the same issue.
