[CDRIVER-1211] Get Segmentation fault (11) when using mongoc_bulk_operation_execute Created: 27/Apr/16  Updated: 03/May/17  Resolved: 17/May/16

Status: Closed
Project: C Driver
Component/s: libmongoc
Affects Version/s: 1.3.5
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: tianlei.shi Assignee: A. Jesse Jiryu Davis
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

In CentOS7.0, RAID5, xfs filesystem, NUMA system.


Attachments: Text File bulk_crash.c    

 Description   

We occasionally get this stack trace when using MongoDB, and my application keeps crashing:

Got signal: Segmentation fault (11), address is 0xb0 from 0x7f9ea3a010a2
=======Stack Trace===========
/opt/NetSensor/lib/libmongoc-1.0.so.0(+0x27ea1) [0x7f9ea3a02ea1]
/opt/NetSensor/lib/libmongoc-1.0.so.0(+0x26e33) [0x7f9ea3a01e33]
/opt/NetSensor/lib/libmongoc-1.0.so.0(+0xf735) [0x7f9ea39ea735]
/opt/NetSensor/lib/libmongoc-1.0.so.0(+0xf98d) [0x7f9ea39ea98d]
/opt/NetSensor/lib/libmongoc-1.0.so.0(+0x3065e) [0x7f9ea3a0b65e]
/opt/NetSensor/lib/libmongoc-1.0.so.0(+0x30913) [0x7f9ea3a0b913]
/opt/NetSensor/lib/libmongoc-1.0.so.0(mongoc_bulk_operation_execute+0x109) [0x7f9ea39e7669]

The size of each bulk is limited to 1000 in my application, and I have 4 bulks which insert into 4 different collections in the same database.
The execution sequence of the mongoc_bulk_operation_execute for each bulk is random in my application.



 Comments   
Comment by A. Jesse Jiryu Davis [ 17/May/16 ]

This code cannot compile because you have redefined "collection" in insert_bulk. There are too many other bugs for me to determine which is actually the cause of your crash.

For example, when insert_bulk does:

bulk = mongoc_collection_create_bulk_operation (collection, false, NULL);

... that does not affect the caller. Therefore each call to insert_bulk creates a new bulk operation, and only executes the bulk for one document out of 999.

I'm not able to diagnose your issue without code that I can compile and run, but I'm convinced it is not a bug in the driver.
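
The point about the assignment not affecting the caller can be illustrated without the driver. The following is a minimal sketch, not the attached bulk_crash.c: demo_bulk_t, demo_bulk_new, insert_by_value and insert_by_reference are hypothetical stand-ins, and no libmongoc calls are used. It only shows that assigning to a pointer parameter rebinds the callee's local copy, while passing the address of the pointer lets the callee update the caller's handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a bulk operation handle; not a libmongoc type. */
typedef struct {
    int queued; /* number of documents queued so far */
} demo_bulk_t;

/* Stand-in for a "create bulk operation" call: returns a fresh, empty handle. */
static demo_bulk_t *demo_bulk_new(void) {
    return calloc(1, sizeof(demo_bulk_t));
}

/* Pattern described in the comment above: `bulk` is a local copy of the
 * caller's pointer, so the assignment is invisible to the caller. Every call
 * starts a new handle that holds exactly one document.
 * Returns the queued count of the handle used by this call, or -1 on failure. */
static int insert_by_value(demo_bulk_t *bulk) {
    int queued;
    bulk = demo_bulk_new(); /* rebinds the local copy only */
    if (bulk == NULL) {
        return -1;
    }
    bulk->queued++;
    queued = bulk->queued;
    free(bulk); /* the handle never escapes this function */
    return queued;
}

/* One possible fix: take the address of the caller's pointer, create the
 * handle only when none exists yet, and keep accumulating into it.
 * Returns the queued count after this insert, or -1 on failure. */
static int insert_by_reference(demo_bulk_t **bulk) {
    assert(bulk != NULL);
    if (*bulk == NULL) {
        *bulk = demo_bulk_new(); /* updates the caller's pointer */
        if (*bulk == NULL) {
            return -1;
        }
    }
    (*bulk)->queued++;
    return (*bulk)->queued;
}
```

With a caller-side pointer initialized to NULL, repeated calls to insert_by_value leave that pointer NULL and always report a single queued document, whereas repeated calls to insert_by_reference(&bulk) accumulate documents in one handle that the caller owns and must release.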

Comment by tianlei.shi [ 17/May/16 ]

I think stream->writev points to _mongoc_stream_socket_writev, am I right?

Comment by tianlei.shi [ 17/May/16 ]

The attached code is more or less the same logic as the code in our software.
The crash happened in the function "insert_bulk".
We insert a huge amount of data in our software using the bulk operation.

Comment by tianlei.shi [ 17/May/16 ]

This can only be reproduced in a specific environment, and it happens randomly.
Our software is also very large.

The bulk is created in this way: mongoc_collection_create_bulk_operation(dbOpt->collection, false, NULL);

According to the backtrace, it seems the crash happened in ret = stream->writev(stream, iov, iovcnt, timeout_msec);
Would you please tell me which function stream->writev points to when the bulk is created in that way?
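
For readers following this question: the call quoted above goes through a function pointer stored in the stream object, so it runs whatever address that field holds at the time of the call. Which concrete function is stored depends on how the stream was constructed. The sketch below shows that dispatch pattern in isolation. It is an illustration under stated assumptions, not the driver's code: demo_stream_t, demo_socket_writev, demo_socket_stream_init and demo_stream_writev are hypothetical names, and the buffers are plain strings rather than iovec entries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct demo_stream demo_stream_t;

/* Hypothetical stream object: the write routine is chosen at construction
 * time and stored as a function pointer inside the object. */
struct demo_stream {
    long (*writev)(demo_stream_t *stream, const char *const *bufs, size_t nbufs);
    long bytes_written;
};

/* One concrete implementation: "writes" by counting the bytes in each buffer. */
static long demo_socket_writev(demo_stream_t *stream,
                               const char *const *bufs, size_t nbufs) {
    long total = 0;
    size_t i;
    for (i = 0; i < nbufs; i++) {
        total += (long) strlen(bufs[i]);
    }
    stream->bytes_written += total;
    return total;
}

/* Constructor: this is the place that decides what stream->writev points to. */
static void demo_socket_stream_init(demo_stream_t *stream) {
    stream->writev = demo_socket_writev;
    stream->bytes_written = 0;
}

/* Generic entry point: dispatches through the stored pointer. If the stream
 * object had been freed or overwritten, this call would jump to whatever
 * value happened to be in the `writev` field. */
static long demo_stream_writev(demo_stream_t *stream,
                               const char *const *bufs, size_t nbufs) {
    assert(stream != NULL);
    assert(stream->writev != NULL);
    return stream->writev(stream, bufs, nbufs);
}
```

In a debugger, printing the pointer field of the stream in the calling frame shows where the call would go, which is usually more reliable than inferring the target from how the bulk operation was created.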

Comment by A. Jesse Jiryu Davis [ 17/May/16 ]

Thanks for the backtrace, but I still need to see your code in order to diagnose the crash. Can you please provide a Short, Self Contained, Compilable Example (http://sscce.org/) of code that reproduces this crash? Otherwise I cannot diagnose it. Thanks!

Comment by tianlei.shi [ 17/May/16 ]

Hi Davis,

Having recompiled libmongoc in debug mode, I got this stack trace from gdb. Would you please take a look?

(gdb) bt
#0 0x00007fff2effc0b8 in ?? ()
#1 0x00007ffff79820e1 in mongoc_stream_writev (stream=0x7fff182cefc0, iov=0x7fff18924cc0, iovcnt=9, timeout_msec=1000)
at src/mongoc/mongoc-stream.c:162
#2 0x00007ffff7982acf in _mongoc_stream_writev_full (stream=0x7fff182cefc0, iov=0x7fff18924cc0, iovcnt=9,
timeout_msec=1000, error=0x7fff18323b38) at src/mongoc/mongoc-stream.c:486
#3 0x00007ffff795f686 in mongoc_cluster_run_command_rpc (cluster=0xb7d74f8, stream=0x7fff182cefc0, server_id=1,
command_name=0x7fff187248a5 "insert", rpc=0x7fff2effc0b0, reply_rpc=0x7fff2effc0b0, buffer=0x7fff2effbfd0,
error=0x7fff18323b38) at src/mongoc/mongoc-cluster.c:162
#4 0x00007ffff795f977 in mongoc_cluster_run_command (cluster=0xb7d74f8, stream=0x7fff182cefc0, server_id=1,
flags=MONGOC_QUERY_NONE, db_name=0x7fff18000b40 "NPM", command=0x7fff2effc280, reply=0x7fff2effc200,
error=0x7fff18323b38) at src/mongoc/mongoc-cluster.c:263
#5 0x00007ffff798ede8 in _mongoc_write_command (command=0x7fff18323e40, client=0xb7d74e0, server_stream=0x7fff18001e50,
database=0x7fff18000b40 "NPM", collection=0x7fff180019c0 "net_meta_167885562_if2", write_concern=0x7fff18323ed0,
offset=1000, result=0x7fff183238b0, error=0x7fff18323b38) at src/mongoc/mongoc-write-command.c:916
#6 0x00007ffff798f196 in _mongoc_write_command_execute (command=0x7fff18323e40, client=0xb7d74e0,
server_stream=0x7fff18001e50, database=0x7fff18000b40 "NPM", collection=0x7fff180019c0 "net_meta_167885562_if2",
write_concern=0x7fff18323ed0, offset=0, result=0x7fff183238b0) at src/mongoc/mongoc-write-command.c:978
#7 0x00007ffff795ad8a in mongoc_bulk_operation_execute (bulk=0x7fff18323830, reply=0x7fff2effc600, error=0x7fff2effc6f0)
at src/mongoc/mongoc-bulk-operation.c:438
#8 0x000000000043b027 in databaseNoSqlBulkExecute (dbOpt=0x7fff2effc960, instId=2)
at /home/release/peak2/tag/DCITS/protoAnalyzer/database/databaseLog.c:390
#9 0x0000000000451949 in netStatsLogFlowMetaToDatabase (instId=2, ruleId=10246, data=0x7ffd230eb7b0)
at /home/release/peak2/tag/DCITS/protoAnalyzer/netStats/netStatsDatabaseLog.c:577
#10 0x000000000044f684 in netStatsLogFlowMetaData (instId=2, ruleId=10246, msg=0x7fff34002688)
at /home/release/peak2/tag/DCITS/protoAnalyzer/netStats/netStats.c:1071
#11 0x000000000044fa42 in netStatsRecordCb (msg=0x7fff34002650)
at /home/release/peak2/tag/DCITS/protoAnalyzer/netStats/netStats.c:1248
#12 0x00000000004538c4 in netStatsLogHandle (data=0x0)
at /home/release/peak2/tag/DCITS/protoAnalyzer/netStats/netStatsLog.c:617
#13 0x000000000045d2fa in protoAnalyzerDpdkProcess (arg=0x0)
at /home/release/peak2/tag/DCITS/protoAnalyzer/platform/platformLinuxDpdk.c:409
#14 0x00000000004d1f95 in eal_thread_loop ()
#15 0x00007ffff6349df5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007ffff60771ad in clone () from /lib64/libc.so.6

Comment by tianlei.shi [ 29/Apr/16 ]

Thanks, I'll try that when I reproduce it again.

Comment by A. Jesse Jiryu Davis [ 28/Apr/16 ]

Hi, sending me an .so file won't help, thanks. If you can send me a program that reproduces the error, which I can compile and run, then I can diagnose the error. If you're unable to do that, please at least recompile libmongoc in debug mode:

./configure --enable-debug
make
sudo make install

Then the stack trace will include function names.

Comment by tianlei.shi [ 28/Apr/16 ]

I am using mongoc_bulk_operation_execute in a large system, and this issue isn't 100% reproducible. Sometimes it happens, and after rebooting the OS it works fine again.

Could you find which line of code is for /opt/NetSensor/lib/libmongoc-1.0.so.0(+0x27ea1) [0x7f9ea3a02ea1]?
Do you need me to provide the .so file?

Comment by A. Jesse Jiryu Davis [ 27/Apr/16 ]

Can you please provide a Short, Self Contained, Compilable Example (http://sscce.org/) of code that reproduces this crash? Otherwise I cannot diagnose it. Thanks!
