[CDRIVER-2507] MongoC 1.8.1 crashed in mongoc_cluster_init() Created: 14/Feb/18  Updated: 08/Feb/23  Resolved: 08/Mar/18

Status: Closed
Project: C Driver
Component/s: libmongoc
Affects Version/s: 1.8.1
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: zhongxi yuan Assignee: A. Jesse Jiryu Davis
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Windows, VS2017


Attachments: PNG File WorkerShell.exe.25044.dmp.png     File case2507.7z     PNG File screenshot-1.png     PNG File screenshot-2.png     PNG File screenshot-3.png    

 Description   

ucrtbase.dll!abort() Unknown Non-user code. Symbols loaded.
libmongoc-1.0.dll!mongoc_topology_description_handle_ismaster() C Non-user code. Symbols loaded.
libmongoc-1.0.dll!mongoc_topology_description_invalidate_server() C Non-user code. Symbols loaded.
libmongoc-1.0.dll!mongoc_topology_invalidate_server() C Non-user code. Symbols loaded.
libmongoc-1.0.dll!mongoc_cluster_disconnect_node() C Non-user code. Symbols loaded.
libmongoc-1.0.dll!mongoc_cluster_init() C Non-user code. Symbols loaded.
libmongoc-1.0.dll!mongoc_client_pool_try_pop() C Non-user code. Symbols loaded.
libmongoc-1.0.dll!mongoc_client_pool_try_pop() C Non-user code. Symbols loaded.
> libmongoc-1.0.dll!mongoc_client_pool_try_pop() C Non-user code. Symbols loaded.
libmongoc-1.0.dll!_mongoc_cluster_stream_for_server() C Non-user code. Symbols loaded.
libmongoc-1.0.dll!_mongoc_cluster_buffer_iovec() C Non-user code. Symbols loaded.
libmongoc-1.0.dll!_mongoc_cursor_fetch_stream() C Non-user code. Symbols loaded.
libmongoc-1.0.dll!_mongoc_cursor_get_opt_bool() C Non-user code. Symbols loaded.
libmongoc-1.0.dll!_mongoc_cursor_next() C Non-user code. Symbols loaded.
libmongoc-1.0.dll!mongoc_cursor_next() C Non-user code. Symbols loaded.
MongoDBWrapper.dll!00007ffcd7f5f8c7() Unknown No symbols loaded.



 Comments   
Comment by A. Jesse Jiryu Davis [ 27/Mar/18 ]

Hi! Thanks for the additional information. However, I will only be able to diagnose this problem if you are able to provide a small C program that reproduces this crash, that I can compile and execute on my computer. Thanks.

Comment by zhongxi yuan [ 27/Mar/18 ]

Attached are the crash dump and the PDB file for the mongoc 1.8.1 build I made. My environment is Visual Studio 2017.
case2507.7z

Comment by zhongxi yuan [ 27/Mar/18 ]

Hi There,
I am using the release build with debug info. Today it crashed again, at the same position as last time.
Can you take a look?

Comment by A. Jesse Jiryu Davis [ 08/Mar/18 ]

Hi, I'm closing this bug for now. Please reopen it if you are able to provide a small C program that I can compile and execute on my computer that reproduces this crash. Thanks.

Comment by A. Jesse Jiryu Davis [ 01/Mar/18 ]

Let me know if you have any additional information to help us debug this. Thanks!

Comment by A. Jesse Jiryu Davis [ 22/Feb/18 ]

Yes, please build with debug symbols. I hope that will fix the stack trace so we can be more certain about what happened.

Comment by zhongxi yuan [ 22/Feb/18 ]

Thanks for explaining it in detail.
I understand there is not much you can do at the moment. On my side, one thing I can do is build mongoc with debugging symbols enabled, so that the next time the crash happens I can provide more details about the input arguments. Feel free to suggest anything else I can do to help troubleshoot this issue.

Comment by A. Jesse Jiryu Davis [ 21/Feb/18 ]

The initial call to mongoc_collection_find does not actually send a message to the MongoDB server at all; it only creates the cursor struct. Therefore, if you call mongoc_cursor_error right after mongoc_collection_find, the failover won't cause mongoc_cursor_error to return an error. The cursor doesn't know the server has failed, because it hasn't tried to send a message yet.

When you make the first call to mongoc_cursor_next, the cursor sends its "find" message to the server. Since the server has failed over, the cursor receives a network error and mongoc_cursor_next returns false. Then, mongoc_cursor_error returns true and reports an error message "socket error or timeout." I've verified this behavior in my own testing of the 1.8.1 driver. The driver returns an error and it does not crash.
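
To illustrate the sequence described above, here is a toy model in C. It is only a sketch and does not use libmongoc: the toy_* types and functions are hypothetical stand-ins that mimic the described behavior, where creating the cursor does no I/O and the first "next" call is where a network failure becomes visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: hypothetical names, NOT the libmongoc API. */
typedef struct {
   bool server_up; /* simulated state of the primary */
} toy_server_t;

typedef struct {
   toy_server_t *server;
   bool sent;      /* has the "find" message been sent yet? */
   bool failed;    /* set once a send attempt has failed */
   int remaining;  /* documents left to return (the limit) */
   char message[64];
} toy_cursor_t;

/* Mimics the described find call: fills in a struct, sends nothing. */
toy_cursor_t
toy_find (toy_server_t *server, int limit)
{
   toy_cursor_t c;
   memset (&c, 0, sizeof c);
   c.server = server;
   c.remaining = limit;
   return c;
}

/* The first call "sends" the query; this is where a failover shows up. */
bool
toy_cursor_next (toy_cursor_t *c, int *doc)
{
   if (c->failed) {
      return false;
   }
   if (!c->server->server_up) {
      c->failed = true;
      strcpy (c->message, "socket error or timeout");
      return false;
   }
   c->sent = true;
   if (c->remaining <= 0) {
      return false; /* exhausted, not an error */
   }
   c->remaining--;
   *doc = 42; /* placeholder document */
   return true;
}

/* Reports an error only after a send attempt has actually failed. */
bool
toy_cursor_error (const toy_cursor_t *c, const char **msg)
{
   if (c->failed && msg) {
      *msg = c->message;
   }
   return c->failed;
}
```

In this model, an error check made between toy_find and the first toy_cursor_next always reports "no error", even if the server went down in between. The check that matters is the one made after toy_cursor_next returns false, which matches the usual libmongoc pattern of calling mongoc_cursor_error once mongoc_cursor_next has returned false.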

Comment by zhongxi yuan [ 21/Feb/18 ]

Thanks for your investigation and input. It is valuable to know that a failover should not cause a crash when accessing a cursor.

I have excluded the possibility of using the same mongoc_client_t in multiple threads. The application uses mongoc_client_pool_t to manage a pool of connections. When a thread has to access MongoDB, it requests a dedicated connection from the pool.

As you mentioned:
The driver returns an error as expected: Cursor Failure: Failed to send "getMore" command with database "test": socket error or timeout. The driver does not crash.

Is this error returned by mongoc_cursor_error(cursor.get(), &error)? In that case, the error can be detected and the program will quit.
But is it possible for a failover to happen after mongoc_cursor_error() but right before mongoc_cursor_next()? Would that cause an unhandled situation and a crash?

Comment by A. Jesse Jiryu Davis [ 21/Feb/18 ]

I don't have much insight to provide for you. The stack appears to be corrupt: it shows an impossible series of function calls. For example, _mongoc_cursor_get_opt_bool appears to call _mongoc_cursor_fetch_stream in your stack trace, but _mongoc_cursor_get_opt_bool does not actually call that function.

You asked if I believe this is related to failover, and the answer is "yes." At the moment your program crashes, the driver is trying to send a "find" message to the primary server in order to get the first result for the cursor. (You have set the cursor's limit to 1.) The primary has failed over, so the driver gets a network error. The driver updates its internal data structures to remember that the server is unavailable, in mongoc_cluster_disconnect_node. I have tested this sequence of events and the driver behaves correctly. It returns an error from the cursor, and it does not crash for me.

If you want to verify whether one of the two BSON_ASSERT statements in mongoc_topology_description_handle_ismaster is responsible for the crash, try adding these lines to the beginning of mongoc_topology_description_handle_ismaster:

   if (!topology || server_id == 0) {
      MONGOC_ERROR ("topology = %p, server_id = %" PRIu32, (void *) topology, server_id);
   }
 
   BSON_ASSERT (topology);
   BSON_ASSERT (server_id != 0);

I have to advise you, since I cannot reproduce this crash on my own system, and since no one else has reported a crash similar to this one, I suspect the problem is in your application. You may have a bug that causes memory corruption. If you provide me a code example that I can compile that reproduces this crash then I will certainly investigate further.

One more question: are you using a single mongoc_client_t, or a single mongoc_cursor_t, from multiple threads? If you are, then this might be the cause. Only mongoc_client_pool_t is thread-safe; all other functions and structs in the C Driver are not thread-safe and must not be used from multiple threads.

Comment by zhongxi yuan [ 21/Feb/18 ]

Hi, Jesse

Thanks for replying. The MongoDB version number is 3.2.18-5-g8a10308.

I don't have a short code example that can reproduce the crash. The crash was found in a production system and happens once every few days. I will keep an eye on the system and try to collect more information. In the meantime, it would be helpful if you could provide some insight based on the call stack and the crashing code. If you believe this is related to failover, then I will arrange some time to test MongoDB failover.

Comment by A. Jesse Jiryu Davis [ 20/Feb/18 ]

Thanks, I've tried to reproduce this crash using a large collection of documents in a replica set and querying it with mongoc-client. I can test what happens with C Driver 1.8.1 when the cursor begins on the primary and then the primary shuts down. The driver returns an error as expected: Cursor Failure: Failed to send "getMore" command with database "test": socket error or timeout. The driver does not crash.

In order to try to reproduce this, please tell me what MongoDB version you use and please provide a short code example that I can compile and run on my computer which will reproduce this crash. Thanks very much.

Comment by zhongxi yuan [ 20/Feb/18 ]

froderik described a mongodb cursor issue that may be related.

https://stackoverflow.com/questions/36766956/what-is-a-cursor-in-mongodb

I am by no means a mongodb expert but I just want to add some observations from working in a medium sized mongo system for the last year. Also thanks to @xameeramir for the excellent walkthrough about how cursors work in general.

The causes of a "cursor lost" exception may be several. One that I have noticed is explained in this answer.

The cursor lives server side. It is not distributed over a replica set but exists on the instance that is primary at the time of creation. This means that if another instance takes over as primary the cursor will be lost to the client. If the old primary is still up and around it may still be there but of no use. I guess it is garbage collected away after a while. So if your mongo replica set is unstable or you have a shaky network in front of it you are out of luck when doing any long running queries.

If the full content of what the cursor wants to return does not fit in memory on the server the query may be very slow. RAM on your servers needs to be larger than the largest query you run.

All this can partly be avoided by designing better. For a use case with large long running queries you may be better off with several smaller database collections instead of a big one.

Comment by zhongxi yuan [ 20/Feb/18 ]

I attached a screenshot of the call stack. The code snippet below validates the cursor before accessing its content, but it still crashes inside mongoc_cursor_next().

It looks like a failover happens while mongoc_cursor_next() is running. It would be helpful if you could share your opinion on the potential root cause of the crash.

Sample code:

		ScopeCollection collection(&connection, ns);
		// mongoc_collection_find arguments: flags, skip = 0, limit = 1, batch_size = 0.
		auto cursor = std::unique_ptr<mongoc_cursor_t, void(*)(mongoc_cursor_t*)>(
			mongoc_collection_find(collection.coll(), MONGOC_QUERY_NONE, 0, 1, 0, query.get(),
				rtnFields.empty() ? NULL : fieldsToReturn.get(), NULL),
			mongoc_cursor_destroy);
		bson_error_t error;
		if (mongoc_cursor_error(cursor.get(), &error)) {
			return false;
		}
		const bson_t *doc;
		if (mongoc_cursor_next(cursor.get(), &doc))
		{ ...
 

Comment by zhongxi yuan [ 18/Feb/18 ]

I am not sure whether it was at line 1768 or 1769; from the disassembly I guess it was.
My project is built with Visual Studio 2017.

I will share the code later and will try to reproduce the crash.
Thanks for responding so quickly.

Comment by A. Jesse Jiryu Davis [ 14/Feb/18 ]

Thanks for the report. Do you have any evidence that libmongoc is aborting at mongoc-topology-description.c on line 1768 or 1769? Or is that just a guess?

Could you please share a short code example that I can compile which will reproduce this crash?

And finally, could you please reproduce this crash in your own system with a build of libmongoc that has debug symbols enabled? That will give us a more detailed stack trace.

Generated at Wed Feb 07 21:15:26 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.