[CDRIVER-2507] MongoC 1.81 Crashed in mongoc_cluster_init() Created: 14/Feb/18 Updated: 08/Feb/23 Resolved: 08/Mar/18 |
|
| Status: | Closed |
| Project: | C Driver |
| Component/s: | libmongoc |
| Affects Version/s: | 1.8.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker - P1 |
| Reporter: | zhongxi yuan | Assignee: | A. Jesse Jiryu Davis |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Windows, VS2017 |
||
| Attachments: |
|
| Description |
|
ucrtbase.dll!abort() Unknown Non-user code. Symbols loaded. |
| Comments |
| Comment by A. Jesse Jiryu Davis [ 27/Mar/18 ] | |||||||||||
|
Hi! Thanks for the additional information. However, I will only be able to diagnose this problem if you are able to provide a small C program that reproduces this crash, that I can compile and execute on my computer. Thanks. | |||||||||||
| Comment by zhongxi yuan [ 27/Mar/18 ] | |||||||||||
|
Following are the crash dump and the PDB file of the mongoc 1.8.1 build I made. My environment is Visual Studio 2017. | |||||||||||
| Comment by zhongxi yuan [ 27/Mar/18 ] | |||||||||||
| Comment by zhongxi yuan [ 27/Mar/18 ] | |||||||||||
|
Hi There, | |||||||||||
| Comment by A. Jesse Jiryu Davis [ 08/Mar/18 ] | |||||||||||
|
Hi, I'm closing this bug for now. Please reopen it if you are able to provide a small C program that I can compile and execute on my computer that reproduces this crash. Thanks. | |||||||||||
| Comment by A. Jesse Jiryu Davis [ 01/Mar/18 ] | |||||||||||
|
Let me know if you have any additional information to help us debug this. Thanks! | |||||||||||
| Comment by A. Jesse Jiryu Davis [ 22/Feb/18 ] | |||||||||||
|
Yes, please build with debug symbols. I hope that will fix the stack trace so we can be more certain about what happened. | |||||||||||
| Comment by zhongxi yuan [ 22/Feb/18 ] | |||||||||||
|
Thanks for explaining it in detail. | |||||||||||
| Comment by A. Jesse Jiryu Davis [ 21/Feb/18 ] | |||||||||||
|
The initial call to mongoc_collection_find does not send a message to the MongoDB server at all; it only creates the cursor struct. Therefore, if you call mongoc_cursor_error right after mongoc_collection_find, the failover won't cause mongoc_cursor_error to return an error. The cursor doesn't know the server has failed, because it hasn't tried to send a message yet. When you make the first call to mongoc_cursor_next, the cursor sends its "find" message to the server. Since the server has failed over, the cursor receives a network error and mongoc_cursor_next returns false. Then mongoc_cursor_error returns true and reports the error message "socket error or timeout". I've verified this behavior in my own testing of the 1.8.1 driver. The driver returns an error and it does not crash. | |||||||||||
| Comment by zhongxi yuan [ 21/Feb/18 ] | |||||||||||
|
Thanks for investigating and for your input. It's valuable to know that a failover would not cause a crash when accessing the cursor. I have excluded the possibility of using the same mongoc_client_t in multiple threads. The application uses mongoc_client_pool_t to manage a pool of connections. When a thread has to access MongoDB, it requests a dedicated connection from the pool. As you mentioned>>>>>>>>>>>>> Is this error returned by mongoc_cursor_error(cursor.get(), &error)? In this case, the error can be detected and the program will quit. | |||||||||||
| Comment by A. Jesse Jiryu Davis [ 21/Feb/18 ] | |||||||||||
|
I don't have much insight to provide for you. The stack appears to be corrupt: it shows an impossible series of function calls. For example, _mongoc_cursor_get_opt_bool appears to call _mongoc_cursor_fetch_stream in your stack trace, but _mongoc_cursor_get_opt_bool does not actually call that function. You asked if I believe this is related to failover, and the answer is "yes." At the moment your program crashes, the driver is trying to send a "find" message to the primary server in order to get the first result for the cursor. (You have set the cursor's limit to 1.) The primary has failed over, so the driver gets a network error. The driver updates its internal data structures to remember that the server is unavailable, in mongoc_cluster_disconnect_node. I have tested this sequence of events and the driver behaves correctly. It returns an error from the cursor, and it does not crash for me. If you want to verify whether one of the two BSON_ASSERT statements in mongoc_topology_description_handle_ismaster is responsible for the crash, try adding these lines to the beginning of mongoc_topology_description_handle_ismaster:
I have to advise you: since I cannot reproduce this crash on my own system, and since no one else has reported a crash similar to this one, I suspect the problem is in your application. You may have a bug that causes memory corruption. If you provide me a code example that I can compile that reproduces this crash, then I will certainly investigate further. One more question: are you using a single mongoc_client_t, or a single mongoc_cursor_t, from multiple threads? If you are, then this might be the cause. Only mongoc_client_pool_t is thread-safe; all other functions and structs in the C Driver are not thread-safe and must not be used from multiple threads. | |||||||||||
| Comment by zhongxi yuan [ 21/Feb/18 ] | |||||||||||
|
Hi Jesse, thanks for replying. The MongoDB version number is 3.2.18-5-g8a10308. I don't have a short code example that can reproduce the crash. The crash was found in a production system and happens once every few days. I will keep an eye on the system and try to collect more information. In the meantime, it would be helpful if you could provide some insight based on the call stack and the crashing code. If you believe this is related to failover, then I will arrange some time to test MongoDB failover. | |||||||||||
| Comment by A. Jesse Jiryu Davis [ 20/Feb/18 ] | |||||||||||
|
Thanks, I've tried to reproduce this crash using a large collection of documents in a replica set and querying it with mongoc-client. I can test what happens with C Driver 1.8.1 when the cursor begins on the primary and then the primary shuts down. The driver returns an error as expected: Cursor Failure: Failed to send "getMore" command with database "test": socket error or timeout. The driver does not crash. In order to try to reproduce this, please tell me what MongoDB version you use and please provide a short code example that I can compile and run on my computer which will reproduce this crash. Thanks very much. | |||||||||||
| Comment by zhongxi yuan [ 20/Feb/18 ] | |||||||||||
|
froderik described a MongoDB cursor issue that may be related. https://stackoverflow.com/questions/36766956/what-is-a-cursor-in-mongodb I am by no means a mongodb expert but I just want to add some observations from working in a medium sized mongo system for the last year. Also thanks to @xameeramir for the excellent walkthrough about how cursors work in general. The causes of a "cursor lost" exception may be several. One that I have noticed is explained in this answer. The cursor lives server side. It is not distributed over a replica set but exists on the instance that is primary at the time of creation. This means that if another instance takes over as primary the cursor will be lost to the client. If the old primary is still up and around it may still be there but of no use. I guess it is garbage collected away after a while. So if your mongo replica set is unstable or you have a shaky network in front of it you are out of luck when doing any long running queries. If the full content of what the cursor wants to return does not fit in memory on the server the query may be very slow. RAM on your servers needs to be larger than the largest query you run. All this can partly be avoided by designing better. For a use case with large long running queries you may be better off with several smaller database collections instead of a big one. | |||||||||||
| Comment by zhongxi yuan [ 20/Feb/18 ] | |||||||||||
|
I attached a screenshot of the call stack. The same code snippet validates the cursor before accessing its content, but it still crashes inside mongoc_cursor_next(). It looks like a failover happens while mongoc_cursor_next() is in progress. It would be helpful if you could share your opinion on the potential root cause of the crash. >>>>>>>>>>>>>>>>>>>>>sample code>>>>>>>>>>>>>>>>
| |||||||||||
| Comment by zhongxi yuan [ 18/Feb/18 ] | |||||||||||
|
I am not sure whether it was at line 1768 or 1769; from the disassembly I guess it was. I will share the code later and will try to reproduce the crash. | |||||||||||
| Comment by A. Jesse Jiryu Davis [ 14/Feb/18 ] | |||||||||||
|
Thanks for the report. Do you have any evidence that libmongoc is aborting at mongoc-topology-description.c on line 1768 or 1769? Or is that just a guess? Could you please share a short code example that I can compile which will reproduce this crash? And finally, could you please reproduce this crash in your own system with a build of libmongoc that has debug symbols enabled? That will give us a more detailed stack trace. | |||||||||||
| Comment by zhongxi yuan [ 14/Feb/18 ] | |||||||||||
| Comment by zhongxi yuan [ 14/Feb/18 ] | |||||||||||