[CDRIVER-1940] Experiencing hang when try to clean up mongo connection pool Created: 30/Nov/16 Updated: 11/Sep/19 Resolved: 01/Feb/17 |
|
| Status: | Closed |
| Project: | C Driver |
| Component/s: | libmongoc |
| Affects Version/s: | 1.3.3 |
| Fix Version/s: | TBD |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Tanmoy Palit | Assignee: | Unassigned |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Windows 7 |
||
| Case: | (copied to CRM) |
| Description |
|
We are using smart pointers to initialize the mongo connection pool. The following is the code snippet:
But the application seems to hang when mongoc_client_pool_destroy(p) is called. Further debugging shows it is waiting at the following code (mongoc-topology.c):
If we remove the mongoc_client_pool_destroy(p) call, however, the application seems to work fine. |
| Comments |
| Comment by A. Jesse Jiryu Davis [ 06/Mar/19 ] | |||||
|
Oh good point. Then I plan to leave it closed! | |||||
| Comment by Michael Chadwick [ 05/Mar/19 ] | |||||
|
The ticket was already closed. I am commenting on this ~2 year old closed ticket because I was having the same issue. | |||||
| Comment by A. Jesse Jiryu Davis [ 05/Mar/19 ] | |||||
|
Thanks for your reply. I think we should close this ticket. It sounds like the C Driver works as designed when used in the expected way: destroy all C Driver objects before the C Driver DLL is unloaded. If we continued trying to diagnose this issue it seems we'd have difficulty even reproducing it, much less diagnosing it, and once we had diagnosed it we might just return to the same conclusion: destroy all C Driver objects before the C Driver DLL is unloaded. | |||||
| Comment by Michael Chadwick [ 05/Mar/19 ] | |||||
|
Hmm, that's a good question. Maybe it got unloaded and then is getting reloaded when the destructor of the static object needs to call `mongoc_client_pool_destroy`. This stuff is happening just below my sphere of knowledge. I can't say definitively what's the root cause, but I can tell you definitively that when the destructor of my static object calls `mongoc_client_pool_destroy`, the topology background thread is no longer there. But from the code it looks to me like there should be no way for that thread to be gone before that procedure is called. I can also say definitively that if I call `mongoc_client_pool_destroy` before returning from main, the problem goes away (my static object sees there is nothing to destroy and so does nothing). | |||||
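The workaround described above (destroy the pool before returning from main, rather than from a static object's destructor) can be sketched as follows. This is only an illustration under stated assumptions: `FakePool`, `fake_pool_new`, `fake_pool_destroy`, `fake_driver_cleanup` and `run_application` are hypothetical stand-ins so the sketch compiles without the driver. Real code would presumably call `mongoc_client_pool_new`, `mongoc_client_pool_destroy` and `mongoc_cleanup` in their place.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for the driver API, so the sketch builds without libmongoc.
struct FakePool {};
static std::vector<std::string> g_events;  // records call order for illustration

FakePool* fake_pool_new() {
    g_events.push_back("pool_new");
    return new FakePool;
}

void fake_pool_destroy(FakePool* p) {
    assert(p != nullptr);  // unique_ptr never invokes the deleter on null
    g_events.push_back("pool_destroy");
    delete p;
}

void fake_driver_cleanup() { g_events.push_back("driver_cleanup"); }

// Custom deleter so a smart pointer releases the pool through the driver call.
struct PoolDeleter {
    void operator()(FakePool* p) const { fake_pool_destroy(p); }
};
using PoolPtr = std::unique_ptr<FakePool, PoolDeleter>;

// Stand-in for the body of main(): the pool is a local, not a static or
// global, so it is destroyed at a known point before the process starts
// tearing down statics and unloading DLLs.
std::vector<std::string> run_application() {
    g_events.clear();
    {
        PoolPtr pool(fake_pool_new());
        g_events.push_back("work");
    }  // pool destroyed here, while the driver is still loaded
    fake_driver_cleanup();
    return g_events;
}
```

Calling `run_application()` records the pool being destroyed before the driver-level cleanup, which is the ordering the thread converges on. If a static owner cannot be avoided, explicitly resetting it before main returns would give the same ordering.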
| Comment by A. Jesse Jiryu Davis [ 04/Mar/19 ] | |||||
|
I don't quite understand the scenario. If the libmongoc DLL has been unloaded, how is it possible to call the function mongoc_client_pool_destroy, which is defined in that DLL? | |||||
| Comment by Michael Chadwick [ 04/Mar/19 ] | |||||
|
What you expect is what I would also expect. I didn't see any documentation for `WakeConditionVariable` that would imply it blocks, but I didn't see anything about it not blocking either. I searched the mongo code base for `cond_server`. I found one method that waited on that condition variable (I forget which one) and I put some breakpoints throughout. When those breakpoints were hit, I noticed that they were hit on a named thread running inside the mongo dll. After passing through the function a few times I stepped over the block where it waits for the condition variable and the debugger never returned to that function. The next thing to happen was my program returned from main and I hit a breakpoint inside the destructor of a static object in my code. At this point the only thread running in the program was the main thread. The static object calls `mongoc_client_pool_destroy`, which gets to `mongoc_cond_signal` and then hangs indefinitely (it ran for almost two hours before Jenkins killed it) in "[External Code]", presumably the Windows implementation of `WakeConditionVariable`. Side note: I could only reproduce this by debugging in my Jenkins VM. None of my devs' machines could reproduce the issue. This does not surprise me since the order of destruction of static and global objects is undefined. Presumably one could reproduce it by manually loading and unloading the mongo dll, then calling `mongoc_client_pool_destroy`. | |||||
| Comment by A. Jesse Jiryu Davis [ 01/Mar/19 ] | |||||
|
mongoc_cond_signal on Windows is an alias for WakeConditionVariable. I'd expect on Windows it acts the same as condition variables everywhere: it signals a condition variable but doesn't block waiting for another thread to do anything. Consider what might happen when you signal a condition variable for which no other thread is waiting! Can you say more about what diagnostic steps you took that led to your conclusion? | |||||
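The expectation stated above, that signaling a condition variable with no waiter simply returns, can be illustrated with a small portable sketch. Note the assumption: this uses the standard C++ `std::condition_variable` as an analogue, not the Windows `WakeConditionVariable` call itself, and it runs in a healthy process rather than during DLL teardown, so it does not reproduce the reported hang. The function names are made up for the example.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>

// Signals a condition variable that no thread is waiting on and reports
// how long the calls took, in milliseconds. Illustrative helper only.
long long notify_without_waiter_ms() {
    std::condition_variable cv;
    auto start = std::chrono::steady_clock::now();
    cv.notify_one();  // no waiter: nothing to wake, returns immediately
    cv.notify_all();  // same for a broadcast
    auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
}

// A generous bound: the point is only that the calls return at all.
void check_notify_returns_promptly() {
    assert(notify_without_waiter_ms() < 1000);
}
```

Under normal conditions the signal is a no-op when nobody is waiting, which is why a hang inside the signal call points at process state (for example, teardown after the waiting thread is gone) rather than at condition-variable semantics.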
| Comment by Michael Chadwick [ 01/Mar/19 ] | |||||
|
This issue can occur when
is called in the destructor of a static or global object in a dll that gets unloaded after the mongo dll has been unloaded. For whatever reason on Windows,
blocks waiting for the other thread to wake up, but the other thread doesn't exist and so it never wakes up. This is difficult to reproduce since destruction order of statics/globals is undefined. One possible way to reproduce it would be to manually load and unload the dlls via OS-level APIs. The client solution is to ensure that
is called before mongo is unloaded. It does seem like there should be some kind of refactor or fix on the mongo side to ensure that
doesn't block. Side note: it is
that was hanging, not the assignment of the enum on the next line. I don't know about other IDEs, but Visual Studio tends to show the debugger paused on the line that is about to execute. |