[GODRIVER-1540] library stuck in connection.wait() Created: 22/Mar/20 Updated: 28/Oct/23 Resolved: 27/Mar/20 |
|
| Status: | Closed |
| Project: | Go Driver |
| Component/s: | Connections |
| Affects Version/s: | 1.3.0 |
| Fix Version/s: | 1.3.2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Pierre Durand | Assignee: | Isabella Siu (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | planned-maintenance-detectable-bug | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Linux |
||
| Description |
|
I've noticed new issues recently. I don't know if it's related to the v1.3.0 update, but it never happened before. Some MongoDB queries are stuck indefinitely. It happens for different types of queries: run command, find. Here is the common part of the stack trace:
I quickly read the source code, and I think there is a deadlock. The connection.wait() call waits for a channel to be closed. As far as I know, this channel is closed in only one place: a defer inside connection.connect(). But connect() is already on the call stack when wait() is called, so the deferred close can never run and the channel is never closed. That's why I think there is a deadlock.
|
| Comments |
| Comment by Githook User [ 27/Mar/20 ] |
|
Author: {'name': 'iwysiu', 'username': 'iwysiu', 'email': 'isabella.siu@10gen.com'} Message: |
| Comment by Githook User [ 27/Mar/20 ] |
|
Author: {'email': 'isabella.siu@10gen.com', 'name': 'iwysiu', 'username': 'iwysiu'} Message: |
| Comment by Divjot Arora (Inactive) [ 26/Mar/20 ] |
|
Hi pierrre, Apologies for the trouble this has caused you. This is a regression introduced in the 1.3.0 release. The root cause is that a connection that encounters a network error during its initial handshake tries to close itself and delete its reference in the connection pool, but the pool waits for the handshake to complete before deleting it. Because the handshake cannot complete until that close call returns, both sides wait on each other forever. In 1.2.x, the pool did not wait for the handshake, but that allowed a data race between a connection being created and the pool being disconnected. We're fixing this by adding a new method to the connection pool that deletes the reference without waiting. The connection handshake path will call this new method instead. The regression was introduced in
– Divjot |
| Comment by Pierre Durand [ 26/Mar/20 ] |
|
Hello! I have a few questions about this issue. What is the root cause of this bug? It doesn't seem to happen randomly: it occurs more often when my MongoDB clusters are unstable. Did this bug appear in v1.3.0? I never saw it with v1.2.x. Should I roll back? Is there an ETA for v1.3.2? Currently my applications are very unstable. I've set up a very awkward monitoring solution: periodically list the stack traces of running goroutines, parse them, and detect the goroutines that are deadlocked. Then I need to restart my applications, which requires manual action. |
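
A monitoring workaround like the one described in this comment could look roughly like the sketch below. It uses `runtime.Stack` from the standard library to dump every goroutine and counts the ones whose trace mentions a given frame. The helper names and the `stuckWait` function are hypothetical, and the exact frame text to search for in a real application depends on the driver version, so the search string is an assumption to adapt.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// allStacks returns the stack traces of every goroutine in the process,
// growing the buffer until the whole dump fits.
func allStacks() string {
	buf := make([]byte, 64<<10)
	for {
		n := runtime.Stack(buf, true)
		if n < len(buf) {
			return string(buf[:n])
		}
		buf = make([]byte, 2*len(buf))
	}
}

// countGoroutinesIn reports how many goroutines in dump have a trace that
// contains frame. Goroutines in the dump are separated by a blank line.
func countGoroutinesIn(dump, frame string) int {
	count := 0
	for _, g := range strings.Split(dump, "\n\n") {
		if strings.Contains(g, frame) {
			count++
		}
	}
	return count
}

// stuckWait is a hypothetical stand-in for a goroutine blocked in the
// driver's connection wait.
func stuckWait(started chan<- struct{}, release <-chan struct{}) {
	close(started)
	<-release
}

func main() {
	started, release := make(chan struct{}), make(chan struct{})
	go stuckWait(started, release)
	<-started

	fmt.Println("goroutines in stuckWait:", countGoroutinesIn(allStacks(), "stuckWait"))
	close(release)
}
```

A single match does not prove a deadlock, since a healthy goroutine can pass through the same frame briefly. Checking that the same goroutine ID stays in that frame across several consecutive samples is a more reliable signal.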