[GODRIVER-2438] Pool "closeConnection" method causes panics in removePerishedConns() Created: 27/May/22 Updated: 23/Jan/24 Resolved: 30/Nov/22 |
|
| Status: | Closed |
| Project: | Go Driver |
| Component/s: | None |
| Affects Version/s: | 1.8.4, 1.9.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Andrii Hrachov | Assignee: | Matt Dale |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Description |
Summary: Sometimes panics occur and the application is restarted. Driver versions: 1.8.4, 1.9.1. How to Reproduce: No clear way to reproduce; it happens from time to time on production deployments. Additional Background: Stack trace:
Solution: Handle panics inside the closeConnection function - Pull request |
| Comments |
| Comment by PM Bot [ 30/Nov/22 ] |
|
There hasn't been any recent activity on this ticket, so we're resolving it. Thanks for reaching out! Please feel free to comment on this if you're able to provide more information. |
| Comment by Matt Dale [ 16/Nov/22 ] |
|
andrii.hrachov@deliveryhero.com after further discussion with the team, we decided not to accept the proposed PR that catches panics when closing connections via a "defer recover" block. "Defer recover" blocks are difficult to remove once added, especially if we can't reproduce the panic scenario. Do you still encounter panics when closing connections? |
| Comment by Andrii Hrachov [ 30/Jun/22 ] |
|
Thanks a lot, Matt! I should open a bug ticket with the Go project now |
| Comment by Matt Dale [ 29/Jun/22 ] |
|
andrii.hrachov@deliveryhero.com I haven't been able to reproduce the exact panic condition you described, but it makes sense that if there is any possibility that closing a net.Conn can panic, we should recover that panic and return it as an error. I've opened PR 1007 that adds a defer recover block to connection.close(). I've also created PR 994 that should prevent the RTT monitor from getting stuck running an operation indefinitely. The original stack trace that you posted suggests the panic is actually coming from somewhere in Go's "netpoller", which could point to a bug in Go's network connection code or memory corruption caused by data races. It may be helpful to build and run your application with the data race detector enabled (go build -race) to check for memory corruption. There are also a few similar open issues on the github.com/golang/go repository, including one about segfault panics from Go apps running in Docker. You could consider opening a new issue on the Go project repo with a description of your problem, optionally linking to this ticket. If you do open an issue on the Go project, please add a link to that issue here as well! The two linked PRs are scheduled for release with Go Driver v1.10.0, which will be released in late July. |
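The "defer recover" approach discussed above can be sketched as follows. This is not the actual driver code from PR 1007; `safeClose` and the `panickyConn` stand-in are hypothetical names used only to illustrate how a panic raised while closing a `net.Conn` can be recovered and returned as an error.

```go
package main

import (
	"fmt"
	"net"
)

// panickyConn is a hypothetical stand-in for a connection whose
// Close panics, simulating the kind of failure reported in this ticket.
type panickyConn struct{ net.Conn }

func (panickyConn) Close() error { panic("simulated netpoller panic") }

// safeClose closes conn, converting any panic raised during Close
// into an ordinary error via a "defer recover" block.
func safeClose(conn net.Conn) (err error) {
	defer func() {
		if p := recover(); p != nil {
			err = fmt.Errorf("panic while closing connection: %v", p)
		}
	}()
	return conn.Close()
}

func main() {
	// The panic is caught and surfaced as an error instead of crashing.
	fmt.Println(safeClose(panickyConn{}))
}
```

As the later comments note, the trade-off is that such recover blocks can mask the underlying bug, which is why the team ultimately declined this approach.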
| Comment by Matt Dale [ 16/Jun/22 ] |
|
andrii.hrachov@deliveryhero.com that's really helpful information! I'm going to try to create a reproducible example of the panic conditions you described. I'm also evaluating our use of "defer recover" blocks in the Go Driver to see if there are more or different places they should be added. I'll respond again when I have more information. |
| Comment by Andrii Hrachov [ 01/Jun/22 ] |
|
So what I see: when connections are closed above the application level (OS, service mesh, etc.), the MongoDB driver panics when trying to close the connection or measure RTT |
| Comment by Andrii Hrachov [ 01/Jun/22 ] |
|
1) 1.9.1
``` clientOptions := options. ```
4) 1.18
5) I already lost the stack trace, but the problem was in the RTT monitor; my fix was:
``` } ```
The exact line that was panicking:
``` ).Execute(r.ctx) ```
More info from investigation: forking the driver and putting recovers in the connection pool background operations helped, but there were 3 more contributing factors: 1) Some requests were dangling forever (no context cancellation), so when the fd poll timed out, the driver panicked because the connection was already gone on the system side. Introducing an application-level timeout helped mitigate some of the problems, but not all of them |
| Comment by Matt Dale [ 31/May/22 ] |
|
Hey andrii.hrachov@deliveryhero.com thanks for the ticket! I have a few questions to help me identify the issue:
Thanks, |
| Comment by Andrii Hrachov [ 29/May/22 ] |
|
Actually, I forked the driver to fix that, and realised that another place also needs attention: the runHello function. So the ticket could be unified as "Handle panics in connection pool background operations" |