[GODRIVER-2438] Pool "closeConnection" method causes panics in removePerishedConns() Created: 27/May/22  Updated: 23/Jan/24  Resolved: 30/Nov/22

Status: Closed
Project: Go Driver
Component/s: None
Affects Version/s: 1.8.4, 1.9.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Andrii Hrachov Assignee: Matt Dale
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by GODRIVER-2467 Return a Context error if WithTransac... Closed
Related
related to GODRIVER-2432 Improve panic handling in background ... Closed
related to GODRIVER-2464 Add timeout for RTT monitor "hello" o... Closed

 Description   

Summary

Sometimes panics occur and the application is restarted.

Driver versions -> 1.8.4, 1.9.1

How to Reproduce

No clear way to reproduce; it happens from time to time on production deployments.

Additional Background

Stack trace:

[signal SIGSEGV: segmentation violation code=0x1 addr=0x2 pc=0x46365e]

goroutine 223004 [running]:
internal/poll.runtime_pollSetDeadline(0x2, 0x12a05f0f3, 0x77)
	/usr/local/go/src/runtime/netpoll.go:324 +0x3e
internal/poll.setDeadlineImpl(0xc0022c38c0, {0x2?, 0x0?, 0x1596c00?}, 0x4c17b6?)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:160 +0x1be
internal/poll.(*FD).SetWriteDeadline(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:142
net.(*netFD).SetWriteDeadline(...)
	/usr/local/go/src/net/fd_posix.go:146
net.(*conn).SetWriteDeadline(0xc00000ed10, {0x51b5b106590?, 0x1596c00?, 0x1596c00?})
	/usr/local/go/src/net/net.go:261 +0x48
crypto/tls.(*Conn).SetWriteDeadline(...)
	/usr/local/go/src/crypto/tls/conn.go:151
crypto/tls.(*Conn).closeNotify(0xc00162ea80)
	/usr/local/go/src/crypto/tls/conn.go:1368 +0xdb
crypto/tls.(*Conn).Close(0xc00080d701?)
	/usr/local/go/src/crypto/tls/conn.go:1338 +0x69
go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*connection).close(0xc0004db400?)
	/app/vendor/go.mongodb.org/mongo-driver/x/mongo/driver/topology/connection.go:499 +0x43
go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*pool).closeConnection(0xc000714a80?, 0xc0004db400)
	/app/vendor/go.mongodb.org/mongo-driver/x/mongo/driver/topology/pool.go:400 +0x75
go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*pool).removePerishedConns.func1()
	/app/vendor/go.mongodb.org/mongo-driver/x/mongo/driver/topology/pool.go:773 +0x25
created by go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*pool).removePerishedConns
	/app/vendor/go.mongodb.org/mongo-driver/x/mongo/driver/topology/pool.go:772 +0x198

Solution:

Handle panics inside closeConnection function - Pull request



 Comments   
Comment by PM Bot [ 30/Nov/22 ]

There hasn't been any recent activity on this ticket, so we're resolving it. Thanks for reaching out! Please feel free to comment on this if you're able to provide more information.

Comment by Matt Dale [ 16/Nov/22 ]

andrii.hrachov@deliveryhero.com after further discussion with the team, we decided to not accept the proposed PR that catches panics when closing connections via a "defer recover" block. "Defer recover" blocks are difficult to remove once added, especially if we can't reproduce the panic scenario.

Do you still encounter panics when closing connections?

Comment by Andrii Hrachov [ 30/Jun/22 ]

Thanks a lot, Matt!

I should open a bug ticket with the Go project now.

Comment by Matt Dale [ 29/Jun/22 ]

andrii.hrachov@deliveryhero.com I haven't been able to reproduce the exact panic condition you described, but it makes sense that if there is any possibility that closing a net.Conn can panic, we should recover that panic and return it as an error. I've opened PR 1007 that adds a defer recover block to connection.close(). I've also created PR 994 that should prevent the RTT monitor from getting stuck running an operation indefinitely.
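
For illustration, a minimal sketch of the recover-to-error pattern described above; the `conn` type here is a hypothetical stand-in, not the driver's actual internal connection type:

```go
package main

import (
	"fmt"
	"net"
)

// conn is a hypothetical stand-in for the driver's internal connection type.
type conn struct {
	nc net.Conn
}

// close recovers any panic raised while closing the underlying net.Conn
// and returns it as an ordinary error instead of crashing the process.
func (c *conn) close() (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic while closing connection: %v", r)
		}
	}()
	if c.nc == nil {
		return nil
	}
	return c.nc.Close()
}

func main() {
	client, server := net.Pipe()
	defer server.Close()

	c := &conn{nc: client}
	if err := c.close(); err != nil {
		fmt.Println("close error:", err)
	} else {
		fmt.Println("closed cleanly")
	}
}
```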

The original stack trace that you posted suggests the panic is actually coming from somewhere in Go's "netpoller", which could point to a bug in Go's network connection code or memory corruption caused by data races. It may be helpful to build and run your application with the data race detector enabled (go build -race) to check for memory corruption. There are also a few similar open issues on the github.com/golang/go repository, including one about segfault panics from Go apps running in Docker. You could consider opening a new issue on the Go project repo with a description of your problem, optionally linking to this ticket. If you do open an issue on the Go project, please add a link to that issue here as well!
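
For reference, a trivial standalone program with the kind of unsynchronized access the race detector flags; this is purely illustrative and unrelated to the driver code:

```go
package main

import "fmt"

func main() {
	counter := 0
	done := make(chan struct{})

	// Two goroutines increment counter without synchronization; building or
	// running this with the -race flag (go run -race main.go) reports the race.
	go func() {
		counter++
		close(done)
	}()
	counter++
	<-done

	fmt.Println("counter:", counter)
}
```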

The two linked PRs are scheduled for release with Go Driver v1.10.0, which will be released in late July.

Comment by Matt Dale [ 16/Jun/22 ]

andrii.hrachov@deliveryhero.com that's really helpful information! I'm going to try to create a reproducible example of the panic conditions you described. I'm also evaluating our use of "defer recover" blocks in the Go Driver to see if there are more or different places they should be added. I'll respond again when I have more information.

Comment by Andrii Hrachov [ 01/Jun/22 ]

So what I see:

When connections are closed above the application level (OS, mesh, etc.), the MongoDB driver panics when trying to close the connection or measure RTT.

Comment by Andrii Hrachov [ 01/Jun/22 ]

1) 1.9.1
2) 4.2
3) 

```
clientOptions := options.Client().
	SetRegistry(m.registry).
	SetWriteConcern(writeconcern.New(writeconcern.WMajority())).
	SetReadPreference(readPreference.Primary()).
	SetDisableOCSPEndpointCheck(true).
	SetServerSelectionTimeout(120 * time.Second).
	SetMaxPoolSize(100).
	SetMinPoolSize(20).
	SetMaxConnIdleTime(60 * time.Second)
```

4) 1.18

5) I already lost the stack trace, but the problem was in the RTT monitor. My fix was:

```
func (r *rttMonitor) runHello(conn *connection) {
	defer func() {
		if r := recover(); r != nil {
			if conn != nil && !conn.closed() {
				_ = conn.close()
			}
		}
	}()
	// ...
}
```

The exact line that was panicking:

```
err := r.cfg.createOperationFn(initConnection{conn}).Execute(r.ctx)
```

More info from investigation:

Forking the driver and putting recovers in the connection pool background operations helped, but there were three more contributing factors:

1) Some requests were dangling forever (no context cancellation), so when the fd poll timed out, the driver panicked because the connection was already gone on the system side. Introducing an app-level timeout helped mitigate some of the problems, but not all of them (see the sketch after this list).
2) We were using Istio for everything, and it gave sporadic TLS connection failures; this also caused connections to be tampered with, which likewise resulted in panics on the driver side (when it tried to measure RTT or close the connection).
3) The connection pool idle timeout was not set, so connections were closed by the OS, and the driver also panicked when trying to reap them.
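
On factor 1, a minimal sketch of bounding each driver operation with a context deadline so a request can never dangle forever; the URI, database, and collection names are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	// Bound the initial connect.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		panic(err)
	}
	defer client.Disconnect(context.Background())

	// Give every operation its own deadline so it cannot outlive a socket
	// that the OS or service mesh has already torn down.
	opCtx, opCancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer opCancel()

	var doc bson.M
	err = client.Database("app").Collection("items").
		FindOne(opCtx, bson.M{"_id": "example"}).Decode(&doc)
	fmt.Println(doc, err)
}
```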

Comment by Matt Dale [ 31/May/22 ]

Hey andrii.hrachov@deliveryhero.com thanks for the ticket! I have a few questions to help me identify the issue:

  1. What version of the Go Driver were you using when you captured the stack trace you provided?
  2. What version of the MongoDB server are you using?
  3. Can you post a sample of the MongoDB Client configuration you're using?
  4. What version of Go are you using?
  5. You mention you also get a panic from runHello. Can you post a stack trace of that panic as well?

Thanks,
Matt

Comment by Andrii Hrachov [ 29/May/22 ]

Actually I forked the driver to fix that, and realised that another place also needs attention: the runHello function. So the ticket could be unified as "Handle panics in connection pool background operations".
