[GODRIVER-1540] library stuck in connection.wait() Created: 22/Mar/20  Updated: 28/Oct/23  Resolved: 27/Mar/20

Status: Closed
Project: Go Driver
Component/s: Connections
Affects Version/s: 1.3.0
Fix Version/s: 1.3.2

Type: Bug Priority: Major - P3
Reporter: Pierre Durand Assignee: Isabella Siu (Inactive)
Resolution: Fixed Votes: 0
Labels: planned-maintenance-detectable-bug
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Linux

 Description   

I've noticed new issues recently.

I don't know if it's related to the v1.3.0 update, but this never happened before.

Some MongoDB queries get stuck indefinitely.

It happens with different types of queries: run command, find.

Here is the common part of the stack trace:

go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*connection).wait(...)
	/home/travis/gopath/pkg/mod/go.mongodb.org/mongo-driver@v1.3.0/x/mongo/driver/topology/connection.go:181
go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*pool).closeConnection(0xc000410780, 0xc0008dce00, 0x0, 0xd7dd60)
	/home/travis/gopath/pkg/mod/go.mongodb.org/mongo-driver@v1.3.0/x/mongo/driver/topology/pool.go:417 +0x1f9
go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*connection).close(0xc0008dce00, 0xc000c741c8, 0xc00041d9d0)
	/home/travis/gopath/pkg/mod/go.mongodb.org/mongo-driver@v1.3.0/x/mongo/driver/topology/connection.go:302 +0x97
go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*connection).readWireMessage(0xc0008dce00, 0xd8aae0, 0xc000099740, 0xc0004d1400, 0x0, 0x200, 0xc0008dce00, 0xd8aae0, 0xc000099740, 0xc0004d1400, ...)
	/home/travis/gopath/pkg/mod/go.mongodb.org/mongo-driver@v1.3.0/x/mongo/driver/topology/connection.go:262 +0x4d8
go.mongodb.org/mongo-driver/x/mongo/driver/topology.initConnection.ReadWireMessage(0xc0008dce00, 0xd8aae0, 0xc000099740, 0xc0004d1400, 0x0, 0x200, 0xc0004d1400, 0x10e, 0x200, 0x0, ...)
	/home/travis/gopath/pkg/mod/go.mongodb.org/mongo-driver@v1.3.0/x/mongo/driver/topology/connection.go:351 +0x6a
go.mongodb.org/mongo-driver/x/mongo/driver.Operation.roundTrip(0xc000d74360, 0xba7b6c, 0x5, 0xd89b60, 0xc000d74370, 0xc000d74380, 0x0, 0x0, 0x0, 0x0, ...)
	/home/travis/gopath/pkg/mod/go.mongodb.org/mongo-driver@v1.3.0/x/mongo/driver/operation.go:552 +0x401
go.mongodb.org/mongo-driver/x/mongo/driver.Operation.Execute(0xc000d74360, 0xba7b6c, 0x5, 0xd89b60, 0xc000d74370, 0xc000d74380, 0x0, 0x0, 0x0, 0x0, ...)
	/home/travis/gopath/pkg/mod/go.mongodb.org/mongo-driver@v1.3.0/x/mongo/driver/operation.go:367 +0xc06
go.mongodb.org/mongo-driver/x/mongo/driver/operation.(*IsMaster).GetDescription(0xc000cda9a0, 0xd8aae0, 0xc000099740, 0xc0004243fa, 0x2b, 0xd8efa0, 0xc0008dce00, 0x0, 0x0, 0x0, ...)
	/home/travis/gopath/pkg/mod/go.mongodb.org/mongo-driver@v1.3.0/x/mongo/driver/operation/ismaster.go:418 +0x1ec
go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*connection).connect(0xc0008dce00, 0xd8aae0, 0xc000099740)
	/home/travis/gopath/pkg/mod/go.mongodb.org/mongo-driver@v1.3.0/x/mongo/driver/topology/connection.go:133 +0x2ac
go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*pool).get(0xc000410780, 0xd8aba0, 0xc0009ade90, 0x1, 0x0, 0x0)
	/home/travis/gopath/pkg/mod/go.mongodb.org/mongo-driver@v1.3.0/x/mongo/driver/topology/pool.go:380 +0x73c
go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*Server).Connection(0xc0008bc160, 0xd8aba0, 0xc0009ade90, 0x0, 0x0, 0x0, 0x0)
	/home/travis/gopath/pkg/mod/go.mongodb.org/mongo-driver@v1.3.0/x/mongo/driver/topology/server.go:243 +0x1ea
go.mongodb.org/mongo-driver/x/mongo/driver.Operation.Execute(0xc0012038d0, 0xc000424682, 0xe, 0xd88660, 0xc00028a000, 0xc0012038e0, 0xd7ed80, 0xc000279040, 0xc0000adf80, 0xc00027e220, ...)
	/home/travis/gopath/pkg/mod/go.mongodb.org/mongo-driver@v1.3.0/x/mongo/driver/operation.go:246 +0x207

I quickly read the source code, and I think there is a deadlock.

The connection.wait() call is waiting for a channel to be closed.

As far as I can tell, this channel is only closed by connection.connect(), in a deferred call.

But connection.connect() is already on the call stack, so the channel will never be closed.

That's why I think there is a deadlock.
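
To illustrate what I mean, here is a minimal, self-contained sketch of the pattern as I read it (names simplified; this is my reading of the driver code, not an actual excerpt):

package main

// Minimal sketch of the suspected self-deadlock.

type connection struct {
	connectDone chan struct{} // closed when connect() returns
}

// wait blocks until the handshake started by connect() has finished.
func (c *connection) wait() {
	<-c.connectDone
}

// close waits for the handshake to finish before releasing resources.
func (c *connection) close() {
	c.wait() // blocks forever if reached from within connect()
}

// connect runs the handshake; connectDone is only closed on return.
func (c *connection) connect() {
	defer close(c.connectDone)
	// The handshake hits a network error, so the driver tries to close
	// the connection while connect() is still on the stack:
	c.close() // deadlock: wait() needs connectDone, but this frame's
	// deferred close() can never run
}

func main() {
	c := &connection{connectDone: make(chan struct{})}
	c.connect() // fatal error: all goroutines are asleep - deadlock!
}

Running this reproduces the hang: wait() can only return once connect() returns, and connect() can only return once wait() returns.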

 Comments   
Comment by Githook User [ 27/Mar/20 ]

Author: iwysiu <isabella.siu@10gen.com>

Message: GODRIVER-1540 fix deadlock in connection (#348)
Branch: release/1.3
https://github.com/mongodb/mongo-go-driver/commit/df8f93d4d617f50b22740e6763f79727ca14a7bf

Comment by Githook User [ 27/Mar/20 ]

Author: iwysiu <isabella.siu@10gen.com>

Message: GODRIVER-1540 fix deadlock in connection (#348)
Branch: master
https://github.com/mongodb/mongo-go-driver/commit/a2fd8774390ccd522c3769ef494552832196ca23

Comment by Divjot Arora (Inactive) [ 26/Mar/20 ]

Hi pierrre,

Apologies for the trouble this has caused you. This was a regression introduced in the 1.3.0 release. The root cause is that a connection that encounters a network error during its initial handshake tries to close itself and delete its reference in the connection pool, but the pool waits for the handshake to complete before removing the connection. In 1.2.x, the pool did not wait for the handshake, but that allowed a data race between a connection being created and the pool being disconnected. We're fixing this by adding a new method to the connection pool that deletes the reference without waiting; the connection handshake path will call this new method instead.
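
To make that concrete, here is a rough sketch of the shape of the fix (simplified types; the new method's name below is illustrative, not necessarily the exact driver API):

package main

import "sync"

type connection struct {
	connectDone chan struct{} // closed when the handshake finishes
}

// wait blocks until the handshake has finished.
func (c *connection) wait() { <-c.connectDone }

type pool struct {
	mu    sync.Mutex
	conns map[*connection]struct{}
}

// closeConnection waits for the handshake before dropping the pool's
// reference: safe for normal teardown, but it deadlocks when invoked
// from the handshake path itself (the 1.3.0 behavior).
func (p *pool) closeConnection(c *connection) {
	c.wait()
	p.removeConnection(c)
}

// removeConnection drops the pool's reference without waiting; the
// handshake error path calls this instead (the fix).
func (p *pool) removeConnection(c *connection) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.conns, c)
}

func main() {
	p := &pool{conns: make(map[*connection]struct{})}
	c := &connection{connectDone: make(chan struct{})}
	p.conns[c] = struct{}{}
	p.removeConnection(c) // returns immediately, even mid-handshake
}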

The regression was introduced in GODRIVER-1411, so I believe rolling back to 1.2.x should fix your issues. The ETA for v1.3.2 is currently April 7th.

 

– Divjot

Comment by Pierre Durand [ 26/Mar/20 ]

Hello!

I have a few questions about this issue.

What is the root cause of this bug? It doesn't seem to happen randomly; it occurs more often when my MongoDB clusters are unstable.

Did this bug appear in v1.3.0? I've never detected it with v1.2.x. Should I roll back?

Is there an ETA for v1.3.2? Currently my applications are very unstable. I've configured a very awkward monitoring solution: periodically list the stack traces of the running goroutines, parse them, and detect the goroutines stuck in the deadlock. Then I have to restart my applications, which requires manual action.
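
For reference, the watchdog looks roughly like this (simplified; the frame name I match on comes from the stack trace above, and the interval and log-based alerting are placeholders):

package main

import (
	"bytes"
	"log"
	"runtime"
	"time"
)

// Rough sketch of my watchdog: periodically dump all goroutine stacks
// and flag any goroutine parked in connection.wait().
func main() {
	for range time.Tick(time.Minute) {
		buf := make([]byte, 1<<20)
		n := runtime.Stack(buf, true) // true = dump all goroutines
		for _, g := range bytes.Split(buf[:n], []byte("\n\n")) {
			if bytes.Contains(g, []byte("topology.(*connection).wait")) {
				log.Printf("possible deadlocked goroutine:\n%s", g)
				// ...then I restart the application by hand.
			}
		}
	}
}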
