[GODRIVER-2390] repeated listCollections & collStats causing connection disconnect Created: 22/Apr/22  Updated: 13/Jun/22  Resolved: 13/Jun/22

Status: Closed
Project: Go Driver
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Vladimir Isakov (Inactive) Assignee: Preston Vasquez
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongodb.log    
Issue Links:
Related

 Description   

Summary

The Cloud InTel team has a feature called Data Explorer which uses the automation agent to send commands down to the mongod. When this feature runs on an evergreen box with a standalone mongod 4.2 (mostly fails on a 4.2 standalone but once in a while also fails on newer versions of mongodb and even on a replica set) and an automation agent (which uses a go driver version v1.7.2-0.20210804202952-82d3467025bc) it ends up causing the agent to lose its connection to the mongod when it issues particular commands.

Specifically, we end up populating the mongod with many databases and collections via a js script, then the agent iterates through all of the dbs and for each db it calls listCollections on the mongod. Sometimes one of those listCollections ends up failing due to a closed connection. In the situations when the listCollections end up finishing however, we then take the first db in the list and we call collStats on each of its collections and one of those collStats calls sometimes ends up failing, also due to a closed connection. The failures occur 10 - 15% of the time that the test runs and most of those are failures on listCollections. The relevant automation agent and mongod logs are attached.

How to Reproduce

I have not been able to reproduce this locally. It seems to only fail on the evergreen boxes where we run e2e tests (which might imply this is OS specific or maybe it only occurs when resource usage on a box is high?). Here is an example of one of the failed evergreen e2e tests: https://spruce.mongodb.com/task/mms_e2e_local_E2E_Local_Monitoring_DataExplorer_Read_Write_patch_c8ef0b4e6e5e627d655dd527362adb8d05a7d02f_624b0c86e3c33127d0d804db_22_04_04_15_19_37/tests?execution=0&sortBy=STATUS&sortDir=ASC

The test doesn't spin up any Atlas boxes, it simply runs an agent and a standalone on the current evergreen box. 

Additional Background

Here is the relevant snippet from the automation agent log of that failing test where it shows the listCollections command failing due to a closed connection:

[2022-04-21T00:27:49.230+0000] [.debug] [src/mongoctl/processctl.go:RunCommandWithTimeout:1132] <myStandalone> [00:27:49.230] Starting RunCommand(dbName=admin, cmd=[\{isMaster 1}]) to ip-10-122-63-219.ec2.internal:20100 (local=false) ...
[2022-04-21T00:27:49.230+0000] [.debug] [src/mongoclientservice/mongoclientservice.go:createClient:1129] [00:27:49.230] Attempting to connect to connParams = ip-10-122-63-219.ec2.internal:20100 (local=false), connectMode = SingleConnect, deadline = 2022-04-21 00:28:29.230260107 +0000 UTC m=+142.651182497, omitAtmUsers = false identitiesToTry: ([<nil>]), tryTls order: ([false])
[2022-04-21T00:27:49.230+0000] [.debug] [src/mongoclientservice/mongoclientservice.go:createClient:1186] [00:27:49.230] Trying without TLS for ip-10-122-63-219.ec2.internal:20100 (local=false). identity <nil>
[2022-04-21T00:27:49.230+0000] [.error] [src/mongoctl/processctl.go:runListCollections:2881] [00:27:49.230] Error running ListCollections with filter map[] : connection(ip-10-122-63-219.ec2.internal:20100[-79]) incomplete read of message header: read tcp 10.122.63.219:54110->10.122.63.219:20100: use of closed network connection
[2022-04-21T00:27:49.230+0000] [.error] [src/mongoctl/processctl.go:func1:2833] [00:27:49.230] Error getting collections for db config : [00:27:49.230] Error running ListCollections with filter map[] : connection(ip-10-122-63-219.ec2.internal:20100[-79]) incomplete read of message header: read tcp 10.122.63.219:54110->10.122.63.219:20100: use of closed network connection
[2022-04-21T00:27:49.230+0000] [.debug] [src/cachingresolver/cachingresolver.go:virtualDnsServerMainLoop:72] <Virtual DNS Server 204> [00:27:49.230] Started Virtual DNS Server 204
[2022-04-21T00:27:49.230+0000] [.error] [src/mongoctl/processctl.go:getCollectionsHelper:2861] [00:27:49.230] Error getting collections for db config : [00:27:49.230] Error executing WithClientFor() for cp=ip-10-122-63-219.ec2.internal:20100 (local=false) connectMode=SingleConnect identityUsed=<nil> : [00:27:49.230] Error getting collections for db config : [00:27:49.230] Error running ListCollections with filter map[] : connection(ip-10-122-63-219.ec2.internal:20100[-79]) incomplete read of message header: read tcp 10.122.63.219:54110->10.122.63.219:20100: use of closed network connection

And here is the corresponding mongod log:

2022-04-21T00:27:49.230+0000 D2 COMMAND [conn81] run command config.$cmd { listCollections: 1, filter: {}, nameOnly: true, cursor: {}, lsid: { id: UUID("61e7ec1e-9a4f-4132-bfc1-908c938e3c31") }, $db: "config" }
2022-04-21T00:27:49.230+0000 D2 NETWORK [conn80] Session from 10.122.63.219:54108 encountered a network error during SourceMessage: HostUnreachable: Connection closed by peer
2022-04-21T00:27:49.230+0000 D2 NETWORK [conn79] Session from 10.122.63.219:54106 encountered a network error during SourceMessage: HostUnreachable: Connection closed by peer
2022-04-21T00:27:49.230+0000 I COMMAND [conn81] command config.$cmd appName: "MongoDB Automation Agent v12.0.0.7500 (git: 4b476d0c841cee05d8e870a4264ab1106c9ed903)" command: listCollections { listCollections: 1, filter: {}, nameOnly: true, cursor: {}, lsid: { id: UUID("61e7ec1e-9a4f-4132-bfc1-908c938e3c31") }, $db: "config" } numYields:0 reslen:171 locks:{ ParallelBatchWriterMode: { acquireCount:
{ r: 1 }
}, ReplicationStateTransition: { acquireCount:
{ w: 1 }
}, Global: { acquireCount:
{ r: 1 }
}, Database: { acquireCount:
{ r: 1 }
}, Collection: { acquireCount:
{ r: 2 }
}, Mutex: { acquireCount:
{ r: 1 }
} } protocol:op_msg 0ms
2022-04-21T00:27:49.230+0000 I NETWORK [conn79] end connection 10.122.63.219:54106 (16 connections now open)
2022-04-21T00:27:49.230+0000 I NETWORK [conn80] end connection 10.122.63.219:54108 (17 connections now open)
2022-04-21T00:27:49.230+0000 D3 NETWORK [conn79] Cancelling outstanding I/O operations on connection to 10.122.63.219:54106
2022-04-21T00:27:49.230+0000 D2 NETWORK [conn82] Session from 10.122.63.219:54120 encountered a network error during SourceMessage: HostUnreachable: Connection closed by peer
2022-04-21T00:27:49.230+0000 D3 NETWORK [conn80] Cancelling outstanding I/O operations on connection to 10.122.63.219:54108
2022-04-21T00:27:49.230+0000 I NETWORK [conn82] end connection 10.122.63.219:54120 (15 connections now open)
2022-04-21T00:27:49.230+0000 D3 NETWORK [conn82] Cancelling outstanding I/O operations on connection to 10.122.63.219:54120
2022-04-21T00:27:49.230+0000 D2 NETWORK [conn81] Session from 10.122.63.219:54110 encountered a network error during SourceMessage: HostUnreachable: Connection closed by peer
2022-04-21T00:27:49.230+0000 I NETWORK [conn81] end connection 10.122.63.219:54110 (14 connections now open)
2022-04-21T00:27:49.230+0000 D3 NETWORK [conn81] Cancelling outstanding I/O operations on connection to 10.122.63.219:54110

 

Here is a link to the code that is being called which ends up failing due to a disconnect:
https://github.com/10gen/mms-automation/blob/745524a911e35371fd2258183ec36adc2f1aaf41/go_planner/src/com.tengen/cm/mongoctl/processctl.go#L2878

 
I noticed this in the listCollections docs page:

Client Disconnection
Starting in MongoDB 4.2, if the client that issued listCollections disconnects before the operation completes, MongoDB marks listCollections for termination using killOp.

While it doesn't sound like this would cause a disconnect, it is odd that we are seeing these errors mostly on version 4.2 (and once in a while on a higher version, but never on earlier versions). Maybe before 4.2 the disconnect was always an issue but the driver would reconnect and still be able to get the results of the listCollections command whereas after 4.2 it can't?
 
This docs page also mentions a few changes that were made to collStats starting in version 4.2: https://www.mongodb.com/docs/manual/reference/command/collStats/



 Comments   
Comment by Preston Vasquez [ 13/Jun/22 ]

The Go Driver team is fairly confident that bumping to Version > 1.9 will resolve the disconnects when running the ListCollections operation. Closing with the understanding that a new ticket may be required if these issue persist after version bump.

Comment by Vladimir Isakov (Inactive) [ 07/Jun/22 ]

preston.vasquez@mongodb.com fyi... the Automation team will be upgrading the driver as part of this epic: CLOUDP-119739. It looks like they are scheduled to do it in the 3rd quarter. So it could be a little while before we find out if this works.

Comment by Preston Vasquez [ 31/May/22 ]

Waiting to hear back on if the Cloud InTel team upgrades to 1.9 and if that resolves the disconnection issue.

Comment by Preston Vasquez [ 27/May/22 ]

vladimir.isakov@mongodb.com For this particularly issue, if our hypothesis is correct, 1.9 should be good enough.

Comment by Vladimir Isakov (Inactive) [ 26/May/22 ]

preston.vasquez@mongodb.com I will have to check how difficult it would be for us to upgrade, but if we were to upgrade to version 1.9 (which is likely a bit more tried and true than the absolute latest 1.10.0), do you think that would be good enough?

Comment by Preston Vasquez [ 26/May/22 ]

vladimir.isakov@mongodb.com, thank you for the extremely detailed response. From the logs, we notice that the (*Database).ListCollections method is not retried after a connection failure. As Benji suggests, this is unexpected behavior as retryable reads are enabled by default. In the 1.7 version of the driver, we attempt to connect to the server here. This connection involves a series of handshakes that do not retry by default, and as you can see any errors in doing this simply get propagated, disconnecting the underlying network connection. In 1.9 we handle this in a more robust way, retrying the connection and even ignoring this branch altogether unless necessary. This logic was introduced in GODRIVER-1827 and is unlikely to be backported as the work is pretty involved. Version 1.10.0 will be released next week, and if it's not too much of an inconvenience we recommend upgrading to this version to see if these issues are resolved.

Comment by Vladimir Isakov (Inactive) [ 06/May/22 ]

When you say “many databases and collections”, how many are created in the E2E test? I’m assuming this issue only occurs when the listCollections/collStats commands are highly intensive operations server-side.
This is the script we run to populate the dbs/collections. I have also been noticing this same error when we run an aggregation count command to count the number of documents in a collection in that same test. And I think I've also seen it fail when we run the listIndexes command on a collection, also in that same test.

Secondly, is this a new error? Do different versions of the Go driver affect the presence of the bug?
It's hard for me to tell which version of the go driver has this problem, but I can tell that the test we run in evergreen on master, which uses the go driver go.mongodb.org/mongo-driver v1.7.2-0.20210804202952-82d3467025bc fails, while the tests we run on the 5.0 ops manager release in evergreen, which use the same go driver don't seem to be failing. The only difference I can see is that our e2e test on master gets run on an Amazon2-cloud-large box while the ops manager 5.0 e2e test gets run on a rhel76-cloud-large box.

Thirdly, does this error happen when connected to a non-standalone server? If you ran the E2E test with a replicaset, for example, do you still see the error?
This failure occurs with a replica set as well, but more often with a standalone.

Fourthly (apologies for the inundation of questions), looking at the mongod log, I’m surprised the listCollections command is not retried. Retryable reads are enabled by default on our Clients, so I would expect to see another listCollections command reaching the server. Is there another one in the server logs before the network error?
Luckily I still had the full mongod log. I just attached it to the ticket so that you can take a look.

Finally, is there any reason to believe goctx, the context you pass into ListCollections, would have an unexpected deadline?
As far as I can tell, the context is created this way:

background = new(emptyCtx)

where an empty context is defined like this:

// An emptyCtx is never canceled, has no values, and has no deadline. It is not
// struct{}, since vars of this type must have distinct addresses.
type emptyCtx int

I hope something in my responses help figure this out. I'm sorry I can't be of much more help here. This problem is difficult to reproduce. It only seems to happen on the evergreen boxes where we run these e2e tests. Which makes me think it might somehow be related to the particular OS version on those boxes or something else about how they are configured. Would it be possible for you to run your tests directly on one of those boxes? I believe evergreen lets you spawn one of those boxes and ssh into it if you click on the link to the e2e test that I pasted in the description. Not sure if that's the right move... just a hunch.

Comment by Benji Rewis (Inactive) [ 02/May/22 ]

Apologies for the late response, vladimir.isakov@mongodb.com.

This is a tough one. Like you, I’m unable to reproduce the error locally against 3.6, 4.2 or 5.0 even when I create ~1000 databases each with 100 collections. I have a few questions that might lead to potential avenues of investigation:

When you say “many databases and collections”, how many are created in the E2E test? I’m assuming this issue only occurs when the listCollections/collStats commands are highly intensive operations server-side.

Secondly, is this a new error? Do different versions of the Go driver affect the presence of the bug?

Thirdly, does this error happen when connected to a non-standalone server? If you ran the E2E test with a replicaset, for example, do you still see the error?

Fourthly (apologies for the inundation of questions), looking at the mongod log, I’m surprised the listCollections command is not retried. Retryable reads are enabled by default on our Clients, so I would expect to see another listCollections command reaching the server. Is there another one in the server logs before the network error?

Finally, is there any reason to believe goctx, the context you pass into ListCollections, would have an unexpected deadline?

Right now, the fact that MongoDB marks rogue listCollections for termination on 4.2+ seems like a red herring to me. I don’t think that would cause the initial client-side disconnect.

Comment by Benji Rewis (Inactive) [ 26/Apr/22 ]

Thanks for your report, vladimir.isakov@mongodb.com ! I'll take a look into this soon.

Generated at Thu Feb 08 08:38:28 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.