[GODRIVER-2390] repeated listCollections & collStats causing connection disconnect Created: 22/Apr/22 Updated: 13/Jun/22 Resolved: 13/Jun/22 |
|
| Status: | Closed |
| Project: | Go Driver |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Vladimir Isakov (Inactive) | Assignee: | Preston Vasquez |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||
| Issue Links: |
|
||||
| Description |
SummaryThe Cloud InTel team has a feature called Data Explorer which uses the automation agent to send commands down to the mongod. When this feature runs on an evergreen box with a standalone mongod 4.2 (mostly fails on a 4.2 standalone but once in a while also fails on newer versions of mongodb and even on a replica set) and an automation agent (which uses a go driver version v1.7.2-0.20210804202952-82d3467025bc) it ends up causing the agent to lose its connection to the mongod when it issues particular commands. Specifically, we end up populating the mongod with many databases and collections via a js script, then the agent iterates through all of the dbs and for each db it calls listCollections on the mongod. Sometimes one of those listCollections ends up failing due to a closed connection. In the situations when the listCollections end up finishing however, we then take the first db in the list and we call collStats on each of its collections and one of those collStats calls sometimes ends up failing, also due to a closed connection. The failures occur 10 - 15% of the time that the test runs and most of those are failures on listCollections. The relevant automation agent and mongod logs are attached. How to ReproduceI have not been able to reproduce this locally. It seems to only fail on the evergreen boxes where we run e2e tests (which might imply this is OS specific or maybe it only occurs when resource usage on a box is high?). Here is an example of one of the failed evergreen e2e tests: https://spruce.mongodb.com/task/mms_e2e_local_E2E_Local_Monitoring_DataExplorer_Read_Write_patch_c8ef0b4e6e5e627d655dd527362adb8d05a7d02f_624b0c86e3c33127d0d804db_22_04_04_15_19_37/tests?execution=0&sortBy=STATUS&sortDir=ASC The test doesn't spin up any Atlas boxes, it simply runs an agent and a standalone on the current evergreen box. Additional BackgroundHere is the relevant snippet from the automation agent log of that failing test where it shows the listCollections command failing due to a closed connection:
And here is the corresponding mongod log:
Here is a link to the code that is being called which ends up failing due to a disconnect:
While it doesn't sound like this would cause a disconnect, it is odd that we are seeing these errors mostly on version 4.2 (and once in a while on a higher version, but never on earlier versions). Maybe before 4.2 the disconnect was always an issue but the driver would reconnect and still be able to get the results of the listCollections command whereas after 4.2 it can't? |
| Comments |
| Comment by Preston Vasquez [ 13/Jun/22 ] | ||||
|
The Go Driver team is fairly confident that bumping to Version > 1.9 will resolve the disconnects when running the ListCollections operation. Closing with the understanding that a new ticket may be required if these issue persist after version bump. | ||||
| Comment by Vladimir Isakov (Inactive) [ 07/Jun/22 ] | ||||
|
preston.vasquez@mongodb.com fyi... the Automation team will be upgrading the driver as part of this epic: CLOUDP-119739. It looks like they are scheduled to do it in the 3rd quarter. So it could be a little while before we find out if this works. | ||||
| Comment by Preston Vasquez [ 31/May/22 ] | ||||
|
Waiting to hear back on if the Cloud InTel team upgrades to 1.9 and if that resolves the disconnection issue. | ||||
| Comment by Preston Vasquez [ 27/May/22 ] | ||||
|
vladimir.isakov@mongodb.com For this particularly issue, if our hypothesis is correct, 1.9 should be good enough. | ||||
| Comment by Vladimir Isakov (Inactive) [ 26/May/22 ] | ||||
|
preston.vasquez@mongodb.com I will have to check how difficult it would be for us to upgrade, but if we were to upgrade to version 1.9 (which is likely a bit more tried and true than the absolute latest 1.10.0), do you think that would be good enough? | ||||
| Comment by Preston Vasquez [ 26/May/22 ] | ||||
|
vladimir.isakov@mongodb.com, thank you for the extremely detailed response. From the logs, we notice that the (*Database).ListCollections method is not retried after a connection failure. As Benji suggests, this is unexpected behavior as retryable reads are enabled by default. In the 1.7 version of the driver, we attempt to connect to the server here. This connection involves a series of handshakes that do not retry by default, and as you can see any errors in doing this simply get propagated, disconnecting the underlying network connection. In 1.9 we handle this in a more robust way, retrying the connection and even ignoring this branch altogether unless necessary. This logic was introduced in GODRIVER-1827 and is unlikely to be backported as the work is pretty involved. Version 1.10.0 will be released next week, and if it's not too much of an inconvenience we recommend upgrading to this version to see if these issues are resolved. | ||||
| Comment by Vladimir Isakov (Inactive) [ 06/May/22 ] | ||||
|
When you say “many databases and collections”, how many are created in the E2E test? I’m assuming this issue only occurs when the listCollections/collStats commands are highly intensive operations server-side. Secondly, is this a new error? Do different versions of the Go driver affect the presence of the bug? Thirdly, does this error happen when connected to a non-standalone server? If you ran the E2E test with a replicaset, for example, do you still see the error? Fourthly (apologies for the inundation of questions), looking at the mongod log, I’m surprised the listCollections command is not retried. Retryable reads are enabled by default on our Clients, so I would expect to see another listCollections command reaching the server. Is there another one in the server logs before the network error? Finally, is there any reason to believe goctx, the context you pass into ListCollections, would have an unexpected deadline?
where an empty context is defined like this:
I hope something in my responses help figure this out. I'm sorry I can't be of much more help here. This problem is difficult to reproduce. It only seems to happen on the evergreen boxes where we run these e2e tests. Which makes me think it might somehow be related to the particular OS version on those boxes or something else about how they are configured. Would it be possible for you to run your tests directly on one of those boxes? I believe evergreen lets you spawn one of those boxes and ssh into it if you click on the link to the e2e test that I pasted in the description. Not sure if that's the right move... just a hunch. | ||||
| Comment by Benji Rewis (Inactive) [ 02/May/22 ] | ||||
|
Apologies for the late response, vladimir.isakov@mongodb.com. This is a tough one. Like you, I’m unable to reproduce the error locally against 3.6, 4.2 or 5.0 even when I create ~1000 databases each with 100 collections. I have a few questions that might lead to potential avenues of investigation: When you say “many databases and collections”, how many are created in the E2E test? I’m assuming this issue only occurs when the listCollections/collStats commands are highly intensive operations server-side. Secondly, is this a new error? Do different versions of the Go driver affect the presence of the bug? Thirdly, does this error happen when connected to a non-standalone server? If you ran the E2E test with a replicaset, for example, do you still see the error? Fourthly (apologies for the inundation of questions), looking at the mongod log, I’m surprised the listCollections command is not retried. Retryable reads are enabled by default on our Clients, so I would expect to see another listCollections command reaching the server. Is there another one in the server logs before the network error? Finally, is there any reason to believe goctx, the context you pass into ListCollections, would have an unexpected deadline? Right now, the fact that MongoDB marks rogue listCollections for termination on 4.2+ seems like a red herring to me. I don’t think that would cause the initial client-side disconnect. | ||||
| Comment by Benji Rewis (Inactive) [ 26/Apr/22 ] | ||||
|
Thanks for your report, vladimir.isakov@mongodb.com ! I'll take a look into this soon. |