[GODRIVER-1920] Aggregate can time out without setting socket options Created: 15/Mar/21 Updated: 30/Mar/22 |
|
| Status: | Backlog |
| Project: | Go Driver |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor - P4 |
| Reporter: | Benji Rewis (Inactive) | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Documentation Changes: | Not Needed |
| Description |
|
It seems that aggregate can occasionally cause a timeout with a tcp error like This operation should not timeout, as the default socket options are supposed to block indefinitely. Some timeout, either default or manually-set, is causing this failure. |
| Comments |
| Comment by Stephen White [ 15/Sep/21 ] |
|
An interesting point discovered in the https://jira.mongodb.org/browse/MHOUSE-2004 investigation, was that the macOS specific issue only happened when the VPN was active on corp issued laptops. It appears that the VPN may be filtering TCP keepalive packets. |
| Comment by Benji Rewis (Inactive) [ 19/Mar/21 ] |
|
Moving this back to Open, as the investigation did not yield anything productive. To summarize, all we've learned is that the early timeouts seem to happen only on MacOS when connected to the particular data lake listed in the original code. It may be an issue with MacOS TCP keepalive configuration on that data lake. |
| Comment by Benji Rewis (Inactive) [ 18/Mar/21 ] |
|
Hello david.golden ! We haven’t found a root cause yet in the Go driver. I’ve been able to reproduce the early timeout with the code you provided, but it’s unclear to me if/where that timeout is getting set in the Go driver. Hypothetically, a read timeout would be set here in readWireMessage, but from what I can tell, the only non-zero (a zero timeout means block infinitely) timeout that is set is 10s for heartbeats, which is not the cause of the larger timeout. I’ve been testing on MacOS, and I find it odd that Linux doesn’t have the same issue. That made me think it’s some complication with a Mac tcp keep alive configuration (see Mac tcp options here). The issue persists on Golang version 1.14.15, 1.15.15 and 1.16.2 (latest) which means that upgrading your Go version will probably not help. Interestingly, I cannot reproduce the error on a local replica set. I set a fail point to block an aggregate for 10 minutes, and the query still succeeds. I’ve seen similar issues in the TOOLS repo ( |
| Comment by David Golden [ 15/Mar/21 ] |
|
Notes on the situation that led me to ask on Slack:
I can provide credentials to the data lake if that's helpful for repro. |