[GODRIVER-1920] Aggregate can time out without setting socket options Created: 15/Mar/21  Updated: 30/Mar/22

Status: Backlog
Project: Go Driver
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Benji Rewis (Inactive) Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Documentation Changes: Not Needed

 Description   

It seems that Aggregate can occasionally time out with a TCP error like
unable to decode message length: read tcp 10.1.129.128:65170->3.216.165.122:27017: read: operation timed out
even when no options related to timeouts have been set.

This operation should not time out, as the default socket options are supposed to block indefinitely. Some timeout, either a default or one set manually, is causing this failure.
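A minimal sketch of the kind of call that hits this, using only default client options (no socket timeout, no maxTimeMS, no context deadline). The URI, namespace, and pipeline are placeholders; the actual repro code is linked in the comments below.

package main

import (
	"context"
	"fmt"
	"log"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx := context.Background()

	// Defaults only: no SocketTimeout, no server-side maxTimeMS, no context deadline,
	// so the read on the aggregate response should block until the server replies.
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://<data-lake-host>:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	coll := client.Database("test").Collection("events")

	// Placeholder pipeline standing in for the long-running aggregation.
	cur, err := coll.Aggregate(ctx, mongo.Pipeline{bson.D{{Key: "$match", Value: bson.D{}}}})
	if err != nil {
		// This is where the "unable to decode message length: ... operation timed out"
		// error surfaces.
		log.Fatal(err)
	}
	defer cur.Close(ctx)

	var results []bson.M
	if err := cur.All(ctx, &results); err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(results), "documents")
}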



 Comments   
Comment by Stephen White [ 15/Sep/21 ]

An interesting point discovered in the https://jira.mongodb.org/browse/MHOUSE-2004 investigation was that the macOS-specific issue only happened when the VPN was active on corp-issued laptops. It appears that the VPN may be filtering TCP keepalive packets.
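One way to experiment with that hypothesis is to hand the driver a custom dialer with an explicit keepalive period and see whether the behavior changes. This is a diagnostic sketch (the 30s values are arbitrary), not a confirmed fix.

package diag

import (
	"context"
	"net"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// connectWithKeepAlive builds a client whose connections use an explicit TCP
// keepalive period instead of the driver's default, purely to test whether
// keepalive behavior is related to the early timeouts.
func connectWithKeepAlive(uri string, keepAlive time.Duration) (*mongo.Client, error) {
	dialer := &net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: keepAlive, // e.g. 30 * time.Second
	}
	opts := options.Client().ApplyURI(uri).SetDialer(dialer)
	return mongo.Connect(context.Background(), opts)
}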

Comment by Benji Rewis (Inactive) [ 19/Mar/21 ]

Moving this back to Open, as the investigation did not yield anything productive. To summarize, all we've learned is that the early timeouts seem to happen only on macOS when connected to the particular data lake listed in the original code. It may be an issue with the macOS TCP keepalive configuration when connecting to that data lake.

Comment by Benji Rewis (Inactive) [ 18/Mar/21 ]

Hello david.golden !

We haven’t found a root cause yet in the Go driver. I’ve been able to reproduce the early timeout with the code you provided, but it’s unclear to me if/where that timeout is getting set in the Go driver.

Hypothetically, a read timeout would be set here in readWireMessage, but from what I can tell, the only non-zero timeout that is set (a zero timeout means block indefinitely) is the 10s heartbeat timeout, which is not the cause of this much larger timeout.
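For reference, this is the general net.Conn pattern being described (not the driver's actual readWireMessage code): a zero deadline means reads block indefinitely, and a non-zero deadline is what produces "operation timed out" style read errors.

package wiremsg

import (
	"net"
	"time"
)

// readWithDeadline illustrates the zero-means-no-deadline convention: only a
// non-zero readTimeout installs a read deadline on the connection.
func readWithDeadline(conn net.Conn, readTimeout time.Duration, buf []byte) (int, error) {
	var deadline time.Time // zero value: no deadline, reads block indefinitely
	if readTimeout != 0 {
		deadline = time.Now().Add(readTimeout)
	}
	if err := conn.SetReadDeadline(deadline); err != nil {
		return 0, err
	}
	return conn.Read(buf)
}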

I’ve been testing on macOS, and I find it odd that Linux doesn’t have the same issue. That made me think it’s some complication with the macOS TCP keepalive configuration (see the Mac TCP options here).
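To see what those macOS keepalive defaults actually are, something like the following can dump the relevant sysctls. The net.inet.tcp.* key names are my assumption of the usual Darwin ones; keepidle/keepintvl are in milliseconds, keepcnt is a probe count. macOS-only sketch.

package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	for _, key := range []string{
		"net.inet.tcp.keepidle",  // idle time before the first keepalive probe (ms)
		"net.inet.tcp.keepintvl", // interval between probes (ms)
		"net.inet.tcp.keepcnt",   // unanswered probes before the connection is dropped
	} {
		v, err := unix.SysctlUint32(key)
		if err != nil {
			log.Fatalf("%s: %v", key, err)
		}
		fmt.Printf("%s = %d\n", key, v)
	}
}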

The issue persists on Go versions 1.14.15, 1.15.15, and 1.16.2 (the latest), which means that upgrading your Go version will probably not help.

Interestingly, I cannot reproduce the error on a local replica set. I set a fail point to block an aggregate for 10 minutes, and the query still succeeds.
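The failpoint setup was along these lines (failCommand with blockConnection needs a server started with enableTestCommands and, I believe, 4.2.9+; the helper name is a placeholder):

package diag

import (
	"context"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// blockAggregate turns on a failCommand failpoint that holds every aggregate
// command for 10 minutes before responding.
func blockAggregate(ctx context.Context, client *mongo.Client) error {
	cmd := bson.D{
		{Key: "configureFailPoint", Value: "failCommand"},
		{Key: "mode", Value: "alwaysOn"},
		{Key: "data", Value: bson.D{
			{Key: "failCommands", Value: bson.A{"aggregate"}},
			{Key: "blockConnection", Value: true},
			{Key: "blockTimeMS", Value: 600000}, // 10 minutes
		}},
	}
	return client.Database("admin").RunCommand(ctx, cmd).Err()
}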

I’ve seen similar issues in the TOOLS repo (TOOLS-638), but they very clearly involved sessions/transactions, which this issue does not.

Comment by David Golden [ 15/Mar/21 ]

Notes on the situation that led me to ask on Slack:

  • Client connected to an Atlas Data Lake in the Cloud staging environment
  • Driver version was v1.1.3. This was not intentional; somehow this is what Go resolved in my environment. I'll recheck with a newer version
    • Same result with v1.5.0
  • Code is here
  • I don't know how long it took before the timeout. I'll add instrumentation and recheck (a rough sketch of that instrumentation appears after this list).
    • Timeout occurs around 5min 25-27sec (note: no pre-warmed pool, so variability could be related to connection/handshake time)
  • This was on macOS Big Sur (Intel), Go 1.14.15; I'll try on Linux, too.
    • On Linux (Go 1.14.15), I did not see timeouts. All queries eventually got responses over 27 minutes.
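The timing instrumentation mentioned above was roughly of this shape (function and variable names are placeholders for the linked repro code):

package diag

import (
	"context"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// timedAggregate runs one aggregation and logs how long it took to either
// fail (the ~5min25s timeout) or return its results.
func timedAggregate(ctx context.Context, coll *mongo.Collection, pipeline mongo.Pipeline) error {
	start := time.Now()
	cur, err := coll.Aggregate(ctx, pipeline)
	if err != nil {
		log.Printf("aggregate failed after %s: %v", time.Since(start), err)
		return err
	}
	defer cur.Close(ctx)

	var docs []bson.M
	if err := cur.All(ctx, &docs); err != nil {
		log.Printf("cursor drain failed after %s: %v", time.Since(start), err)
		return err
	}
	log.Printf("aggregate returned %d documents in %s", len(docs), time.Since(start))
	return nil
}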

I can provide credentials to the data lake if that's helpful for repro.
