[DRIVERS-1968] Single-threaded SDAM should construct a handshake command when reconnecting after a monitoring or application error/timeout Created: 29/Oct/21  Updated: 30/Nov/21  Resolved: 30/Nov/21

Status: Closed
Project: Drivers
Component/s: Load Balancer, SDAM, Serverless
Fix Version/s: None

Type: Spec Change Priority: Unknown
Reporter: Jeremy Mikola Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to CDRIVER-4207 mongoc_topology_scanner_node_t.last_f... Closed
is related to PHPLIB-717 Test Serverless behind a load balance... Closed
Driver Changes: Not Needed

 Description   

Summary

While investigating CLOUDP-104364, it was noted that the Atlas serverless proxy only includes serviceId in its reply to the initial handshake, which it identifies by the presence of a client field in the hello command.
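The proxy behavior described above can be modeled roughly as follows. This is a minimal sketch, not Atlas or driver code. The functions `build_hello` and `proxy_hello_reply`, the placeholder driver metadata, and the string `service_id` (a real serviceId is an ObjectId) are all hypothetical. The only rule taken from this ticket is that serviceId appears in the reply only when the hello carries a client field.

```python
# Hypothetical model of the serverless proxy behavior; not real Atlas or driver code.
from typing import Any, Dict

Document = Dict[str, Any]


def build_hello(handshake: bool) -> Document:
    """Build a hello command; only the handshake variant carries client metadata."""
    cmd: Document = {"hello": 1}
    if handshake:
        # The client field is what marks this hello as an initial handshake.
        # The metadata values here are placeholders.
        cmd["client"] = {"driver": {"name": "example-driver", "version": "0.0.0"}}
    return cmd


def proxy_hello_reply(cmd: Document, service_id: str) -> Document:
    """Assumed proxy rule: report serviceId only when the hello has a client field."""
    reply: Document = {"ok": 1}
    if "client" in cmd:
        reply["serviceId"] = service_id
    return reply


if __name__ == "__main__":
    # A handshake hello gets a serviceId back; a plain hello does not.
    print("serviceId" in proxy_hello_reply(build_hello(True), "svc-1"))
    print("serviceId" in proxy_hello_reply(build_hello(False), "svc-1"))
```

Under this model, any reconnection that sends a plain hello instead of a handshake receives a reply without serviceId.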

This uncovered a subtle bug in libmongoc (CDRIVER-4207). Errors and timeouts encountered during monitoring typically cause libmongoc to construct a handshake command when reconnecting; errors and timeouts encountered during application use do not. In single-threaded SDAM, however, the monitoring and application sockets are one and the same.

Some suggestions for this ticket:

  • If not already noted in the SDAM spec, state that single-threaded implementations should treat monitoring and application errors alike for the purpose of sending a handshake hello during reconnection.
  • Consider documenting the Atlas serverless proxy behavior in a drivers specification. I realize we don't have a serverless spec (just Serverless Testing), but perhaps this warrants a note in either the SDAM or Handshake spec.
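The first suggestion can be sketched as a single shared connection that demands a fresh handshake after any network error or timeout, regardless of whether it came from monitoring or from an application operation. This is a hypothetical illustration, not libmongoc's implementation. The class `SingleThreadedNode`, its `needs_handshake` flag, and the `source` labels are invented for the example. In the bug described in CDRIVER-4207, the equivalent of `needs_handshake` would effectively only be reset for monitoring errors.

```python
# Hypothetical sketch of the suggested single-threaded reconnection rule.
from typing import Any, Dict

Document = Dict[str, Any]


class SingleThreadedNode:
    """One socket serves both monitoring and application operations."""

    def __init__(self) -> None:
        self.connected = False
        # A brand-new node has never handshaken, so it starts out needing one.
        self.needs_handshake = True

    def on_error(self, source: str) -> None:
        """Record a network error or timeout from either source."""
        if source not in ("monitoring", "application"):
            raise ValueError(f"unknown error source: {source}")
        # The socket is shared, so either kind of error invalidates it and the
        # next connection must start with a full handshake. The buggy behavior
        # would only set this flag when source == "monitoring".
        self.connected = False
        self.needs_handshake = True

    def next_hello(self) -> Document:
        """Build the hello to send next, (re)connecting if necessary."""
        cmd: Document = {"hello": 1}
        if self.needs_handshake:
            # Placeholder client metadata marks this hello as a handshake.
            cmd["client"] = {"driver": {"name": "example-driver", "version": "0.0.0"}}
            self.needs_handshake = False
        self.connected = True
        return cmd


if __name__ == "__main__":
    node = SingleThreadedNode()
    print("client" in node.next_hello())  # initial handshake
    print("client" in node.next_hello())  # routine check on the same socket
    node.on_error("application")
    print("client" in node.next_hello())  # reconnect after an application error
```

Keeping a single flag on the node, rather than tracking error sources separately, makes it impossible for one code path to skip the handshake that another path would have sent.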

Motivation

Who is the affected end user?

Single-threaded SDAM implementations.

How does this affect the end user?

Users connected to Atlas Serverless who encounter an application error or timeout may be blocked by client-side load balancer errors caused by a missing serviceId in subsequent hello responses.

How likely is it that this problem or use case will occur?

This may occur after any application error or timeout in a single-threaded driver connected to Atlas Serverless.

If the problem does occur, what are the consequences and how severe are they?

The application will likely be blocked for the lifetime of the MongoClient.

Is this issue urgent?

The spec change itself is not urgent and does not block libmongoc from fixing its bug.

Is this ticket required by a downstream team?

No.

Is this ticket only for tests?

No, but it does affect serverless testing, since our spec tests trigger socket errors via fail points (see PHPLIB-717).



 Comments   
Comment by Jeremy Mikola [ 30/Nov/21 ]

I spoke to kevin.albertson about this and CDRIVER-4207 is sufficient to get this resolved in libmongoc. There should be no need for any spec clarification.

Generated at Thu Feb 08 08:24:24 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.