[DRIVERS-2740] Add support for polling SRV records for mongod discovery Created: 03/Oct/23  Updated: 04/Oct/23

Status: Backlog
Project: Drivers
Component/s: SRV Polling
Fix Version/s: None

Type: New Feature Priority: Unknown
Reporter: Bob Tiernay Assignee: Unassigned
Resolution: Unresolved Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to DRIVERS-910 Allow MongoClient to automatically tr... Backlog
is related to DRIVERS-561 Support polling SRV records for mongo... Closed
Driver Changes: Needed

 Description   

Summary

As mentioned in DRIVERS-561, the SRV polling specification currently disallows non-sharded clusters from participating in dynamic reconfiguration of the client's topology on SRV record changes:

This feature is only available when the Server Discovery has determined that the TopologyType is Sharded, or Unknown. Drivers MUST NOT rescan SRV DNS records when the Topology is not Sharded (i.e. Single, ReplicaSetNoPrimary, or ReplicaSetWithPrimary).
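The gate the spec describes can be expressed as a small predicate. This is purely an illustrative sketch (the function name is hypothetical, not driver API); the topology type names are taken verbatim from the quoted spec text:

```javascript
// Illustrative sketch of the spec's gate: drivers rescan SRV DNS records
// only for Sharded or Unknown topologies, never for Single or replica sets.
function shouldRescanSrv(topologyType) {
  return topologyType === "Sharded" || topologyType === "Unknown";
}
```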

In practice, we've hit several issues when using MongoDB Atlas with a non-sharded cluster behind PrivateLink endpoints. On occasion, the private endpoints need to be updated or changed. When this happens, many of our NodeJS applications start failing and never recover due to the aforementioned limitation. The only solution is to restart the application, which reduces availability and introduces operational overhead.

Ideally, non-sharded clusters could benefit from the same dynamic updates that sharded clusters get from polling SRV records for changes.

 

Motivation

Who is the affected end user?

Users who use non-sharded clusters and who make changes to SRV records while applications are running.

How does this affect the end user?

Are they blocked? Are they annoyed? Are they confused?

Loss of application availability and increased operational effort to resolve.

How likely is it that this problem or use case will occur?

The likelihood is reasonably high for users with many environments that change their infrastructure or perform migrations frequently.

If the problem does occur, what are the consequences and how severe are they?

The failure is very difficult to debug, as the cause is not apparent in the logs; troubleshooting required consultation with MongoDB Atlas experts.

Is this issue urgent?

Not urgent, but highly desired.

Is this ticket required by a downstream team?

No

Is this ticket only for tests?

No

Acceptance Criteria

SRV polling and resolution semantics are available for non-sharded clusters.

The following is an (untested) example of how this might be achieved today using the NodeJS native driver:

 
// Assumption: `mongodb` NodeJS driver; `client.s.srvPoller` and
// `client.s.detectSrvRecords` are internal, unstable APIs
const { MongoClient } = require("mongodb");
// Internal module path; subject to change between releases
const { SrvPoller } = require("mongodb/lib/sdam/srv_polling");

// Create a client with an SRV+seedlist connection string
const uri = "mongodb+srv://<prefix>.mongodb.net/auth0?retryWrites=false&appName=server";
const client = new MongoClient(uri);

// Connect to the MongoDB cluster
await client.connect();

// As a workaround for https://github.com/mongodb/specifications/blob/master/source/polling-srv-records-for-mongos-discovery/polling-srv-records-for-mongos-discovery.rst#implementation
// See https://jira.mongodb.org/browse/DRIVERS-561
const srvPoller = client.s.srvPoller;

// May be undefined for a non-SRV-based config, so we shouldn't assume it exists
if (srvPoller) {
  // Reproduce what happens for sharded clusters in Topology#detectShardedTopology
  srvPoller.on(SrvPoller.SRV_RECORD_DISCOVERY, client.s.detectSrvRecords);
  srvPoller.start();
}

However, as noted, this relies on internal implementation details that may change in subsequent releases. Thus, first-class support is preferred.



 Comments   
Comment by Alex Bevilacqua [ 04/Oct/23 ]

Got it. What, if anything, is the downside to performing this in addition to the natural SDAM behavior for replica sets?

For multithreaded drivers there would be a monitoring thread for the SRV record and a separate monitoring thread for SDAM, which could conceivably result in a race condition between monitors.

Comment by Bob Tiernay [ 04/Oct/23 ]

Got it. What, if anything, is the downside to performing this in addition to the natural SDAM behavior for replica sets?

Comment by Alex Bevilacqua [ 04/Oct/23 ]

Correct, but SRV resolution only happens on initial discovery. After that it never happens again. How is the lack of need for this capability different from that for mongos rediscovery?

SRV resolution for replica sets provides the initial seed list; however, server discovery is performed as part of steady-state monitoring. For example, if a 3-member replica set were to add 2 members, a subsequent monitoring event (heartbeat) would include that information automatically, and drivers can update their representation of the cluster topology based on it.

There is no comparable mechanism for drivers to discover changes to the set of available mongos instances, which is why polling SRV was scoped to sharded clusters only.

Comment by Bob Tiernay [ 04/Oct/23 ]

Polling SRV for mongos discovery was scoped exclusively to sharded clusters as replica sets already had a discovery mechanism within the drivers in the form of server discovery and monitoring.

Correct, but SRV resolution only happens on initial discovery. After that it never happens again. How is the lack of need for this capability different from that for mongos rediscovery?

Comment by Alex Bevilacqua [ 04/Oct/23 ]

Hi rtiernay@gmail.com, and thanks for opening this ticket.

One thing I failed to ask in the ticket description is why this limitation exists in the first place. I assume there was some rationale, but that is not reflected in any specification or Jira that I can find. At a minimum, updating these documents to reflect that rationale would be greatly appreciated for those who are in a similar situation.

Polling SRV for mongos discovery was scoped exclusively to sharded clusters as replica sets already had a discovery mechanism within the drivers in the form of server discovery and monitoring.

On occasion, the private endpoints need to be updated or changed. When this happens, many of our NodeJS applications start failing and never recover due to the aforementioned limitations.

The issue you've described is an edge case arising from how private endpoints are managed, how replica set reconfiguration works, and how server discovery/monitoring works. I'm looking into this in greater detail to determine how best to proceed.

Comment by Bob Tiernay [ 03/Oct/23 ]

One thing I failed to ask in the ticket description is why this limitation exists in the first place. I assume there was some rationale, but that is not reflected in any specification or Jira that I can find. At a minimum, updating these documents to reflect that rationale would be greatly appreciated for those who are in a similar situation.

Comment by Bob Tiernay [ 03/Oct/23 ]

Please note that this Jira may assist with https://jira.mongodb.org/browse/DRIVERS-910 for clusters in transition to Mongo 8, which only supports sharded clusters.

Generated at Thu Feb 08 08:26:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.