[DRIVERS-2740] Add support for polling SRV records for mongod discovery Created: 03/Oct/23 Updated: 04/Oct/23 |
|
| Status: | Backlog |
| Project: | Drivers |
| Component/s: | SRV Polling |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Unknown |
| Reporter: | Bob Tiernay | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Driver Changes: | Needed | ||||||||||||
| Description |
Summary
As mentioned in
In practice, we've hit a few issues wherein we are using MongoDB Atlas with a non-sharded cluster, behind PrivateLink endpoints. On occasion, the private endpoints need to be updated or changed. When this happens, many of our NodeJS applications start failing and never recover due to the aforementioned limitations. The only solution is to restart the application, which reduces availability and introduces operational overhead to resolve. Ideally, non-sharded clusters could benefit from the current dynamic updates from polling the SRV records for changes.
Motivation
Who is the affected end user?
Users who use non-sharded clusters and who make changes to SRV records while applications are running.
How does this affect the end user? Are they blocked? Are they annoyed? Are they confused?
Loss of application availability and increased operational effort to resolve.
How likely is it that this problem or use case will occur?
The likelihood is reasonably high for users who have many environments that change their infrastructure or perform migrations frequently.
If the problem does occur, what are the consequences and how severe are they?
It is very difficult to debug because the cause is not clear in the logs. Troubleshooting required consultation with MongoDB Atlas experts.
Is this issue urgent?
Not urgent, but highly desired.
Is this ticket required by a downstream team?
No
Is this ticket only for tests?
No
Acceptance Criteria
SRV polling and resolution semantics are available for non-sharded clusters. The following is an (untested) example of how this might be achieved today using the NodeJS native driver:
However, as noted, this uses internal implementation details that may change in subsequent releases. Thus, first-class support is preferred. |
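Independently of any driver internals, the polling idea itself can be sketched with Node's public `dns` API. The following is a minimal, hedged sketch: the function names (`toHostList`, `diffHosts`, `pollOnce`, `watchSrv`), the 60-second default interval, and the example host name are all assumptions made for illustration, and the code does not update the driver's topology. What to do when the records change (for example, replacing the `MongoClient`) is left to the application.

```javascript
// Sketch only: polls SRV records via Node's public dns API and reports changes.
// It does not touch the MongoDB driver; all function names here are hypothetical.
const { promises: dns } = require('node:dns');

// Normalize SRV answers into a sorted list of "host:port" strings.
function toHostList(records) {
  return records.map((r) => `${r.name.toLowerCase()}:${r.port}`).sort();
}

// Compare two host lists and report which entries were added or removed.
function diffHosts(previous, current) {
  const prev = new Set(previous);
  const curr = new Set(current);
  return {
    added: current.filter((h) => !prev.has(h)),
    removed: previous.filter((h) => !curr.has(h)),
  };
}

// Resolve once, compare with the last known hosts, and call onChange on a difference.
// The resolver is injectable so the logic can be exercised without real DNS.
async function pollOnce(srvName, lastHosts, onChange, resolveSrv = (n) => dns.resolveSrv(n)) {
  const hosts = toHostList(await resolveSrv(srvName));
  const diff = diffHosts(lastHosts, hosts);
  if (diff.added.length > 0 || diff.removed.length > 0) {
    onChange(hosts, diff);
  }
  return hosts;
}

// Poll on a fixed interval (assumed default: 60 s); returns a function that stops polling.
function watchSrv(srvName, onChange, intervalMs = 60_000) {
  let hosts = [];
  let stopped = false;
  let timer;
  const tick = async () => {
    try {
      hosts = await pollOnce(srvName, hosts, onChange);
    } catch {
      // Treat DNS failures as transient: keep the last known hosts and retry later.
    }
    if (!stopped) timer = setTimeout(tick, intervalMs);
  };
  tick();
  return () => {
    stopped = true;
    clearTimeout(timer);
  };
}

// Usage (hypothetical host name):
// const stop = watchSrv('_mongodb._tcp.cluster0.example.net', (hosts, diff) => {
//   console.log('SRV hosts changed', hosts, diff);
// });
```

Because the first poll starts from an empty host list, `onChange` fires once with every resolved host reported as added. Later calls fire only when the record set actually differs from the previous result.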
| Comments |
| Comment by Alex Bevilacqua [ 04/Oct/23 ] |
For multithreaded drivers there would be a monitoring thread for the SRV record and a separate monitoring thread for SDAM, which could conceivably result in a race condition between monitors. |
| Comment by Bob Tiernay [ 04/Oct/23 ] |
|
Got it. What, if anything, is the downside to performing this in addition to the natural SDAM behavior for replica sets? |
| Comment by Alex Bevilacqua [ 04/Oct/23 ] |
SRV resolution for replica sets provides the initial seed list; however, server discovery is then performed as part of steady-state monitoring. For example, if a 3-member replica set were to add 2 members, a subsequent monitoring event (heartbeat) would include that information automatically, and the drivers can update their representation of the cluster topology based on that information. There is no such mechanism for drivers to discover changes to the available mongos instances, which is why polling SRV was scoped to sharded clusters only. |
| Comment by Bob Tiernay [ 04/Oct/23 ] |
Correct, but SRV resolution only happens on initial discovery. After that it never happens again. How is the lack of need for this capability different from that of mongos rediscovery? |
| Comment by Alex Bevilacqua [ 04/Oct/23 ] |
|
Hi rtiernay@gmail.com, and thanks for opening this ticket.
Polling SRV for mongos discovery was scoped exclusively to sharded clusters as replica sets already had a discovery mechanism within the drivers in the form of server discovery and monitoring.
The issue you've described is an edge case that occurs as a result of how private endpoints are being managed, how replica set reconfiguration works, and how server discovery/monitoring works. I'm looking into this in greater detail for the time being to determine how best to proceed. |
| Comment by Bob Tiernay [ 03/Oct/23 ] |
|
One thing I failed to ask in the ticket description is why this limitation exists in the first place. I assume there was some rationale, but that is not reflected in any specification or Jira that I can find. At a minimum, updating these documents to reflect that rationale would be greatly appreciated for those who are in a similar situation. |
| Comment by Bob Tiernay [ 03/Oct/23 ] |
|
Please note that this Jira may assist with https://jira.mongodb.org/browse/DRIVERS-910 for clusters in transition to Mongo 8 which only supports sharded clusters. |