Drivers / DRIVERS-2712

Retry commands that fail because of the selected server being concurrently removed from the topology

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Unknown
    • None
    • Component/s: Retryability
    • Labels:
      None
    • Needed

      Summary

      What is the problem or use case, what are we trying to achieve?

      As a result of SDAM and SRV record polling, a driver may remove a server from the topology. The SDAM spec requires the server monitor to halt when the server is removed. While the spec does not say anything about the server's connection pool, it is only reasonable to close all of the pool's connections and make sure the pool does not create new ones. This renders the server unusable, which is a natural expectation for a server that was removed from the topology.
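
      To make the removal behavior concrete, here is a minimal sketch, not the Java driver's actual code. All types below (Connection, ConnectionPool, Server) are hypothetical stand-ins, and using IllegalStateException for a closed pool is an assumption made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical types for illustration only; they do not mirror any real driver API.
class RemovedServerPoolSketch {

    static final class Connection {
        private boolean open = true;
        void close() { open = false; }
        boolean isOpen() { return open; }
    }

    static final class ConnectionPool {
        private final Deque<Connection> idle = new ArrayDeque<>();
        private boolean closed;

        // Hands out an idle connection or creates a new one, unless the pool is closed.
        synchronized Connection checkOut() {
            if (closed) {
                throw new IllegalStateException("pool is closed: server was removed from the topology");
            }
            Connection c = idle.poll();
            return c != null ? c : new Connection();
        }

        // Returns a connection to the pool; a closed pool closes it instead of keeping it.
        synchronized void checkIn(Connection c) {
            if (closed) {
                c.close();
            } else {
                idle.push(c);
            }
        }

        // Closes all idle connections and refuses to create new ones from now on.
        synchronized void close() {
            closed = true;
            for (Connection c : idle) {
                c.close();
            }
            idle.clear();
        }
    }

    static final class Server {
        final ConnectionPool pool = new ConnectionPool();
        private volatile boolean monitoring = true;

        // Removal halts monitoring (as SDAM requires) and, by assumption, closes the pool.
        void removeFromTopology() {
            monitoring = false;
            pool.close();
        }

        boolean isMonitoring() { return monitoring; }
    }
}
```

      With this shape, any checkOut() that happens after removeFromTopology() fails, which is the failure the rest of this ticket is concerned with.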

      Another observation is that a driver may try to use a removed server to execute a command: a server selected for executing a command may be concurrently removed from the topology and made unusable. This race condition may cause the command to fail (the Java driver throws MongoServerUnavailableException). It seems reasonable to treat such a failure as retryable (see retryable reads, writes) or, if it happens for a change stream, resumable. As far as I can see, no specification requires it to be treated this way. The Java driver treats MongoServerUnavailableException as resumable, but not as retryable.
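
      The proposed behavior could look roughly like the sketch below. It is an illustration under stated assumptions, not the Java driver's implementation: Server, Topology, ServerUnavailableException and runWithRetry are hypothetical names, the afterSelection hook exists only to reproduce the race deterministically, and the limit of two attempts is an assumption.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-ins for illustration only; none of these types come from a real driver.
class RemovedServerRetrySketch {

    // Thrown when a command reaches a server that was already removed and made unusable.
    static final class ServerUnavailableException extends RuntimeException {
        ServerUnavailableException(String message) { super(message); }
    }

    static final class Server {
        final String address;
        private volatile boolean removed;
        Server(String address) { this.address = address; }
        void markRemoved() { removed = true; }
        String run(String command) {
            if (removed) {
                throw new ServerUnavailableException(address + " was removed from the topology");
            }
            return command + " ok on " + address;
        }
    }

    // A toy topology that always selects the first server it currently knows about.
    static final class Topology {
        private final List<Server> servers = new CopyOnWriteArrayList<>();
        void add(Server s) { servers.add(s); }
        void remove(Server s) { servers.remove(s); s.markRemoved(); }
        Server select() {
            if (servers.isEmpty()) throw new IllegalStateException("no servers available");
            return servers.get(0);
        }
    }

    // Treats ServerUnavailableException as retryable: reselects a server and tries again.
    static String runWithRetry(Topology topology, String command, Runnable afterSelection) {
        int maxAttempts = 2; // assumption: one initial attempt plus one retry
        ServerUnavailableException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Server selected = topology.select();
            afterSelection.run(); // hook standing in for a concurrent topology change
            try {
                return selected.run(command);
            } catch (ServerUnavailableException e) {
                last = e; // the selected server vanished under us; select again
            }
        }
        throw last;
    }

    // Reproduces the race: server "a" is removed right after it has been selected.
    static String demo() {
        Topology topology = new Topology();
        Server a = new Server("a.example:27017");
        Server b = new Server("b.example:27017");
        topology.add(a);
        topology.add(b);
        return runWithRetry(topology, "ping", () -> topology.remove(a));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "ping ok on b.example:27017"
    }
}
```

      Without the catch block in runWithRetry, the first attempt's exception would reach the caller even though a second, healthy server was available.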

      Motivation

      Who is the affected end user?

      Who are the stakeholders?

      Driver users.

      How does this affect the end user?

      Are they blocked? Are they annoyed? Are they confused?

      A user sees retryable operations not being retried in a situation where retrying seems reasonable.

      How likely is it that this problem or use case will occur?

      Main path? Edge case?

      The described failure is usually less likely to happen than other kinds of failures that the specifications require to be retryable/resumable; however, it still happens and users notice: https://jira.mongodb.org/browse/JAVA-5119.

      If the problem does occur, what are the consequences and how severe are they?

      Minor annoyance at a log message? Performance concern? Outage/unavailability? Failover can't complete?

      Not severe. A user observes failures that a driver could have hidden by retrying.

      Is this issue urgent?

      Does this ticket have a required timeline? What is it?

      No.

      Is this ticket required by a downstream team?

      Needed by e.g. Atlas, Shell, Compass?

      No.

      Is this ticket only for tests?

      Does this ticket have any functional impact, or is it just test improvements?

      No.

      Acceptance Criteria

      What specific requirements must be met to consider the design phase complete?

      We must introduce unified/prose tests for the improvement.

            Assignee:
            Unassigned
            Reporter:
            valentin.kovalenko@mongodb.com Valentin Kavalenka
            Votes:
            0
            Watchers:
            3
