[DRIVERS-1571] Direct read/write retries to another mongos if possible Created: 19/Feb/21  Updated: 06/Feb/24

Status: Implementing
Project: Drivers
Component/s: Performance, Retryability, Server Selection
Fix Version/s: None

Type: Epic Priority: Major - P3
Reporter: Oleg Pudeyev (Inactive) Assignee: Dmitry Rybakov
Resolution: Unresolved Votes: 6
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-53287 Improve cluster/mongos health observa... Closed
Documented
Issue split
split to JAVA-4254 Direct retries to another mongos if o... Backlog
split to CSHARP-3757 Direct read/write retries to another ... Backlog
split to PHPC-1911 Direct read/write retries to another ... Backlog
split to CXX-2320 Direct read/write retries to another ... Blocked
split to CDRIVER-4099 Direct read/write retries to another ... In Code Review
split to GODRIVER-2101 Direct read/write retries to another ... Closed
split to MOTOR-792 Direct read/write retries to another ... Closed
split to NODE-3470 Direct read/write retries to another ... Closed
split to PYTHON-2834 Direct read/write retries to another ... Closed
split to RUBY-2748 Direct read/write retries to another ... Closed
split to RUST-935 Direct read/write retries to another ... Closed
Problem/Incident
Related
related to SERVER-50459 Include "source" field in error respo... Backlog
related to DRIVERS-2828 Update prose tests for mongos deprior... In Review
is related to DRIVERS-1842 Drivers should retry authentication e... Backlog
is related to DRIVERS-2140 Clarify Auth Spec and Clean Up Error ... Backlog
Driver Changes: Not Needed
Server Compat: 4.4, 5.0, 5.3
Quarter: FY24Q1, FY24Q2, FY24Q3
Upstream Changes Summary:

Details TBD

Downstream Changes Summary:

Drivers should implement the changes to the server selection and read/write retry mechanisms, as well as the new prose tests: specifications@86d961f

Case:
Engineering Lead: Jeffrey Yemin
Product Manager: Alex Bevilacqua
Program Manager: Tom Selander
Start date:
Scope Cost Estimate: 0
Cost to Date: 0
Final Cost Estimate: 0
Cost Threshold %: 100
Detailed Project Statuses:

Engineer: Dmitry Rybakov
Summary: When encountering a retryable error, direct the retry attempt to a different mongos if possible.

2023-06-23

  • Design approved
  • Ready for implementation in Q3 with Go

2023-06-09

  • Design approved
  • Changes will be ported to spec repo

2023-05-12

  • Design work started
  • Decided against adapting the unified test format to accommodate this project's special test needs, due to the implementation complexity involved
Driver Compliance:
Key             Status/Resolution   FixVersion
CDRIVER-4099    In Code Review      1.26.0
CXX-2320        Blocked             3.10.0
CSHARP-3757     Backlog
GODRIVER-2101   Fixed               1.13.0, 1.13.1
JAVA-4254       Backlog
NODE-3470       Fixed               6.4.0
MOTOR-792       Duplicate
PYTHON-2834     Fixed               4.7
PHPC-1911       Backlog
RUBY-2748       Fixed               2.19.4
RUST-935        Fixed               2.8.0
SWIFT-1279      Won't Do

 Description   

There are several scenarios in which it would be useful to redirect reads or writes to a different mongos.

  1. A MongoDB sharded cluster deployment may end up in a state where a mongos reports itself as healthy but is unable to execute any queries. In such a case the driver retried the failing queries, but in a number of cases it selected the same mongos that had failed in the first place. The retry then failed for the same reason as the original attempt, and the error was propagated to the application.
  2. Currently, when the driver is connected to a sharded topology, the server selection spec requires a random server to be selected for each operation. This permits the same failed mongos to be selected for both an operation and its retry, so the query fails even when the deployment contains healthy mongoses that could have executed it successfully.

The suggested improvement is for the driver, when in sharded cluster topology, to:

  • Track whether a server selection request is for the first attempt or for a retry.
  • Track the server used for the first attempt.
  • When selecting the server for the retry, if there are multiple eligible mongoses, select randomly from the mongoses other than the one used for the first attempt.
  • Bonus (nice to have): determine whether a mongos is healthy before making the retry attempt and, if it is unhealthy, exclude it from selection.
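
The first three points above can be sketched roughly as follows. This is an illustrative sketch only, not the code of any driver or of the specification change: the names select_mongos and run_with_retry, the string server addresses, and the is_retryable callback are all hypothetical, and a single retry is assumed.

```python
import random
from typing import Callable, Optional, Sequence, Set


def select_mongos(eligible: Sequence[str],
                  deprioritized: Optional[Set[str]] = None) -> str:
    """Pick a mongos at random, avoiding deprioritized ones when possible."""
    if not eligible:
        raise ValueError("no eligible mongos")
    deprioritized = deprioritized or set()
    preferred = [s for s in eligible if s not in deprioritized]
    # If every eligible mongos was already tried, fall back to the full list.
    candidates = preferred or list(eligible)
    return random.choice(candidates)


def run_with_retry(eligible: Sequence[str],
                   operation: Callable[[str], object],
                   is_retryable: Callable[[Exception], bool]) -> object:
    """Run operation once and retry at most once on a retryable error."""
    first = select_mongos(eligible)  # first attempt: plain random choice
    try:
        return operation(first)
    except Exception as exc:
        if not is_retryable(exc):
            raise
        # Retry: remember the server used first and steer away from it.
        second = select_mongos(eligible, deprioritized={first})
        return operation(second)
```

The fallback to the full list keeps the single-mongos case working: when only one mongos is eligible, the retry goes to that same server rather than failing server selection.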

Cast of Characters:
Product Manager for Feature: alex.bevilacqua@mongodb.com
Program Manager: tom.selander@mongodb.com
Engineering Lead: dmitry.rybakov@mongodb.com



 Comments   
Comment by Githook User [ 08/Sep/23 ]

Author: Dmitry Rybakov <dmitry.rybakov@mongodb.com> (comandeo)

Message: DRIVERS-1571 Fix changelog entries (#1456)
Branch: master
https://github.com/mongodb/specifications/commit/edf51dc4fdf8bb4f4dab3f41e799a53f49c26c8e

Comment by Githook User [ 25/Aug/23 ]

Author: Dmitry Rybakov <dmitry.rybakov@mongodb.com> (comandeo)

Message: DRIVERS-1571 Retry on different mongos when possible (#1450)

Co-authored-by: Alex Bevilacqua <alex@alexbevi.com>
Co-authored-by: Preston Vasquez <prestonvasquez@icloud.com>
Branch: master
https://github.com/mongodb/specifications/commit/86d961fee4c5e92fbcff76a62abe1aea3fafd451

Comment by Jeffrey Yemin [ 16/Jun/22 ]

SPEC-1555 seems related to this issue, as over time that feature will help the driver to route even the initial command to a mongos that is experiencing less queuing.

Comment by Oleg Pudeyev (Inactive) [ 07/Jul/21 ]

If the question is about errors returned for operations, I don't believe there is one.

Server selection does not presently take into account errors that happened on a particular connection or server. Given two mongoses in a sharded cluster, each will be equally likely to be chosen (for each new operation or retry) even if one is failing every single operation with any error code.
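
To put a number on this, here is a small back-of-the-envelope calculation. It rests on simplifying assumptions that are not stated in this ticket: exactly one mongos fails every operation, all others are healthy, each selection is uniform and independent, and there is a single retry. The function name failure_probability is hypothetical.

```python
from fractions import Fraction


def failure_probability(n_mongos: int, avoid_first: bool) -> Fraction:
    """Chance that the attempt and its single retry both hit the bad mongos."""
    if n_mongos < 1:
        raise ValueError("need at least one mongos")
    if n_mongos == 1:
        # Only the failing mongos exists, so both attempts must use it.
        return Fraction(1)
    first_hits_bad = Fraction(1, n_mongos)
    # When the retry avoids the first server, it cannot hit the bad one again.
    retry_hits_bad = Fraction(0) if avoid_first else Fraction(1, n_mongos)
    return first_hits_bad * retry_hits_bad


print(failure_probability(2, avoid_first=False))  # 1/4
print(failure_probability(2, avoid_first=True))   # 0
```

Under these assumptions, a two-mongos cluster with one bad mongos fails a quarter of retryable operations when selection is purely random, and none when the retry is directed to the other mongos.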

However, if the mongos responds to ismaster with an error, then drivers should take it out of use.

Comment by Jeffrey Yemin [ 23/Feb/21 ]

Also, note that it will not interact well with DRIVERS-720.

Generated at Thu Feb 08 08:23:38 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.