[DRIVERS-1397] Retryable reads for Atlas Search Created: 15/Sep/20  Updated: 22/Sep/21

Status: Blocked
Project: Drivers
Component/s: mongot, Retryability
Fix Version/s: None

Type: Epic Priority: Major - P3
Reporter: James Akins (Inactive) Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: mongot-cross-team
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Gantt Dependency
Related
Driver Changes: Needed

 Description   
Epic Summary

Summary
Drivers implement a variant of retryable reads for Atlas Search in which a read that fails with an error is retried on a different node.

Motivation
Customers who use Atlas Search currently have no way to avoid user-facing errors caused by mongot failures or by common Atlas maintenance activities such as node replacements. This feature will improve the user experience and adoption of Atlas Search, since users will experience fewer errors.

Cast of Characters

Engineering Lead:
Document Author:
POCers:
Product Owner:
Program Manager:
Stakeholders:

Documentation

[Scope Document|some.url]
[Technical Design Document|some.url]



 Comments   
Comment by Marcus Eagan (Inactive) [ 19/Oct/20 ]

A bit of business-minded context here:

The lack of retryable reads for a feature used to query MongoDB for data contradicts some of MongoDB's strongest selling points, such as high availability and support for microservice development. It might make a company looking to adopt DevOps hesitant.

Does anyone here have any thoughts about a meaningful workaround?

Comment by Bernie Hackett [ 15/Sep/20 ]

Some notes on the technical problem.

  • When Atlas Search is enabled, every mongod has an associated mongot instance (even in a sharded cluster). Each mongot is an instance of Lucene.
  • Atlas Search can be added to any existing deployment after the fact. Once you enable search, it takes some time for a mongot instance to be added for each mongod and for the search indexes to be created.
  • A mongod can accept a search query even though its associated mongot isn't currently available (down, creating indexes, etc.). The query will fail in this case.
  • It is not acceptable to mark the mongod down/unknown if a search query fails. Applications generally only use search for a small subset of queries and other queries are expected to work without issue.
  • mongot health is not communicated through mongod, so there is nothing to monitor through ismaster/hello calls.

After a number of potential solutions were discussed, the only one that seemed reasonable was for mongod to return a retryable error specifying that the retry should happen on a different server. This seems like a change to SDAM, Server Selection, and Retryable Reads, and will require changes in mongos to implement the new versions of those specs and in mongod to communicate a new retryable error type.
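
As a rough illustration of the behavior described above, here is a minimal Python sketch. It is not taken from any driver or spec: the `RetryOnDifferentServerError` exception, the `Server` class, and the `select_server` and `search_with_retry` helpers are all hypothetical names, and the mongod/mongot interaction is simulated in memory. The sketch only shows the two properties the notes call for: the retry goes to a different server, and the failing mongod is excluded for that one operation rather than being marked down/unknown.

```python
import random
from dataclasses import dataclass


class RetryOnDifferentServerError(Exception):
    """Hypothetical error type: the mongod's mongot is unavailable, retry elsewhere."""


@dataclass(frozen=True)
class Server:
    """Simulated mongod with an associated mongot (not a real driver type)."""
    address: str
    mongot_available: bool

    def run_search(self, query: str) -> str:
        # The mongod accepts the query even when its mongot is down or still
        # building indexes; in that case the query fails.
        if not self.mongot_available:
            raise RetryOnDifferentServerError(self.address)
        return f"results for {query!r} from {self.address}"


def select_server(servers, excluded=frozenset()):
    """Pick any server whose address is not excluded for this operation."""
    candidates = [s for s in servers if s.address not in excluded]
    if not candidates:
        raise RuntimeError("no eligible server left to retry on")
    return random.choice(candidates)


def search_with_retry(servers, query, max_retries=1):
    """Run a search, retrying on a different server after the hypothetical error."""
    excluded = set()
    for attempt in range(max_retries + 1):
        server = select_server(servers, excluded)
        try:
            return server.run_search(query)
        except RetryOnDifferentServerError:
            # The server is NOT marked down/unknown in the topology; it is only
            # skipped for the remaining attempts of this one operation.
            excluded.add(server.address)
            if attempt == max_retries:
                raise
```

For example, with `servers = [Server("a:27017", False), Server("b:27017", True)]`, a call to `search_with_retry(servers, "q")` returns results from `b:27017` whichever server is selected first, and other (non-search) operations could still be routed to `a:27017`.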

Comment by Bernie Hackett [ 15/Sep/20 ]

+rachelle.palmer

Comment by James Akins (Inactive) [ 15/Sep/20 ]

Note: the Atlas Search team would like the Drivers team to review this project for consideration for their FY21Q4 plan. In the meantime, please let us know if there are any questions we can help answer; thanks!

 

cc: doug.tarr kevin.rosendahl marcus.eagan 
