[SERVER-63333] Attach retryable error label to TemporarilyUnavailable error code in a Serverless environment
Created: 07/Feb/22
Updated: 23/May/23
Resolved: 23/May/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: Louis Williams
Assignee: Backlog - Storage Execution Team
Resolution: Won't Do
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-60839 Introduce a TemporarilyUnavailable er... Closed
Assigned Teams:
Storage Execution
Participants:

 Description   

When run under a Serverless environment, we want to dynamically attach a RetriableError label to the TemporarilyUnavailable error code under the assumption that the higher layers will throttle themselves.
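
A minimal sketch of the idea in this description, not the server's actual implementation (which is C++). `ErrorResponse`, `attach_labels`, and the `serverless` flag are hypothetical names chosen for illustration; the label string is the `RetryableWriteError` label named in the comments on this ticket.

```python
from dataclasses import dataclass, field
from typing import List

# Label name as mentioned in the comments on this ticket.
RETRYABLE_WRITE_LABEL = "RetryableWriteError"


@dataclass
class ErrorResponse:
    """Hypothetical stand-in for an error reply sent to the client."""
    code_name: str
    error_labels: List[str] = field(default_factory=list)


def attach_labels(response: ErrorResponse, *, serverless: bool) -> ErrorResponse:
    """Attach the retryable label to TemporarilyUnavailable only in Serverless.

    The decision is made at reply time, so the same error code can be
    labelled differently depending on the deployment type.
    """
    if serverless and response.code_name == "TemporarilyUnavailable":
        if RETRYABLE_WRITE_LABEL not in response.error_labels:
            response.error_labels.append(RETRYABLE_WRITE_LABEL)
    return response
```

Outside Serverless the reply is returned unchanged, so clients see the unlabelled error.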



 Comments   
Comment by Louis Williams [ 23/May/23 ]

Closing in favor of implementing a more comprehensive load-shedding strategy in the server.

Comment by Louis Williams [ 03/Mar/22 ]

Pausing work on this until we determine whether SERVER-64050 is sufficient to just raise the timeout delay inside the server.

Comment by Esha Maharishi (Inactive) [ 24/Feb/22 ]

louis.williams, this matches my understanding. We acknowledged that by attaching RetryableWriteError, drivers will only be able to retry retryable writes, not regular writes, in two places:

From the Slack conversation a while ago, where we agreed to do (1) for now, then likely eventually (2):

(1) Hook into the drivers' existing retry system for now by having the server attach a retryable error label when running in Serverless. This might require the server to return a different error in Serverless and non-Serverless, to get around how we statically declare error codes (though @louis.williams, it looks like the decision for what error label to attach is done dynamically in C++?). This also has the limitation of only being able to use our existing error labels, which don't apply to every operation.

(2) Make Atlas Proxy retry internally on the error.
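
For illustration, option (2) could look roughly like the sketch below. This is only an assumption about the shape of such a retry loop, not Atlas Proxy code: the `TemporarilyUnavailable` exception class, `retry_in_proxy`, the attempt limit, and the exponential backoff delays are all hypothetical.

```python
import time


class TemporarilyUnavailable(Exception):
    """Hypothetical stand-in for the server's TemporarilyUnavailable error."""


def retry_in_proxy(send, *, max_attempts=3, base_delay_s=0.05, sleep=time.sleep):
    """Call send() and retry inside the proxy while the server is overloaded.

    Retries only on TemporarilyUnavailable, waits longer after each failure,
    and re-raises the error once max_attempts calls have failed.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except TemporarilyUnavailable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the client
            sleep(base_delay_s * 2 ** attempt)  # back off before the next try
```

Because the retry happens in the proxy, it applies to any operation and does not depend on the driver's existing error labels.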

From the Alternative to transactions larger than cache doc:

  • SERVER-60839: Make MongoDB retry a finite number of times on a transaction that was aborted due to dirty cache being full, then return an error with a retryable error label.
    • Interaction with drivers
      • If the operation is a retryable write, the driver will retry the write once (currently) or up to timeoutMs (after Client Side Operations Timeout). The error will not cause the driver to close any connections or suddenly open many new connections, the way a socket timeout would. We can only speculate how the application will handle the error.
      • If the operation is not a retryable write, the driver will return the error to the application without retrying.
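
The driver behavior described in the bullets above can be sketched as a single decision function. This is a simplified model under stated assumptions, not any driver's real code: `driver_should_retry` and its parameters are hypothetical, and `max_retries=1` reflects the "retry the write once (currently)" behavior quoted above.

```python
def driver_should_retry(error_labels, *, is_retryable_write, retries_done, max_retries=1):
    """Return True if a driver following the rules above would retry.

    A retry happens only when the operation is a retryable write, the error
    carries the RetryableWriteError label, and the retry budget is not spent.
    """
    return (
        is_retryable_write
        and "RetryableWriteError" in error_labels
        and retries_done < max_retries
    )
```

A regular (non-retryable) write never satisfies the first condition, which is the limitation acknowledged in this comment.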

matt.broadstone, thanks for documenting the options for improving the driver retry. Just a note: I think there was general interest in having Atlas Proxy/mongos/mongoq do the retries in the long term, so that we don't have to update all drivers (and users don't have to upgrade their drivers).

Generated at Thu Feb 08 05:57:31 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.