[SERVER-2400] Differentiating transient from permanent errors Created: 24/Jan/11  Updated: 29/Aug/11  Resolved: 11/Aug/11

Status: Closed
Project: Core Server
Component/s: Internal Client, Stability
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Aristarkh Zagorodnikov Assignee: Unassigned
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

I propose adding a mechanism and/or specification to easily differentiate transient errors from permanent ones to make writing robust client-side code easier.
A transient error (for example, a replica set failure that leaves the set without a primary for a few seconds) can eventually go away, so the client should retry the operation if it was performing reads or writes that are safe to apply more than once. A permanent error (protocol violation, database corruption) will never go away and should be signaled to the client immediately.
Currently, every driver and client has to find its own set of errors that "can be retried" (experimentally, by simulating failures) to ensure smooth operation in the face of hardware or software failure. Adding the proposed "transient" flag (or any other mechanism; this proposal doesn't pretend to be ideal) and supporting it in the existing driver code, possibly even with auto-retry that makes reads transparent across failures when the user enables it, would greatly improve the hardware failure resilience of client-side code.
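
To make the proposal concrete, here is a minimal Python sketch of how client code might use such a flag. Everything in it is an assumption for illustration: `OperationError`, its `transient` attribute, and `call_with_retry` are hypothetical names, not part of any existing MongoDB driver.

```python
import time


class OperationError(Exception):
    """Hypothetical driver error carrying the proposed 'transient' flag."""

    def __init__(self, message, transient):
        super().__init__(message)
        self.transient = transient


def call_with_retry(operation, attempts=3, delay=1.0, sleep=time.sleep):
    """Run operation(), retrying only while the error is marked transient.

    Intended for reads or writes that are safe to apply more than once.
    A permanent error, or a transient error on the last attempt, is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OperationError as exc:
            if not exc.transient or attempt == attempts:
                raise
            sleep(delay)  # give the cluster time to recover, e.g. elect a primary
```

A caller would wrap a repeatable operation, for example `call_with_retry(lambda: run_query(), attempts=5)`, where `run_query` stands in for whatever read the application performs.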



 Comments   
Comment by Aristarkh Zagorodnikov [ 10/Aug/11 ]

I would like to add that we are no longer interested in this feature, since it appears that building a simple table and a set of rules is enough to cover about 98% of the cases (by frequency of occurrence). So, if no one else is interested in this, I believe this case can be closed.
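
The comment does not include the actual table or rules, so the following Python sketch only illustrates the general shape of such an approach. The numeric codes are made-up placeholders rather than real MongoDB error codes, and `classify` is a hypothetical helper.

```python
# Placeholder codes for illustration only; these are NOT real MongoDB error codes.
TRANSIENT_CODES = frozenset({9001, 9002})
PERMANENT_CODES = frozenset({9101})

# Rule: connection-level failures are treated as transient.
TRANSIENT_EXCEPTIONS = (ConnectionError, TimeoutError)


def classify(exc, code=None):
    """Return "transient" or "permanent" for an error.

    The code table is consulted first, then the exception-type rule.
    Anything unrecognized is treated as permanent so that it is reported
    instead of being retried silently.
    """
    if code in PERMANENT_CODES:
        return "permanent"
    if code in TRANSIENT_CODES:
        return "transient"
    if isinstance(exc, TRANSIENT_EXCEPTIONS):
        return "transient"
    return "permanent"
```

Treating unknown errors as permanent is a design choice of this sketch: a table that covers most cases by frequency will still miss some, and those are safer to surface than to retry.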

Comment by Aristarkh Zagorodnikov [ 24/Jan/11 ]

This case is actually closely related to http://jira.mongodb.org/browse/CSHARP-155.
I checked out the error codes, but the only list I found online (http://www.mongodb.org/display/DOCS/Error+Codes) is a bit outdated.
Is there (I'm fine with any format, including plain text) an up-to-date list of all error codes (should be a very useful thing for driver developers), or maybe some kind of predefined ranges/severity tables (like the HTTP status code ranges)?

My point is that in many cases you can resolve an intermittent problem by simply repeating the call (given that you have some kind of client-side concurrency control for your updates), but there are some cases where retrying is useless and the failure is permanent. If there were a mechanism that signals "this error requires DBA/developer attention and is not a temporary failure that should resolve automatically in a few seconds" (or the exact opposite, "this error should be transient, because it looks like a connection problem"), the driver and/or client code could make more informed decisions about the likely outcome of retrying the operation.

My exact case is performing inserts/queries on a mongos instance (using a C# driver) with several replica sets. I would like to know the difference between "cluster is destroyed, sound the alarm" and "we lost connectivity to the primary of one of the shards, please come back a few seconds later" without testing for every available error message, code and exception type. That approach is very error-prone: while MongoDB is actively developed, error messages might change, new codes can get added and old ones removed (also related to http://jira.mongodb.org/browse/CSHARP-153).

Comment by Eliot Horowitz (Inactive) [ 24/Jan/11 ]

There already are error codes associated with this, so not sure if adding anything else makes sense.

Generated at Thu Feb 08 02:59:51 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.