[CSHARP-1343] Ability to set a retry policy Created: 02/Jul/15  Updated: 08/Apr/19  Resolved: 08/Apr/19

Status: Closed
Project: C# Driver
Component/s: Connectivity, Error Handling
Affects Version/s: 2.0.2
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Bret Ferrier Assignee: Unassigned
Resolution: Duplicate Votes: 4
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Cloud


Issue Links:
Related
is related to CSHARP-2026 All writes retryable support Closed
is related to CSHARP-2482 Full implementation of retryable reads Closed
is related to CSHARP-2512 Support Retryable Writes on by Default Closed

 Description   

So I am running a website on Windows Azure Websites with 2-4 instances connecting to a mongo replica set. My site was originally running on a VM and has since migrated to "The Cloud". Since moving it and upgrading to the 2.0 driver, the number of Mongo errors that I see has skyrocketed.

I have read many posts and changed the idle timeout to 45 seconds (Azure has upped its socket idle timeout to 4 minutes now, I believe...) and have had to implement retry logic and sprinkle it over the whole code base, which is very ugly and painful, just to handle the socket errors that I am seeing.
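
For reference, lowering the idle timeout on the client side looks roughly like this with the 2.x driver (host names are placeholders; 45 seconds is the value mentioned above):

    // Drop pooled connections before Azure's load balancer silently kills them.
    var settings = MongoClientSettings.FromUrl(new MongoUrl("mongodb://host1,host2/?replicaSet=rs0"));
    settings.MaxConnectionIdleTime = TimeSpan.FromSeconds(45);
    var client = new MongoClient(settings);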

The SQL Server driver was updated some time ago to handle these "Transient Connection Errors": https://msdn.microsoft.com/en-us/library/azure/ff394106.aspx

To make Mongo easier to use in the cloud, the driver should have some sort of retry logic/policy built in. Ideally it would be something that could be set on the IMongoClient, with overrides where appropriate.
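
To make the request concrete, the desired surface might look something like the hypothetical sketch below; none of the retry-policy types or properties shown exist in the driver.

    // Hypothetical API sketch only -- RetryPolicy / ExponentialBackoffRetryPolicy are invented names.
    var settings = MongoClientSettings.FromUrl(new MongoUrl("mongodb://host1,host2/?replicaSet=rs0"));
    settings.RetryPolicy = new ExponentialBackoffRetryPolicy(     // hypothetical property and type
        maxAttempts: 3,
        initialDelay: TimeSpan.FromMilliseconds(100),
        retryOn: ex => ex is MongoConnectionException);           // hypothetical predicate hook
    var client = new MongoClient(settings);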



 Comments   
Comment by Ian Whalen (Inactive) [ 18/Mar/19 ]

bret@cityspark.com sorry it took a while to get back to you on this, but we have since implemented retryable writes in 2.5 (see CSHARP-2026) and are going to be turning them on by default in 2.9.0 (see CSHARP-2512). We're also planning to add retryable reads in 2.9.0 (see CSHARP-2482).

Given that we've split this work out, it probably makes sense for us to close this ticket now and point to those two, but please let us know if we've missed anything you're looking for here.
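
For anyone finding this later, opting in looks roughly like the following, assuming a 2.9.0+ driver (host names are placeholders; retryWrites is on by default there and is shown only for clarity):

    // Connection string switches (placeholder hosts):
    var url = new MongoUrl("mongodb://host1,host2/?replicaSet=rs0&retryWrites=true&retryReads=true");
    var settings = MongoClientSettings.FromUrl(url);

    // Equivalent programmatic switches:
    settings.RetryWrites = true;
    settings.RetryReads = true;

    var client = new MongoClient(settings);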

Comment by VItaliy [ 13/Jul/18 ]

This would be a great feature.


Currently I have to wrap all my Mongo calls with a Polly retry policy.
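
For context, the kind of Polly wrapper meant here looks roughly like this (exception types, delays, and the collection/filter are illustrative):

    // using MongoDB.Driver; using Polly;
    // Retry transient connection/timeout failures with a short linear backoff.
    var retryPolicy = Policy
        .Handle<MongoConnectionException>()
        .Or<TimeoutException>()
        .WaitAndRetryAsync(3, attempt => TimeSpan.FromMilliseconds(100 * attempt));

    var orders = await retryPolicy.ExecuteAsync(() =>
        ordersCollection.Find(o => o.Status == "open").ToListAsync());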

Comment by Bret Ferrier [ 03/Jul/15 ]

So a few weeks ago I did submit an issue (CSHARP-1303) and didn't really get anywhere, other than the suggestion that the problem was perhaps due to the network being "flaky". That said, if you look at the SQL driver, it handles specific errors and allows you to set a policy for when you want to retry the operation:

https://msdn.microsoft.com/en-us/data/dn456835.aspx
https://msdn.microsoft.com/en-us/library/system.data.entity.sqlserver.sqlazureexecutionstrategy(v=vs.113).aspx
http://www.asp.net/aspnet/overview/developing-apps-with-windows-azure/building-real-world-cloud-apps-with-windows-azure/transient-fault-handling

In my case, 95% of the retry logic I have had to add to the code base could have been avoided if the driver retried on low-level socket exceptions: the exception is thrown while connecting, so the operation could safely be retried whether it is an update/insert/read etc. With SqlAzureExecutionStrategy you can create your own class that specifies how long to wait between retries and which commands or exceptions to retry on, but most people just use SqlAzureExecutionStrategy as-is (see the sketch below).
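
For comparison, the EF6 pattern being referred to looks roughly like this (retry count and delay are illustrative):

    using System;
    using System.Data.Entity;
    using System.Data.Entity.SqlServer;

    // EF6 discovers this configuration automatically; SQL commands issued through
    // the context are then retried on known transient Azure SQL errors.
    public class MyDbConfiguration : DbConfiguration
    {
        public MyDbConfiguration()
        {
            SetExecutionStrategy(
                "System.Data.SqlClient",
                () => new SqlAzureExecutionStrategy(maxRetryCount: 3, maxDelay: TimeSpan.FromSeconds(5)));
        }
    }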

Since nothing like this exists at a lower level, and I am trying to avoid forking the code base to add it myself, I have been using extension methods like the ones below when a query is executed, but it would be a lot nicer to have something more sophisticated and built in.

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using MongoDB.Driver;

    public static class MongoRetryHelpersAsync
    {
        // Retries FirstOrDefaultAsync up to reTryTimes attempts.
        public static Task<TProjection> FirstOrDefaultAsyncWithRetry<TDocument, TProjection>(this IFindFluent<TDocument, TProjection> find, int reTryTimes = 2, Action<Exception> logThis = null, bool throwLast = false, int msSleep = 20)
        {
            return Retry.TimesAsAwaitable<TProjection>(reTryTimes, () =>
            {
                return find.FirstOrDefaultAsync<TDocument, TProjection>();
            }, logThis: logThis, throwLast: throwLast, msSleep: msSleep);
        }

        // Retries ToListAsync up to reTryTimes attempts.
        public static Task<List<TProjection>> ToListAsyncWithRetry<TDocument, TProjection>(this IFindFluent<TDocument, TProjection> find, int reTryTimes = 2, Action<Exception> logThis = null, bool throwLast = false, int msSleep = 20)
        {
            return Retry.TimesAsAwaitable<List<TProjection>>(reTryTimes, () =>
            {
                return find.ToListAsync();
            }, logThis: logThis, throwLast: throwLast, msSleep: msSleep);
        }
    }

    public class Retry
    {
        // Invokes doThis up to 'number' times, waiting msSleep milliseconds before each retry.
        // Failures are passed to logThis (if supplied); the final exception is rethrown only
        // when throwLast is true, otherwise default(T) is returned after the last attempt.
        public static async Task<T> TimesAsAwaitable<T>(int number, Func<Task<T>> doThis, Action<Exception> logThis = null, bool throwLast = false, int msSleep = 20)
        {
            for (int x = 0; x < number; x++)
            {
                try
                {
                    if (x > 0 && msSleep > 0)
                    {
                        await Task.Delay(msSleep);
                    }
                    return await doThis();
                }
                catch (Exception ex)
                {
                    if (logThis != null)
                    {
                        try
                        {
                            logThis(ex);
                        }
                        catch (Exception)
                        {
                            // swallow logging failures
                        }
                    }
                    if (x + 1 == number && throwLast)
                        throw;
                }
            }
            return default(T);
        }
    }
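
For illustration, the helpers above would be used something like this (the collection and filter are placeholders):

    // Retries the query up to 3 times, logging each failure and rethrowing the last one.
    var user = await usersCollection
        .Find(u => u.Email == email)
        .FirstOrDefaultAsyncWithRetry(reTryTimes: 3, logThis: ex => Console.WriteLine(ex), throwLast: true, msSleep: 50);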

Comment by Craig Wilson [ 03/Jul/15 ]

Unless I have misread it, the linked documentation for SQL Server indicates that you are able to retry these errors, which suggests you would also have to sprinkle retry logic throughout your codebase there.

While there are certainly cases where we might be able to perform a retry automatically (queries), in many cases it is going to be a business decision about whether or not it is safe to attempt a write again. We are having this discussion internally across all drivers in order to be consistent.

In the meantime, a stark increase in errors is certainly alarming. If you would like some help figuring out what these errors are, please post to our Google discussion forum and we can open a JIRA ticket if we can't figure it out.

Craig
