[CSHARP-1144] Investigate tests busting over the wait queue size Created: 05/Dec/14 Updated: 17/Mar/22 Resolved: 22/Oct/20 |
|
| Status: | Closed |
| Project: | C# Driver |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Craig Wilson | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 11 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Case: | (copied to CRM) |
| Description |
|
See here for an example: https://travis-ci.org/mongodb/mongo-csharp-driver/builds/42987125#L255 |
| Comments |
| Comment by Jeffrey Yemin [ 10/Mar/18 ] |
|
PR: https://github.com/mongodb/mongo-csharp-driver/pull/316
| Comment by Jeremy Stafford [ 07/Aug/17 ] |
|
Are we sure this is limited to Mono? I'm having the same issue in plain old .NET. I have my max connection pool set to 165, my max wait queue set to 3 times that, and I'm using a SemaphoreSlim set to 100, but I'm still getting errors stating the wait queue is full.
| Comment by Drew Burden [ 06/Sep/16 ] |
|
That makes sense. Next time we perform server maintenance, we will update our Mono and see if the issue persists.
| Comment by Robin Munn [ 06/Sep/16 ] |
|
Drew Burden's comment led me to search the Mono source code, and I believe I've found the race condition in Mono that's triggering this bug. In Mono 3.12 and earlier, the SemaphoreSlim implementation of Wait(int millisecondsTimeout, CancellationToken cancellationToken) looks like this in relevant part:
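(What follows is an illustrative paraphrase rather than the verbatim Mono source; the class and field names are invented, but the ordering matches the description below.)

```csharp
using System.Diagnostics;
using System.Threading;

// Illustrative paraphrase only, NOT the verbatim Mono source; names are invented.
// The point is the ordering: the timeout is checked before the available count is
// ever examined, so Wait(0) can spuriously return false whenever at least 1 ms
// elapses between StartNew() and the first check.
class OldMonoStyleSemaphoreSlim
{
    private int currCount;

    public OldMonoStyleSemaphoreSlim(int initialCount)
    {
        currCount = initialCount;
    }

    public bool Wait(int millisecondsTimeout, CancellationToken cancellationToken)
    {
        var stopwatch = Stopwatch.StartNew();
        var spinner = new SpinWait();

        while (true)
        {
            cancellationToken.ThrowIfCancellationRequested();

            // 1. The stop condition (timeout) is checked FIRST ...
            if (millisecondsTimeout >= 0 && stopwatch.ElapsedMilliseconds > millisecondsTimeout)
            {
                return false;   // Wait(0) lands here if >= 1 ms has already elapsed
            }

            // 2. ... and only THEN is the current count examined.
            int count = currCount;
            if (count > 0 &&
                Interlocked.CompareExchange(ref currCount, count - 1, count) == count)
            {
                return true;
            }

            // The real implementation parks the thread here; a spin keeps the sketch short.
            spinner.SpinOnce();
        }
    }
}
```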
Note the order of those first few lines: first the stop condition is checked, and only then is the current count examined. If SemaphoreSlim.Wait is called with a timeout of 0 ms, and it just so happens that at least 1 millisecond has already elapsed between the time the stopwatch is started and the time the stop condition is checked (because the server was under heavy load), then the current count will never be examined, and Wait() will return false even though the wait queue was far from full.

In Mono 4.0 they switched to the Microsoft reference sources for SemaphoreSlim, which don't suffer from this race condition. That's why some people have reported that this bug goes away in Mono 4.0: SemaphoreSlim.Wait(0) no longer reports spurious failures there. But in Mono 3.12 and earlier, SemaphoreSlim.Wait(0) cannot be relied upon. If it returns false, it might be because 1 ms happened to pass between when it started its timer and when it first checked the number of available slots. Since it checks the timer before checking the count, it will return spurious failures, which then turn into MongoWaitQueueFullExceptions being incorrectly thrown.
| Comment by Drew Burden [ 02/Sep/16 ] |
|
On line 408 of ExclusiveConnectionPool.cs
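(A hedged sketch of the shape of the call being described, not the driver's exact code; the exception type and message here are approximations.)

```csharp
using System;
using System.Threading;

// Hedged sketch only: in the real pool the semaphore guards the wait queue and the
// exception thrown is MongoWaitQueueFullException.
static class WaitQueueSketch
{
    public static void EnterWaitQueue(SemaphoreSlim waitQueue)
    {
        // Wait(0) never blocks; it only asks "can the semaphore be entered right now?"
        if (!waitQueue.Wait(0))
        {
            // false is interpreted as "the wait queue is full", but on a SemaphoreSlim
            // with the race described above, false can also be a spurious failure.
            throw new InvalidOperationException(
                "The wait queue for acquiring a connection to a server is full.");
        }
    }
}
```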
I see that it is calling this method on the SemaphoreSlim (https://msdn.microsoft.com/en-us/library/dd289488(v=vs.110).aspx) with an argument of 0 ms. When this method returns false (indicating the semaphore was not entered), the driver throws MongoWaitQueueFullException, even if the wait queue isn't full. The thread just has to fail to enter the semaphore, and the driver will think the queue is full. Is it possible that some sort of race condition is causing the wait to fail immediately, since it is given no time to actually wait?
| Comment by Drew Burden [ 02/Sep/16 ] |
|
We downloaded and recompiled the source code for the C# driver and added some debug output logging the size of the wait queue any time something tries to enter it. We watched the wait queue hold at around 20,000 available slots (our arbitrarily high queue limit), dipping occasionally to 19,995 or so, and then all of a sudden we get the wait-queue-full exception. Right after the exception, the log still shows around 20k available slots in the queue. This suggests the wait queue exception is being thrown under circumstances in which the wait queue is not actually full. Unfortunately, I don't know enough about the inner workings of the driver to offer a good explanation for why this might be happening.
| Comment by Drew Burden [ 02/Sep/16 ] |
|
We are having issues with this as well. Our wait queue seems to fill up even though the connections in our pool have not yet been used. Or, if they are being used, something else weird is going on. We have a client connection pool size of 8192, yet mongostat never shows more than about 600 connections at any given time, even though mongo reports something like 50k connections available. This is severely detrimental to our production environment, as it is causing data to be lost when the wait queue exception is thrown.
| Comment by Dave New [ 17/May/16 ] |
|
I'm experiencing the same problem while running around 10 parallel InsertManyAsync(collection) calls. Each call inserts only a few hundred small documents.
| Comment by Zeioth [ 03/Apr/16 ] |
|
I'm receiving the same exception. I'm performing 3 find operations of about 1M results each. My only alternative right now is to run them synchronously.
| Comment by DragosC [ 21/Mar/16 ] |
|
I must admit I agree with ZunTzu. If you're really close to the maximum number of waiting elements, a small network jam is enough to make everything explode. So far I have only been able to reproduce this problem by inserting around 1M documents via InsertManyAsync from a single thread (the only calling thread in the entire application).
| Comment by ZunTzu [ 15/Oct/15 ] |
|
Maybe the timeout parameter was not such a great idea after all. It doesn't matter: I am more interested in blocking than throwing anyway. Increasing the wait queue size will not help in my use case. Imagine you want to insert millions (or more) of documents using the .NET driver; there would be no correct value for the size of the queue. The only viable solution right now is to throttle the program, for instance by waiting on every task, to avoid running out of memory. It is possible to do this, but it is a bit awkward. For the time being I will work around the exception with my own throttling algorithm. Thanks for the insight into the workings of the driver.
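Such a throttling workaround might look roughly like the following sketch (hypothetical, not ZunTzu's actual code; the in-flight cap is an arbitrary placeholder):

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using MongoDB.Driver;

// Hypothetical throttling workaround of the kind described above: cap the number of
// in-flight inserts so the driver's wait queue is never exhausted.
static class ThrottledInsert
{
    public static async Task InsertAllAsync<T>(
        IMongoCollection<T> collection,
        IEnumerable<T> documents,
        int maxInFlight = 50)   // placeholder value, tune to your pool size
    {
        var gate = new SemaphoreSlim(maxInFlight);
        var pending = new List<Task>();

        foreach (var document in documents)
        {
            await gate.WaitAsync();     // blocks when maxInFlight inserts are in flight
            pending.Add(InsertOneAsync(collection, document, gate));
        }

        await Task.WhenAll(pending);
    }

    private static async Task InsertOneAsync<T>(
        IMongoCollection<T> collection, T document, SemaphoreSlim gate)
    {
        try
        {
            await collection.InsertOneAsync(document);
        }
        finally
        {
            gate.Release();             // free a slot for the next insert
        }
    }
}
```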
| Comment by Craig Wilson [ 15/Oct/15 ] |
|
I'm not sure this is unexpected. There is a constrained set of resources (connections) and a well-defined, tunable wait queue to get in line for those connections. Once that queue is full, I don't think anyone would expect us to have yet another queue behind it. However, you can already emulate the behavior you want: set the wait queue size to a very, very large number so that you never hit it. In addition, you can provide a cancellation token, which will be honored while waiting in the queue, and that token can have a timeout associated with it. Since we accept cancellation tokens everywhere, we have not provided separate timeout parameters on calls like this. For the behavior you are looking for, just set a very large wait queue size and you'll never hit this issue.

Craig
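A minimal sketch of the workaround Craig describes, assuming the 2.x-era MongoClientSettings API; MyDocument and the numbers are placeholders:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using MongoDB.Driver;

// Illustrative sketch of the workaround described above; MyDocument and the numbers
// are placeholders, not recommendations.
class MyDocument { public string Name { get; set; } }

static class LargeWaitQueueExample
{
    public static async Task RunAsync()
    {
        var settings = MongoClientSettings.FromUrl(new MongoUrl("mongodb://localhost:27017"));
        settings.MaxConnectionPoolSize = 100;
        settings.WaitQueueSize = 1000000;   // "very, very large" so the full-queue throw is never hit

        var client = new MongoClient(settings);
        var collection = client.GetDatabase("test").GetCollection<MyDocument>("docs");

        // A token with a timeout bounds how long the call may sit in the wait queue.
        using (var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30)))
        {
            await collection.InsertOneAsync(new MyDocument { Name = "example" },
                                            cancellationToken: cts.Token);
        }
    }
}
```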
| Comment by ZunTzu [ 15/Oct/15 ] |
|
Hi Craig, thanks for the swift reply. It is really nice to get a rapid response! If I understand correctly, your strategy is to accumulate resources downstream (wait queue entries) in order to avoid accumulating resources upstream (threads or other application resources). A way to meet everyone's needs would be to add a timeout parameter to InsertOneAsync and other similar methods. That way three choices would be readily available: throw immediately, wait up to a bounded time, or block indefinitely.

A benefit of adding this parameter is that the behavior would be more apparent to developers. I must confess I was surprised by the current behavior.
| Comment by Craig Wilson [ 15/Oct/15 ] |
|
Hi ZunTzu, have you increased the default wait queue size and still hit this issue? Perhaps increase it more? We chose not to wait here because the wait queue is already full, and the wait queue is already much larger than the connection pool. When it's full, we throw. If we didn't, threads could pile up here, wait for extremely long periods of time, and potentially cause a memory problem.

Craig
| Comment by ZunTzu [ 15/Oct/15 ] |
|
Hi! I am encountering this problem time and time again, at ExclusiveConnectionPool.cs, line 155.
My question is: why did you choose to throw an exception instead of waiting for the semaphore?
| Comment by venkat [ 11/Jun/15 ] |
|
Thank you Craig, yes, it is working fine as expected with the 2.0.1 driver.

Venkat
| Comment by Craig Wilson [ 11/Jun/15 ] |
|
Hi venkat, have you increased the wait queue size? The default is 500, which is likely much too low in a web application. The ticket

Craig
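For example, a minimal sketch of raising the wait queue size well above the default, assuming the waitQueueSize connection string option of the 2.x-era C# driver; the host and numbers are placeholders:

```csharp
using MongoDB.Driver;

// Minimal sketch: raise the wait queue size well above the 500 default via the
// connection string (option name assumed from the 2.x-era C# driver).
static class WaitQueueSizeExample
{
    public static MongoClient Create()
    {
        var url = new MongoUrl(
            "mongodb://localhost:27017/?maxPoolSize=200&waitQueueSize=10000");
        return new MongoClient(url);
    }
}
```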
| Comment by venkat [ 11/Jun/15 ] |
|
We are still getting the following error:

[MongoDB.Driver.MongoWaitQueueFullException] = {"The wait queue for acquiring a connection to server full."}

It is a critical issue and we can't go into production; please provide a workaround.
| Comment by Jacob Jewell [ 01/May/15 ] |
|
I'm not 100% sure if this is the case, but after we upgraded our server from Mono 3.12 to 4.0, the issue with the wait queue seems to have gone away.
| Comment by daniel moqvist [ 29/Apr/15 ] |
|
We are also seeing this outside of unit testing. We have not moved on to the new async API in the driver; we have just ported our code from driver 1.10 to 2.0 and are using the commands in the legacy driver assembly. We are running the Save and Find methods in a Parallel.ForEach. We never had this problem with the old 1.10 driver, but it came with the 2.0 driver. We are running in a Windows environment. The exception we get is:

at MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool.<AcquireConnectionAsync>d__0.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at MongoDB.Driver.Core.Servers.ClusterableServer.<GetChannelAsync>d__0.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.ConfiguredTaskAwaitable`1.ConfiguredTaskAwaiter.GetResult()
at MongoDB.Driver.Core.Operations.FindOperation`1.<ExecuteAsync>d__2.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at MongoDB.Driver.Sync.IOperationExtensions.Execute[TResult](IReadOperation`1 operation, IReadBinding binding, CancellationToken cancellationToken)
at MongoDB.Driver.MongoCursor`1.GetEnumerator()
at System.Collections.Generic.List`1.InsertRange(Int32 index, IEnumerable`1 collection)
| Comment by Craig Wilson [ 27/Apr/15 ] |
|
Yes, we are still trying to determine whether this is expected or not. The issue relates to the fact that, since we are async, we hand threads back to ASP.NET, which makes it think it can do more. However, we still have a connection pool with a max size, so at some point your app is going to try to do more than the driver is allowed to handle. While MongoDB and the driver can handle the load, I've only seen this issue show up when a network issue happens and one or more servers get disconnected temporarily. We are still working through the best way to combat this problem.

Craig
| Comment by Richard Kubina [ 27/Apr/15 ] |
|
I have seen this outside of unit testing recently and am investigating whether the use of Parallel.ForEach without any ParallelOptions (like MaxDegreeOfParallelism) set is causing it.
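A sketch of that mitigation; the degree of parallelism is a placeholder, and the synchronous InsertOne call (available in later 2.x drivers) stands in for whatever the loop body actually does:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using MongoDB.Driver;

// Sketch of the mitigation being investigated: bound Parallel.ForEach so it cannot
// start more simultaneous driver calls than the connection pool can service.
static class BoundedParallelism
{
    public static void SaveAll<T>(IMongoCollection<T> collection, IEnumerable<T> documents)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = 8 };   // placeholder value

        Parallel.ForEach(documents, options, document =>
        {
            collection.InsertOne(document);   // one driver call per worker at a time
        });
    }
}
```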
| Comment by Jacob Jewell [ 08/Apr/15 ] |
|
We've been seeing this after upgrading to the new MongoDB driver, both in our tests and on our QA API. Our code is relatively unchanged from the 1.10 driver. We've tried increasing maxPoolSize to 300, but we still get the error occasionally.
| Comment by Robert Stam [ 05/Feb/15 ] |
|
| Comment by Craig Wilson [ 05/Feb/15 ] |
|