- Type: Improvement
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: 1.3.1
- Component/s: None
- None
- Minor Change
In our production environment, the MongoDB server runs close to the limits of its I/O subsystem (we have a write-intensive application that keeps MongoDB write-locked much of the time, but that's another story). When MongoDB flushes its memory-mapped files (every 60 seconds by default), the mean query execution time starts to rise, which is completely expected. This, of course, leads to additional connections being opened, because requests still arrive at a steady pace. If the flush takes about a second (yes, I know this isn't a healthy value, but currently we have to live with it), the connection pool overflows and our backend threads start getting TimeoutException (hopefully CSHARP-393 will change that to something more expected), because they hit the wait queue timeout.
I wonder why WaitQueueTimeout is so low by default (500 ms). If the connection pool is close to exhaustion because of an intermittent slowdown caused by a disk flush, LVM snapshotting, OS cache pollution, etc., most of the threads simply won't get a connection within 500 ms. If there is no reason for it to be so low, I propose making it half of the connection timeout, which would translate to 15 seconds.
While I understand that in the case of a severe server slowdown this increase would leave a lot of threads hanging, in typical parallel client scenarios (ASP.NET, WCF services, etc.) these threads come from the thread pool, whose effective limit depends on the number of processors, so the increase would not oversaturate the OS with threads or exhaust memory with thread stacks. The benefit of increasing it is clear: a 1-2 second server "lag" would no longer abort about half of the requests (our scenario) just because the connection pool was saturated at that moment.
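To make the effect concrete, here is a minimal Python sketch of the situation, not the driver's actual pool implementation. It assumes a toy model in which a semaphore stands in for the connection pool, every request holds its connection for the whole stall, and all times are scaled down 10x (0.05 s models the 500 ms default, 1.5 s models the proposed 15 s, 0.2 s models a 2 s stall). The name `run_burst` and its parameters are hypothetical.

```python
import threading
import time


def run_burst(pool_size, workers, hold_seconds, wait_queue_timeout):
    """Return (succeeded, timed_out) for one burst of concurrent requests."""
    pool = threading.Semaphore(pool_size)  # stands in for the connection pool
    lock = threading.Lock()
    counts = {"ok": 0, "timeout": 0}

    def request():
        # Wait for a free connection, giving up after wait_queue_timeout.
        if pool.acquire(timeout=wait_queue_timeout):
            try:
                time.sleep(hold_seconds)  # query stalled behind the flush
            finally:
                pool.release()
            outcome = "ok"
        else:
            outcome = "timeout"
        with lock:
            counts[outcome] += 1

    threads = [threading.Thread(target=request) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts["ok"], counts["timeout"]


if __name__ == "__main__":
    # Pool of 2 connections, 6 concurrent requests, 0.2 s stall per request.
    print("short wait queue timeout:", run_burst(2, 6, 0.2, 0.05))
    print("long wait queue timeout: ", run_burst(2, 6, 0.2, 1.5))
```

In this toy model, the short timeout lets only the requests that grab a connection immediately succeed, while the long timeout lets the queued requests wait out the stall and complete in later waves. Real workloads differ, since requests keep arriving during the stall.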
I also understand that we can set it on a per-database basis in client code, but I still recommend changing the default: people with low loads won't notice the change, and people who are close to their servers' capacity won't get a truckload of TimeoutExceptions in their face =)
Of course, I may be missing some point that justifies the low default value; sorry to bother you with this if I really have missed something.