[SERVER-2327] MongoDB stuck after db.getServerStatus().globalLock.currentQueue.writers exceeds 128 (windows) Created: 04/Jan/11 Updated: 12/Jul/16 Resolved: 30/Aug/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Concurrency |
| Affects Version/s: | None |
| Fix Version/s: | 2.0.0-rc0 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Remon van Vliet | Assignee: | Dwight Merriman |
| Resolution: | Done | Votes: | 3 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Reproduced on Windows 7 64-bit on 5 separate machines. |
| Attachments: | |
| Issue Links: | |
| Operating System: | Windows |
| Participants: | |
| Description |
|
Reproduce using the Java program at http://pastie.org/1428863. With CONNECTION_COUNT set to 128 the problem disappears; at 129 or higher it occurs 100% of the time on fast machines. Slower machines tend not to reach the 128-writer point.
db.serverStatus().globalLock.currentQueue.writers shows 128 and never recovers, regardless of load or wait time. Verify by running db.test.getIndexes(); it never returns after this.
mongostat output after the test, confirming the load is in fact gone: http://pastie.org/1428842. Other hosts can still successfully connect to the same mongo instance. |
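The pastie links above may no longer resolve. Purely as a hedged illustration of the access pattern described (many concurrent connections issuing writes so that more than 128 writers queue on the global lock), here is a minimal sketch using the legacy C++ driver. The original reproduction is a Java program, and every identifier below, including CONNECTION_COUNT and the test namespace, is an assumption rather than the original code.

```cpp
// Hypothetical sketch of the reproduction described above; not the code from
// the pastie link (which is a Java program). Requires the legacy MongoDB C++
// driver and Boost.Thread.
#include <boost/bind.hpp>
#include <boost/thread.hpp>
#include "mongo/client/dbclient.h"

static const int CONNECTION_COUNT = 129;  // 128 or fewer: no hang; 129+: server wedges

static void hammer(int id) {
    mongo::DBClientConnection conn;
    conn.connect("localhost:27017");      // one connection per thread
    for (int i = 0; i < 100000; ++i) {
        // plain inserts so every operation queues on the global write lock
        conn.insert("test.repro", BSON("thread" << id << "i" << i));
    }
}

int main() {
    boost::thread_group workers;
    for (int i = 0; i < CONNECTION_COUNT; ++i)
        workers.create_thread(boost::bind(hammer, i));
    workers.join_all();                   // on an affected build this never returns
    return 0;
}
```

While this runs, db.serverStatus().globalLock.currentQueue.writers climbing to 128 and staying there after the load stops is the symptom described above.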
| Comments |
| Comment by Remon van Vliet [ 30/Aug/11 ] |
|
Hi, I'll confirm resolution when I get around to it. Thanks! |
| Comment by Eliot Horowitz (Inactive) [ 30/Aug/11 ] |
|
Can someone who had this issue confirm it's fixed in 2.0.0-rc0? |
| Comment by Dwight Merriman [ 23/Aug/11 ] |
|
I think this is fixed. SlimReaderWriter would be better but requires Windows Server 2008 R2. I do want to use it for a simple_rwlock as it's better (for mmmutex specifically), so I'm leaving this open until then. That will be 2.2. (Doing a simple rwlock with no 'try' supports slightly older OS versions.) |
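A hedged sketch of what a SlimReaderWriter-based simple_rwlock could look like is below; this is not the actual mongod code. The Win32 SRW calls themselves are available from Windows Vista / Server 2008, while the 'try' variant needs Windows 7 / Server 2008 R2, which is the "no 'try'" distinction drawn above.

```cpp
// Illustrative sketch only; not MongoDB's actual simple_rwlock/mmmutex code.
#include <windows.h>

class SimpleRWLock {
    SRWLOCK _lock;
public:
    SimpleRWLock()       { InitializeSRWLock(&_lock); }
    void lock()          { AcquireSRWLockExclusive(&_lock); }  // writer
    void unlock()        { ReleaseSRWLockExclusive(&_lock); }
    void lock_shared()   { AcquireSRWLockShared(&_lock); }     // reader
    void unlock_shared() { ReleaseSRWLockShared(&_lock); }
#if _WIN32_WINNT >= 0x0601
    // The 'try' call only exists on Windows 7 / Server 2008 R2 and later,
    // which is why a no-'try' rwlock supports slightly older OS versions.
    bool try_lock()      { return TryAcquireSRWLockExclusive(&_lock) != 0; }
#endif
};
```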
| Comment by Eliot Horowitz (Inactive) [ 05/Aug/11 ] |
|
Code is done - just need a newer machine to build on. |
| Comment by Dwight Merriman [ 01/Jul/11 ] |
|
Now works up to ~1000 connections, but a more elaborate fix, or the SlimReaderWriter locks (Windows 7 and 2008 R2 only), is required for a complete fix. |
| Comment by Remon van Vliet [ 18/Mar/11 ] |
|
Just attempted a repro on 1.8.0. The hard freezes seem to be gone, but there are quite a few other issues surfacing under load now. Also, the ar|aw column in mongostat still displays slowly increasing numbers even when idling after a load test. I'll report the issues I'm running into now separately. |
| Comment by Remon van Vliet [ 17/Mar/11 ] |
|
I'll try and reproduce on 1.8 when I get round to it. Thanks. |
| Comment by auto [ 17/Mar/11 ] |
|
Author: Dwight (dwight@10gen.com, login: dwight). Message: comments |
| Comment by Dwight Merriman [ 17/Mar/11 ] |
|
Some work was done on this and it is in 1.8. Can you try 1.8 and LMK if it still happens? Specifically, a patch to boost mutex:
// in rwlock.h
But in the future will use slim reader writer locks on win64 which will be a better solution. |
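The patch snippet that followed "// in rwlock.h" above was not preserved here. Purely as a generic illustration of the area being discussed, and not the actual 1.8 patch, a boost-based read/write lock wrapper of the kind rwlock.h provides might look roughly like this:

```cpp
// Generic illustration only; not the actual patch referenced above.
#include <boost/thread/shared_mutex.hpp>

class RWLock {
    boost::shared_mutex _m;
public:
    void lock()          { _m.lock(); }          // exclusive (writer)
    void unlock()        { _m.unlock(); }
    void lock_shared()   { _m.lock_shared(); }   // shared (reader)
    void unlock_shared() { _m.unlock_shared(); }
    bool try_lock()      { return _m.try_lock(); }
};
```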
| Comment by Eliot Horowitz (Inactive) [ 25/Jan/11 ] |
|
The boost mutex is directly causing this behavior, so it's really quite simple to follow. For the potential easy fix see: |
| Comment by Doug Marien [ 25/Jan/11 ] |
|
Can you provide more information about why the boost mutex implementation is broken on Windows? I'm also curious why your fix is only available for Win 7? Can you point out the part of the mongodb code you're seeing as the cause of the deadlock/mutex issue? |
| Comment by Remon van Vliet [ 25/Jan/11 ] |
|
Reproduction video. |
| Comment by Remon van Vliet [ 25/Jan/11 ] |
|
It's more than a stats problem. After running the repro, any operation on the database never returns, so a db.test.find() will not return at all from the shell, nor will any other query. Basically, if the server reaches this point it has to be restarted before it can be used again. It accepts connections just fine though, but that's to be expected if it's a lock/mutex issue of some sort. I think both Scott and Doug successfully reproduced the issue and got the server stuck. The reproduction code gets the server into this state on all the machines I've tried (64-bit; 32-bit is a bit hit and miss).
It doesn't cause a problem for me personally since it's only our dev load-test boxes that get stuck; we deploy to Linux environments only. I do think that in its current state the issue should be mentioned in the downloads section at least, and this JIRA entry elevated to BROKEN. I understand that you don't want a rushed fix though, so if 1.9.X is the first possible version with a fix then so be it. Let me know if I can help you guys with a repro. I could even show you the issue with a desktop cast if you'd like and are having trouble repro-ing on local machines.
EDIT: The plot thickens. I cannot reproduce the issue (stats or stuck) at the moment using the same machine, data files and the same code. The only "change" is that a Windows update ran yesterday. Can anyone still reproduce this?
EDIT2: Never mind, it just takes a lot longer for some reason. |
| Comment by Eliot Horowitz (Inactive) [ 25/Jan/11 ] |
|
Does it cause any actual problems for you? The not-so-simple fix is rewriting all the mutex code with basic Windows primitives. Given there are people running well in production now, we feel that's too risky a change to put in 1.8.0 given it's in the final stretch. |
| Comment by Remon van Vliet [ 25/Jan/11 ] |
|
Hm, sorry to hear there's no simple fix available. I previously checked for known issues with boost mutexes and/or their usage on Windows and I couldn't find any, so it must be a rare issue. Are there other "not so simple" fixes available? In its current state I would not consider the Windows builds production-ready, so the appropriate warnings should be added to the download page. |
| Comment by Eliot Horowitz (Inactive) [ 25/Jan/11 ] |
|
Yes, both 32- and 64-bit. |
| Comment by AndrewK [ 25/Jan/11 ] |
|
does the "There is no simple fix for this" comment apply to both 64bit AND 32bit machines? previously you indicated something might have been done to fix this under only 64-bit. does that comment no longer apply? |
| Comment by Eliot Horowitz (Inactive) [ 25/Jan/11 ] |
|
There is no simple fix for this. |
| Comment by Remon van Vliet [ 17/Jan/11 ] |
|
What about 32-bit Windows? We're able to reproduce it on that environment as well, as mentioned above. I'll confirm the fix on 64-bit Windows when I have time, once 1.7.5 becomes available. |
| Comment by Eliot Horowitz (Inactive) [ 16/Jan/11 ] |
|
boost on Windows is broken. |
| Comment by Remon van Vliet [ 05/Jan/11 ] |
|
Seems to be isolated to Windows 64-bit platforms. Not repro'd on Linux systems or Win 32. Win 32 seems to show connection starvation though. Will look into it and post a separate issue if needed. EDIT: Scratch that, confirmed on 32-bit Windows as well, but it takes a faster machine to get it there. |
| Comment by Remon van Vliet [ 05/Jan/11 ] |
|
Official binaries for me, specifically 1.6.3, 1.6.5 and 1.7.4. System 1 : System 2 : System 1 takes about 7 seconds to reach the lock, system 2 almost instantly. |
| Comment by Eliot Horowitz (Inactive) [ 05/Jan/11 ] |
|
Seems there might be a bug in boost mutex. |
| Comment by Doug Marien [ 04/Jan/11 ] |
|
Also, for me it's sometimes not an immediate repro because my machine is doing other things, but letting it run for a bit and then starting a shell to issue some queries seems to trigger it faster. |
| Comment by Doug Marien [ 04/Jan/11 ] |
|
I'm using the official Windows 64-bit binaries for 1.6.3 and 1.7.4.
OS: Windows Vista Business 64-bit SP2
My quick code port in python+pymongo: http://pastie.org/1429529 |
| Comment by Eliot Horowitz (Inactive) [ 04/Jan/11 ] |
|
Also, was this with official binaries? |
| Comment by Eliot Horowitz (Inactive) [ 04/Jan/11 ] |
|
Can you paste exact OS/Mongo versions? |
| Comment by Doug Marien [ 04/Jan/11 ] |
|
I'm also able to reproduce this on Windows Vista 64-bit (quad-core) with 1.6.3 and 1.7.4, using that test rewritten for python+pymongo. It seems to be a timing issue, because if I introduce some load on the machine then I'm unable to reproduce the lockup. |