[SERVER-2327] MongoDB stuck after db.getServerStatus().globalLock.currentQueue.writers exceeds 128 (windows) Created: 04/Jan/11  Updated: 12/Jul/16  Resolved: 30/Aug/11

Status: Closed
Project: Core Server
Component/s: Concurrency
Affects Version/s: None
Fix Version/s: 2.0.0-rc0

Type: Bug Priority: Critical - P2
Reporter: Remon van Vliet Assignee: Dwight Merriman
Resolution: Done Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Reproduced on Windows 7 64-bit on 5 separate machines.


Attachments: File mongostuck.avi    
Issue Links:
Related
Operating System: Windows
Participants:

 Description   

Reproduce using this Java program: http://pastie.org/1428863

With CONNECTION_COUNT set to 128 the problem disappears; at 129 or higher it occurs 100% of the time on fast machines. Slower machines tend not to reach the point of 128 queued writers.

db.serverStatus().globalLock.currentQueue.writers will show 128 and not recover, regardless of load or wait time. Verify by running db.test.getIndexes(); it never returns after this.

mongostat output after the test, confirming the load is in fact gone: http://pastie.org/1428842

Other hosts can successfully connect to the same mongo instance.
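
The linked Java program is not reproduced here. As a rough illustration only, the sketch below models the contention pattern described above in plain C++, with no MongoDB involved: many threads queue behind a single exclusive lock, and a counter plays the role of currentQueue.writers. The names run_writers, RunResult and still_queued are hypothetical and are not taken from the repro or from the server code. With a correctly working lock the queue drains to zero for any thread count, which is the behaviour the report says is lost above 128 writers on Windows.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Outcome of one simulated run (hypothetical helper type).
struct RunResult {
    int completed;     // writers that acquired and released the lock
    int still_queued;  // writers still waiting after all threads joined
};

// Spawn `writers` threads that all contend for one exclusive lock,
// loosely mirroring many connections queuing behind a global write lock.
RunResult run_writers(int writers) {
    std::mutex global_lock;      // stand-in for the server's global lock
    std::atomic<int> queued{0};  // analogous to currentQueue.writers
    int completed = 0;           // only touched while holding global_lock

    std::vector<std::thread> threads;
    threads.reserve(writers);
    for (int i = 0; i < writers; ++i) {
        threads.emplace_back([&] {
            ++queued;  // the writer enters the queue
            std::lock_guard<std::mutex> guard(global_lock);
            --queued;  // lock acquired, so it leaves the queue
            ++completed;  // the simulated "write"
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    return RunResult{completed, queued.load()};
}
```

Calling run_writers(129) on a healthy lock implementation should report 129 completed writers and none still queued; a server in the stuck state described here would instead show the queue count pinned at 128.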



 Comments   
Comment by Remon van Vliet [ 30/Aug/11 ]

Hi, I'll confirm resolution when I get around to it. Thanks!

Comment by Eliot Horowitz (Inactive) [ 30/Aug/11 ]

Can someone who had this issue confirm it's fixed in 2.0.0-rc0?

Comment by Dwight Merriman [ 23/Aug/11 ]

i think this is fixed. slimreaderwriter would be better but requires windows server 2008 R2. i do want to use it for a simple_rwlock as it's better – for mmmutex specifically – so i'm leaving this open until then. that will be 2.2. (doing a simple rwlock with no 'try' supports slightly older OS versions)

Comment by Eliot Horowitz (Inactive) [ 05/Aug/11 ]

Code is done - just need a newer machine to build on.

Comment by Dwight Merriman [ 01/Jul/11 ]

now works up to ~1000 connections, but a more elaborate fix, or the slimreaderwriter locks (windows7 and 2008r2 only), are required for a complete fix.

Comment by Remon van Vliet [ 18/Mar/11 ]

Just attempted a repro on 1.8.0. The hard freezes seem to be gone but there are quite a few other issues surfacing under load now. Also, the ar|aw column in mongostat still displays slowly increasing numbers even when idling after a load test. I'll report the issues I'm running into now separately.

Comment by Remon van Vliet [ 17/Mar/11 ]

I'll try and reproduce on 1.8 when I get round to it. Thanks.

Comment by auto [ 17/Mar/11 ]

Author: Dwight (dwight) <dwight@10gen.com>

Message: comments SERVER-2327
https://github.com/mongodb/mongo/commit/12a4af1c493d77ea54419dada3c46571d4cb7abb

Comment by Dwight Merriman [ 17/Mar/11 ]

some work was done on this and is in 1.8. can you try 1.8 and LMK if it still happens? Specifically, a patch to the boost mutex:

// in rwlock.h
#if defined(_WIN32)
#include "shared_mutex_win.hpp"
namespace mongo {
    typedef boost::modified_shared_mutex shared_mutex;
}

But in the future will use slim reader writer locks on win64 which will be a better solution.
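
For readers unfamiliar with slim reader/writer (SRW) locks, here is a minimal sketch of what such a wrapper might look like. It is not the code from rwlock.h: the class name SimpleRWLock and the helper count_with_lock are hypothetical. On Windows it uses the Win32 calls InitializeSRWLock, AcquireSRWLockExclusive, ReleaseSRWLockExclusive, AcquireSRWLockShared and ReleaseSRWLockShared; elsewhere it falls back to std::shared_mutex (C++17) so the example builds on any platform. It deliberately exposes no "try" operations, since the Win32 try variants need a newer Windows release than the blocking ones.

```cpp
#include <thread>
#include <vector>
#if defined(_WIN32)
#include <windows.h>
#else
#include <shared_mutex>
#endif

// Minimal reader-writer lock with blocking operations only (no "try").
class SimpleRWLock {
public:
    SimpleRWLock(const SimpleRWLock&) = delete;
    SimpleRWLock& operator=(const SimpleRWLock&) = delete;

#if defined(_WIN32)
    SimpleRWLock() { InitializeSRWLock(&_lock); }
    void lock()          { AcquireSRWLockExclusive(&_lock); }
    void unlock()        { ReleaseSRWLockExclusive(&_lock); }
    void lock_shared()   { AcquireSRWLockShared(&_lock); }
    void unlock_shared() { ReleaseSRWLockShared(&_lock); }
private:
    SRWLOCK _lock;  // SRW locks need no explicit destruction
#else
    SimpleRWLock() = default;
    void lock()          { _lock.lock(); }
    void unlock()        { _lock.unlock(); }
    void lock_shared()   { _lock.lock_shared(); }
    void unlock_shared() { _lock.unlock_shared(); }
private:
    std::shared_mutex _lock;  // portable fallback, assumption for this sketch
#endif
};

// Usage sketch: writer threads increment under the exclusive lock while
// one reader thread repeatedly observes the value under the shared lock.
int count_with_lock(int writers, int increments) {
    SimpleRWLock rw;
    int value = 0;

    std::vector<std::thread> threads;
    for (int i = 0; i < writers; ++i) {
        threads.emplace_back([&] {
            for (int j = 0; j < increments; ++j) {
                rw.lock();
                ++value;
                rw.unlock();
            }
        });
    }
    threads.emplace_back([&] {
        for (int j = 0; j < increments; ++j) {
            rw.lock_shared();
            int seen = value;  // read-only access under the shared lock
            (void)seen;
            rw.unlock_shared();
        }
    });
    for (auto& t : threads) {
        t.join();
    }
    return value;
}
```

Because every increment happens under the exclusive lock, count_with_lock(4, 1000) returns 4000 regardless of how the threads interleave.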

Comment by Eliot Horowitz (Inactive) [ 25/Jan/11 ]

The boost mutex is directly causing this behavior, so it's really quite simple to follow.

For the potential easy fix see:
https://github.com/mongodb/mongo/blob/master/util/concurrency/rwlock.h
look for MONGO_USE_SRW_ON_WINDOWS

Comment by Doug Marien [ 25/Jan/11 ]

Can you provide more information about why the boost mutex implementation is broken on Windows?

I'm also curious why your fix is only available for Win 7? Can you point out the part of the mongodb code you're seeing as the cause of the deadlock/mutex issue?

Comment by Remon van Vliet [ 25/Jan/11 ]

Reproduction video.

Comment by Remon van Vliet [ 25/Jan/11 ]

It's more than a stats problem. After running the repro any operation on the database never returns. So doing a db.test.find() will not return at all from the shell nor will any other query.

Basically, if the server reaches this point it has to be restarted before it can be used again. It accepts connections just fine though, but that's to be expected if it's a lock mutex issue of some sort. I think both Scott and Doug successfully reproduced the issue and got the server stuck. The reproduction code gets the server in this state on all the machines I've tried (64-bit, 32-bit is a bit hit and miss).

It doesn't cause a problem for me personally since it's only our dev load test boxes that get stuck; we deploy to Linux environments only. I do think in its current state the issue should be mentioned in the downloads section at least, and this JIRA entry elevated to BROKEN. I understand that you don't want a rushed fix though, so if 1.9.X is the first possible version with a fix then so be it.

Let me know if I can help you guys with a repro. I could show you the issue with a desktop cast even if you'd like and are having trouble repro-ing on local machines.

EDIT: The plot thickens. I cannot reproduce the issue (stats or stuck) at the moment using the same machine, data files and the same code. The only "change" is that a windows update ran yesterday. Can anyone still reproduce this?

EDIT2: Never mind, just takes a lot longer for some reason.

Comment by Eliot Horowitz (Inactive) [ 25/Jan/11 ]

Does it cause any actual problems for you?
In our testing it's just a stats problem.

The not so simple fix is re-writing all the mutex stuff with basic windows primitives.

Given there are people running in production well now, we feel that's too risky a change to put in 1.8.0 given it's in the final stretch.

Comment by Remon van Vliet [ 25/Jan/11 ]

Hm, sorry to hear there's no simple fix available. I previously checked for known issues with boost mutexes and/or their usage on Windows and I couldn't find any, so it must be a rare issue.

Are there other "not so simple" fixes available? In its current state I would not consider the Windows builds production ready, so the appropriate warnings should be added to the download page.

Comment by Eliot Horowitz (Inactive) [ 25/Jan/11 ]

Yes, both 32 and 64bit.
The fix we've got only works (compiles) on win7, so isn't really a viable fix.

Comment by AndrewK [ 25/Jan/11 ]

does the "There is no simple fix for this" comment apply to both 64bit AND 32bit machines? previously you indicated something might have been done to fix this under only 64-bit. does that comment no longer apply?

Comment by Eliot Horowitz (Inactive) [ 25/Jan/11 ]

There is no simple fix for this.
It's a bug either in boost or a Windows library boost is using, so swapping it out is fairly complex.

Comment by Remon van Vliet [ 17/Jan/11 ]

What about 32-bit windows? We're able to reproduce it on that environment as well as mentioned above.

I'll confirm fix on 64-bit windows when I have time once 1.7.5 becomes available.

Comment by Eliot Horowitz (Inactive) [ 16/Jan/11 ]

boost on windows is broken.
we've replaced the lock on 64-bit windows

Comment by Remon van Vliet [ 05/Jan/11 ]

Seems to be isolated to Windows 64 platforms. Not repro'd on Linux systems or Win 32. Win 32 seems to show connection starvation though. Will look into it and post a separate issue if needed.

EDIT: Scratch that, confirmed on 32-bit Windows as well, but it takes a faster machine to get it there.

Comment by Remon van Vliet [ 05/Jan/11 ]

Official binaries for me, specifically 1.6.3, 1.6.5 and 1.7.4.

System 1 :
OS : Windows 7 Pro 64-bit.
CPU : AMD Phenom II X4 965 Black Edition
MEM : 8Gb
DISK : 64Gb Intel X-25E SSD
Mongo : 1.6.5 and 1.7.4

System 2 :
OS : Windows 7 Home Premium 64-bit.
CPU : Intel Core i7 920 2.6GHz
MEM : 12Gb
DISK : 2x 64Gb Intel X-25E SSD RAID0
Mongo : 1.6.3

System 1 takes about 7 seconds to reach the lock, system 2 almost instantly.

Comment by Eliot Horowitz (Inactive) [ 05/Jan/11 ]

Seems there might be a bug in boost mutex.
Going to try some things

Comment by Doug Marien [ 04/Jan/11 ]

Also for me it's not an immediate repro sometimes because my machine is doing other things but letting it run for a bit and then starting a shell to issue some queries seems to trigger it faster.

Comment by Doug Marien [ 04/Jan/11 ]

I'm using the official Windows 64-bit binaries for 1.6.3 and 1.7.4.

OS: Windows Vista Business 64-bit SP2
Processor: Intel Core i7 920 2.6GHz
Memory: 12GB

My quick code port in python+pymongo: http://pastie.org/1429529

Comment by Eliot Horowitz (Inactive) [ 04/Jan/11 ]

Also, was this with official binaries?

Comment by Eliot Horowitz (Inactive) [ 04/Jan/11 ]

Can you paste the exact OS/Mongo versions?
We can't reproduce.

Comment by Doug Marien [ 04/Jan/11 ]

I'm also able to reproduce this on Windows Vista 64-bit quad-core using 1.6.3 and 1.7.4 using that test re-written for python+pymongo.

Seems to be a timing issue because if I introduce some load on the machine then I'm unable to reproduce the lockup.

Generated at Thu Feb 08 02:59:38 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.