[SERVER-4710] Writers queues never go down and server eventually stalls Created: 18/Jan/12 Updated: 15/Aug/12 Resolved: 02/Mar/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability |
| Affects Version/s: | 2.0.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | YannPierre CouzySchwartz | Assignee: | Eric Milkie |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Windows x64 |
||
| Issue Links: |
|
||||
| Operating System: | Windows | ||||
| Participants: | |||||
| Description |
|
We've posted on mongo-users what we believe is our core issue right now, namely the queued writers which never seem to go down once they've reached a threshold. The thread is here: http://groups.google.com/group/mongodb-user/browse_thread/thread/195264a598d87393# We've been able to reproduce the problem yesterday. After a --repair, the server seemed fine for a while, and when copying on the server the log file (the -v switch proved itself very verbose indeed, we had over 2GB of logs for a few hours of activity), the qw figure shot up drastically, as did the inbound connections (the clients were trying to compensate for the blocking queries), right before the server stopped serving queries altogether. I've uploaded on jira part of the logs for this episode... Later on, when we issued a repair on the database, the log was Tue Jan 17 00:09:45 [initandlisten] warning: ClientCursor::yield can't unlock b/c of recursive lock ns: Main.cachedItems top: { opid: 220, a (and so on for the 223 stuck queued writers). So it looks like some of our queries (we're implementing a poor-man distributed lock using atomic sets and checks) may be problematic when mongod is under duress (IO starvation, or other factors). We're moving the lock mechanism to a dedicated instance, and we're working more generally on our data access profile to minimize writes wrt reads, but we need to find out the root causes for this incident. |
| Comments |
| Comment by Yann Schwartz [ 02/Mar/12 ] |
|
Sorry for not giving updates. We've migrated to 2.1.0 (2008+) and the locking problem has never reappeared. We later switched to 2.0.3 and our issue is fixed for this version as well. Thank you. |
| Comment by Eric Milkie [ 28/Feb/12 ] |
|
How's your system running now with the Windows2008R2 build? |
| Comment by Eric Milkie [ 06/Feb/12 ] |
|
I tried using a build of 2.0.3 that is very close to the one you are using and still can't reproduce. Since I believe the code is essentially the same, the cause of the issue may be either the hardware, or the format or content of your data. What parameters are you using to start mongod? I did try raising the verbosity level on mine to see if the extra info would help, but unfortunately it logs every insert during the import, and thus slowed it down significantly. Other than this mongorestore issue, is it correct to say that it is running fine with the new Windows2008+ build? |
| Comment by YannPierre CouzySchwartz [ 06/Feb/12 ] |
|
Mongorestore with --host --db and -c switches (I've tried it with and without --drop) The binary we tested was compiled on Jan 30th Running mongod we get: |
| Comment by Eric Milkie [ 06/Feb/12 ] |
|
I tried dumping a 300MB database with 2.0.2 and then restored it with mongorestore using a Nightly build of 2.1.1/Windows 2008. The restore completes in 1 min 15 secs. There may be a regression in the specific build you are using. Can you give the initial log messages when you start mongod? Within the messages is the "git version" and "db version" which will point me to the specific build you are using. |
| Comment by YannPierre CouzySchwartz [ 06/Feb/12 ] |
|
Restoring on a 2.0.2 then running the latest binary pointing to the data directory works fine though. The issue seems to be with the way mongorestore works in the latest build you provided. |
| Comment by YannPierre CouzySchwartz [ 06/Feb/12 ] |
|
We did a few tests with the 2008R2 build, but when we tried to use mongorestore to restore our data, the restore process was very very slow (less than 10MB a minute, after 15 minutes it was still showing 0% progress)... We tested this on a 2GB and a 6GB dumps. The restore on the same backup files is very fast with the current 2.0.2. |
| Comment by Eliot Horowitz (Inactive) [ 23/Jan/12 ] |
|
Which build was that with in production? CAn you switch to the 2008R2+ build? |
| Comment by YannPierre CouzySchwartz [ 23/Jan/12 ] |
|
We haven't been able to reproduce the problem on our test setups. But we have in production. The relevant logs have been uploaded in the community private jira. |
| Comment by YannPierre CouzySchwartz [ 18/Jan/12 ] |
|
Mmh. The nightly build requires 2008 R2 (TryAcquireSRWLockExclusive), I'll need to switch the OS before testing, more info when everything is set up (a few hours from now). |
| Comment by YannPierre CouzySchwartz [ 18/Jan/12 ] |
|
Thank you Eliot. We'll try to reproduce on our current dev setup (2.0.2) then we'll try out your build. |
| Comment by Eliot Horowitz (Inactive) [ 18/Jan/12 ] |
|
There is a newer windows build that might help. |