Details
- Bug
- Resolution: Incomplete
- Major - P3
- None
- 2.0.2
- None
- Windows x64
- Windows
Description
We've posted on the mongodb-user list what we believe is our core issue right now: the queued writers (qw) count, which never seems to go down once it has reached a certain threshold.
The thread is here: http://groups.google.com/group/mongodb-user/browse_thread/thread/195264a598d87393#
We were able to reproduce the problem yesterday. After a --repair, the server seemed fine for a while; then, while we were copying the log file on the server (the -v switch proved very verbose indeed: we had over 2 GB of logs for a few hours of activity), the qw figure shot up drastically, as did the number of inbound connections (the clients were trying to compensate for the blocking queries), right before the server stopped serving queries altogether.
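For reference, the qw figure we're watching is the one reported by serverStatus under globalLock.currentQueue. Below is a minimal monitoring sketch (an illustration, not our production tooling) that samples it with pymongo; the connection string and polling interval are assumptions:

import time
from pymongo import MongoClient

# Assumed connection string; adjust for the actual deployment.
client = MongoClient("mongodb://localhost:27017")

while True:
    status = client.admin.command("serverStatus")
    queue = status["globalLock"]["currentQueue"]   # queued readers/writers
    conns = status["connections"]["current"]       # open inbound connections
    print("qr=%d qw=%d conns=%d" % (queue["readers"], queue["writers"], conns))
    time.sleep(5)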
I've uploaded part of the logs for this episode to JIRA...
Later on, when we issued a repair on the database, the log showed:
Tue Jan 17 00:09:45 [initandlisten] warning: ClientCursor::yield can't unlock b/c of recursive lock ns: Main.cachedItems top: { opid: 220, active: true, waitingForLock: false, secs_running: 0, op: "getmore", ns: "Main.cachedItems", query: {}, client: "0.0.0.0:0", desc: "initandlisten", numYields: 0 }
Tue Jan 17 00:09:46 [initandlisten] warning: ClientCursor::yield can't unlock b/c of recursive lock ns: Main.cachedItems top: { opid: 221, active: true, waitingForLock: false, secs_running: 0, op: "getmore", ns: "Main.cachedItems", query: {}, client: "0.0.0.0:0", desc: "initandlisten", numYields: 0 }
(and so on for the 223 stuck queued writers).
So it looks like some of our queries (we're implementing a poor man's distributed lock using atomic sets and checks) may become problematic when mongod is under duress (I/O starvation, or other factors). We're moving the lock mechanism to a dedicated instance, and we're working more generally on our data access profile to minimize writes relative to reads, but we need to find the root cause of this incident.
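For context, the lock pattern looks roughly like the sketch below (a simplified illustration in current pymongo, not our actual code; the locks collection, field names, and timestamps are assumptions): each acquire is a single atomic "set if free" update, and each release checks ownership before clearing it.

import datetime
from pymongo import MongoClient, ReturnDocument

client = MongoClient("mongodb://localhost:27017")
locks = client["Main"]["locks"]   # hypothetical collection; one pre-created doc per lock, owner: None when free

def try_acquire(lock_name, owner):
    # Atomic check-and-set: claim the lock only if nobody currently owns it.
    doc = locks.find_one_and_update(
        {"_id": lock_name, "owner": None},
        {"$set": {"owner": owner, "acquired_at": datetime.datetime.utcnow()}},
        return_document=ReturnDocument.AFTER,
    )
    return doc is not None

def release(lock_name, owner):
    # Release only if we still own the lock, so we never clear another claimant.
    locks.update_one({"_id": lock_name, "owner": owner}, {"$set": {"owner": None}})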