Details
- Bug
- Resolution: Incomplete
- Major - P3
- None
- 2.0.2
- None
- Windows x64
- Windows
Description
We've posted on the mongodb-user list what we believe is our core issue right now: the queued writers (qw) count, which never seems to go down once it has reached a certain threshold.
The thread is here: http://groups.google.com/group/mongodb-user/browse_thread/thread/195264a598d87393#
We were able to reproduce the problem yesterday. After a --repair, the server seemed fine for a while; then, while we were copying the log file on the server (the -v switch proved very verbose indeed: we had over 2 GB of logs for a few hours of activity), the qw figure shot up drastically, as did the number of inbound connections (the clients were trying to compensate for the blocking queries), right before the server stopped serving queries altogether.
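For reference, the qw figure we're watching is the one reported by serverStatus under globalLock.currentQueue. Below is a minimal monitoring sketch (an illustration, not our production tooling) that samples it with pymongo; the connection string and polling interval are assumptions:

import time
from pymongo import MongoClient

# Assumed connection string; adjust for the actual deployment.
client = MongoClient("mongodb://localhost:27017")

while True:
    status = client.admin.command("serverStatus")
    queue = status["globalLock"]["currentQueue"]   # queued readers/writers
    conns = status["connections"]["current"]       # open inbound connections
    print("qr=%d qw=%d conns=%d" % (queue["readers"], queue["writers"], conns))
    time.sleep(5)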
I've uploaded part of the logs for this episode to JIRA...
Later on, when we issued a repair on the database, the log showed:
Tue Jan 17 00:09:45 [initandlisten] warning: ClientCursor::yield can't unlock b/c of recursive lock ns: Main.cachedItems top: { opid: 220, active: true, waitingForLock: false, secs_running: 0, op: "getmore", ns: "Main.cachedItems", query: {}, client: "0.0.0.0:0", desc: "initandlisten", numYields: 0 }
Tue Jan 17 00:09:46 [initandlisten] warning: ClientCursor::yield can't unlock b/c of recursive lock ns: Main.cachedItems top: { opid: 221, active: true, waitingForLock: false, secs_running: 0, op: "getmore", ns: "Main.cachedItems", query: {}, client: "0.0.0.0:0", desc: "initandlisten", numYields: 0 }
(and so on for the 223 stuck queued writers).
So it looks like some of our queries (we're implementing a poor man's distributed lock using atomic sets and checks) may become problematic when mongod is under duress (I/O starvation, or other factors). We're moving the lock mechanism to a dedicated instance, and we're working more generally on our data access profile to minimize writes relative to reads, but we need to find the root cause of this incident.
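For context, the lock pattern looks roughly like the sketch below (a simplified illustration in current pymongo, not our actual code; the locks collection, field names, and timestamps are assumptions): each acquire is a single atomic "set if free" update, and each release checks ownership before clearing it.

import datetime
from pymongo import MongoClient, ReturnDocument

client = MongoClient("mongodb://localhost:27017")
locks = client["Main"]["locks"]   # hypothetical collection; one pre-created doc per lock, owner: None when free

def try_acquire(lock_name, owner):
    # Atomic check-and-set: claim the lock only if nobody currently owns it.
    doc = locks.find_one_and_update(
        {"_id": lock_name, "owner": None},
        {"$set": {"owner": owner, "acquired_at": datetime.datetime.utcnow()}},
        return_document=ReturnDocument.AFTER,
    )
    return doc is not None

def release(lock_name, owner):
    # Release only if we still own the lock, so we never clear another claimant.
    locks.update_one({"_id": lock_name, "owner": owner}, {"$set": {"owner": None}})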