[SERVER-17362] ClientCursor::staticYield can't unlock b/c of recursive lock ns Created: 24/Feb/15  Updated: 18/Jun/15  Resolved: 18/Jun/15

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 2.6.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ian Rice Assignee: Ramon Fernandez Marina
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

SSD drives
CentOS 6.5


Attachments: Zip Archive mongo-17362.zip    
Issue Links:
Related
is related to SERVER-17054 Index not being utilized, on occasion. Closed
Operating System: Linux
Steps To Reproduce:

It's intermittent and I can't replicate it. It's not even specific to times of heavy load.

Participants:

 Description   

Intermittently my logs fill with "ClientCursor::staticYield can't unlock b/c of recursive lock ns ..." (millions of entries). It happens on a lot of different collections, but always on a findAndModify. At its worst the lock lasts 6-7 seconds, and the entire database is unresponsive during this time.
I've read that it may be related to insufficient indexing, but I've checked that with .explain() and the cursor is a BtreeCursor.
The issue is similar to SERVER-15583, but an upgrade to 2.6.6 solved it for that reporter, whereas I'm already on 2.6.6.
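
For reference, a minimal sketch of the check described above (collection, index, and query values are illustrative only, taken from the logs attached later in this ticket):

// Verify that a plain find() on the same predicate uses the index
// (expect "cursor" : "BtreeCursor yppie_1" in the explain output).
db.morethan160.find({ yppie: "pQlY" }).explain(true)

// The findAndModify whose query portion should be satisfied by the same index
// (update trimmed down for illustration):
db.morethan160.findAndModify({
    query:  { yppie: "pQlY" },
    sort:   { _id: 1 },
    update: { $inc: { clickcount: 1 } },
    new:    true
})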



 Comments   
Comment by Ramon Fernandez Marina [ 29/May/15 ]

ricei, I believe you may be running into the same issue as described in SERVER-17054; please see this comment for an explanation of that issue.

To be sure that your issue has the same root cause as SERVER-17054, please run the commands below and send the output:

// Save current logLevel, bump logLevel to 2
var oldLogLevel = db.adminCommand({getParameter:1, logLevel:1}).logLevel
db.adminCommand( { setParameter: 1, logLevel: 2 } )
 
db.col.getPlanCache().clear()
db.col.find({ yppie: "pQ4c" }, {}).sort({ _id : 1}).explain(true)
db.col.getPlanCache().getPlansByQuery({ yppie: "pQ4c" }, {}, { _id : 1})
db.col.getPlanCache().listQueryShapes()
 
// Reset logLevel
db.adminCommand( { setParameter: 1, logLevel: oldLogLevel } )

You'll note that these commands increase the log level, so I'd also like to see the logs generated while these commands are run to better diagnose this issue.

Thanks,
Ramón.

Comment by Ramon Fernandez Marina [ 21/May/15 ]

Thanks for the additional information, ricei. The query plan cache is indeed our main suspect; the harder part is triggering the exact behavior with a suitable dataset. I will try with unindexed queries as you suggest and report back with my findings.

Comment by Ian Rice [ 21/May/15 ]

As an update to this: I discovered that the issue was triggered by a previous unindexed query. That is, a query that uses a collection scan rather than an index is run on the collection, and the subsequent findAndModify, which should use an index, performs a scan instead. It seems the findAndModify races to select which access method to use, the result of the previous query influences the outcome of that race, and the incorrect method ends up being selected.
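
A rough sketch of that sequence as I understand it (collection and field names are illustrative, taken from the logs in this ticket; someUnindexedField is hypothetical):

// 1. A query on an unindexed field forces a collection scan.
db.morethan160.find({ someUnindexedField: "x" }).toArray()

// 2. A subsequent findAndModify whose query is indexed (yppie_1) may now
//    pick the wrong plan, per the report above.
db.morethan160.findAndModify({
    query:  { yppie: "pQlY" },
    sort:   { _id: 1 },
    update: { $inc: { clickcount: 1 } },
    new:    true
})

// 3. Inspect which plans are cached for that query shape.
db.morethan160.getPlanCache().getPlansByQuery({ yppie: "pQlY" }, {}, { _id: 1 })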

Comment by Ramon Fernandez Marina [ 21/May/15 ]

ricei, apologies for the long delay in getting back to you. While I have not been able to reproduce this behavior locally, I wanted to let you know that this behavior has been removed in version 3.0, and findAndModify() queries can now yield.

We'll continue to investigate why findAndModify() seems to ignore available indexes to satisfy the query, but if upgrading to 3.0.3 is an option for you, it would be helpful if you could let us know how 3.0.3 behaves compared to 2.6.9.

Thanks,
Ramón.

Comment by Gui Forget [ 29/Mar/15 ]

We are on 2.6.9 and seeing a very similar issue. It's causing write lock contention and affecting the performance of our app.

2015-03-28T04:47:25.818-0500 [conn147815] warning: ClientCursor::staticYield can't unlock b/c of recursive lock ns:  top: { opid: 292929752, active: true, secs_running: 0, microsecs_running: 372, op: "query", ns: "community.userProductActivities", 
query: { findandmodify: "userProductActivities", query: { pid: 470509812, uid: 26502826 }, fields: { view.cViews: 1 }, update: { $set: { lm: new Date(1427536045818), view.lm: new Date(1427536045818) }, $inc: { view.cViews: 1 } }, upsert: true }, 
client: "10.171.74.4:41770", desc: "conn147815", threadId: "0x7e800772a700", connectionId: 147815, locks: { ^: "w", ^community: "W" }, waitingForLock: false, numYields: 0, lockStats: { timeLockedMicros: {}, timeAcquiringMicros: { r: 0, w: 2 } } }

I have attached the output of the commands requested earlier.

Comment by Ian Rice [ 25/Feb/15 ]

Some more logs:

 
2015-02-23T21:25:31.797+0000 [conn85394] warning: ClientCursor::staticYield can't unlock b/c of recursive lock ns:  top: { opid: 23228747, active: true, secs_running: 6, microsecs_running: 6598838, op: "query", ns: "mapandmore", query: { findandmodify: "morethan160", query: { yppie: "pJ11" }, sort: { _id: 1 }, new: 1, remove: 0, upsert: 0, update: { $inc: { clickcount: 1 }, $push: { when: new Date(1424726725198), clickbucket: [ 1424649600000.0, "23-2-2015 21-hour", "23-2-2015-day", "2-2015-month", "2015-1-quarter", "2015-year" ] } } }, client: "46.137.175.200:32780", desc: "conn85394", threadId: "0x7fa593451700", connectionId: 85394, locks: { ^: "w", ^mapandmore: "W" }, waitingForLock: false, numYields: 0, lockStats: { timeLockedMicros: {}, timeAcquiringMicros: { r: 0, w: 4 } } }
2015-02-23T21:25:31.797+0000 [conn85394] query mapandmore.morethan160 query: { query: { yppie: "pJ11" }, orderby: { _id: 1 } } planSummary: IXSCAN { _id: 1 } ntoreturn:1 ntoskip:0 nscanned:2145159 nscannedObjects:2145159 keyUpdates:0 numYields:0  nreturned:1 reslen:3256 6598ms
2015-02-23T21:25:31.803+0000 [conn85394] command mapandmore.$cmd command: findAndModify { findandmodify: "morethan160", query: { yppie: "pJ11" }, sort: { _id: 1 }, new: 1, remove: 0, upsert: 0, update: { $inc: { clickcount: 1 }, $push: { when: new Date(1424726725198), clickbucket: [ 1424649600000.0, "23-2-2015 21-hour", "23-2-2015-day", "2-2015-month", "2015-1-quarter", "2015-year" ] } } } keyUpdates:0 numYields:0 locks(micros) w:6603612 reslen:3540 6604ms

2015-02-23T21:29:00.053+0000 [conn85398] warning: ClientCursor::staticYield can't unlock b/c of recursive lock ns:  top: { opid: 23232763, active: true, secs_running: 6, microsecs_running: 6379877, op: "query", ns: "mapandmore", query: { $msg: "query not recording (too large)" }, client: "46.137.175.200:32784", desc: "conn85398", threadId: "0x7fa593552700", connectionId: 85398, locks: { ^: "w", ^mapandmore: "W" }, waitingForLock: false, numYields: 0, lockStats: { timeLockedMicros: {}, timeAcquiringMicros: { r: 0, w: 5 } } }
2015-02-23T21:29:00.053+0000 [conn85398] query mapandmore.morethan160 query: { query: { yppie: "pP4c" }, orderby: { _id: 1 } } planSummary: IXSCAN { _id: 1 } ntoreturn:1 ntoskip:0 nscanned:2163399 nscannedObjects:2163399 keyUpdates:0 numYields:0  nreturned:1 reslen:1911 6380ms

2015-02-23T21:29:58.975+0000 [conn85397] warning: ClientCursor::staticYield can't unlock b/c of recursive lock ns:  top: { opid: 23234034, active: true, secs_running: 6, microsecs_running: 6412930, op: "query", ns: "mapandmore", query: { findandmodify: "morethan160", query: { yppie: "pQlY" }, sort: { _id: 1 }, new: 1, remove: 0, upsert: 0, update: { $inc: { clickcount: 1 }, $push: { when: new Date(1424726992562), clickbucket: [ 1424649600000.0, "23-2-2015 21-hour", "23-2-2015-day", "2-2015-month", "2015-1-quarter", "2015-year" ] } } }, client: "46.137.175.200:32783", desc: "conn85397", threadId: "0x7fa59324f700", connectionId: 85397, locks: { ^: "w", ^mapandmore: "W" }, waitingForLock: false, numYields: 0, lockStats: { timeLockedMicros: {}, timeAcquiringMicros: { r: 0, w: 3 } } }
2015-02-23T21:29:58.976+0000 [conn85397] query mapandmore.morethan160 query: { query: { yppie: "pQlY" }, orderby: { _id: 1 } } planSummary: IXSCAN { _id: 1 } ntoreturn:1 ntoskip:0 nscanned:2167279 nscannedObjects:2167279 keyUpdates:0 numYields:0  nreturned:1 reslen:3481 6413ms
2015-02-23T21:29:58.989+0000 [conn85397] command mapandmore.$cmd command: findAndModify { findandmodify: "morethan160", query: { yppie: "pQlY" }, sort: { _id: 1 }, new: 1, remove: 0, upsert: 0, update: { $inc: { clickcount: 1 }, $push: { when: new Date(1424726992562), clickbucket: [ 1424649600000.0, "23-2-2015 21-hour", "23-2-2015-day", "2-2015-month", "2015-1-quarter", "2015-year" ] } } } keyUpdates:0 numYields:0 locks(micros) w:6425980 reslen:3765 6426ms

It seems to be an issue with the findAndModify choosing to ignore the available index. After each of the long runs of "recursive lock" warnings there is a log line stating that the planSummary is IXSCAN { _id: 1 } with nscanned:2167279 and nscannedObjects:2167279,
but the same query explains with the yppie index:

db.morethan160.find({ yppie: "pQlY" }).explain(true)
{
	"cursor" : "BtreeCursor yppie_1",
	"isMultiKey" : false,
	"n" : 1,
	"nscannedObjects" : 1,
	"nscanned" : 1,
	"nscannedObjectsAllPlans" : 1,
	"nscannedAllPlans" : 1,
	"scanAndOrder" : false,
	"indexOnly" : false,
	"nYields" : 0,
	"nChunkSkips" : 0,
	"millis" : 0,
	"indexBounds" : {
		"yppie" : [
			[
				"pQlY",
				"pQlY"
			]
		]
	},
	"allPlans" : [
		{
			"cursor" : "BtreeCursor yppie_1",
			"isMultiKey" : false,
			"n" : 1,
			"nscannedObjects" : 1,
			"nscanned" : 1,
			"scanAndOrder" : false,
			"indexOnly" : false,
			"nChunkSkips" : 0,
			"indexBounds" : {
				"yppie" : [
					[
						"pQlY",
						"pQlY"
					]
				]
			}
		}
	],
	"server" : "apollo:27017",
	"filterSet" : false,
	"stats" : {
		"type" : "FETCH",
		"works" : 2,
		"yields" : 0,
		"unyields" : 0,
		"invalidates" : 0,
		"advanced" : 1,
		"needTime" : 0,
		"needFetch" : 0,
		"isEOF" : 1,
		"alreadyHasObj" : 0,
		"forcedFetches" : 0,
		"matchTested" : 0,
		"children" : [
			{
				"type" : "IXSCAN",
				"works" : 2,
				"yields" : 0,
				"unyields" : 0,
				"invalidates" : 0,
				"advanced" : 1,
				"needTime" : 0,
				"needFetch" : 0,
				"isEOF" : 1,
				"keyPattern" : "{ yppie: 1.0 }",
				"isMultiKey" : 0,
				"boundsVerbose" : "field #0['yppie']: [\"pQlY\", \"pQlY\"]",
				"yieldMovedCursor" : 0,
				"dupsTested" : 0,
				"dupsDropped" : 0,
				"seenInvalidated" : 0,
				"matchTested" : 0,
				"keysExamined" : 1,
				"children" : [ ]
			}
		]
	}
}
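
Given the plan-cache suspicion discussed elsewhere in this ticket, one diagnostic step to try when the warnings appear (a sketch only, not a confirmed fix) is to drop the cached plans for the collection and see whether the next findAndModify picks the yppie index again:

// Clear all cached plans for the collection, forcing re-planning...
db.morethan160.getPlanCache().clear()

// ...or clear only the plans for this query shape (query, projection, sort).
db.morethan160.getPlanCache().clearPlansByQuery({ yppie: "pQlY" }, {}, { _id: 1 })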

Comment by Ian Rice [ 24/Feb/15 ]

As some additional background: some of the collections have existed for 4+ years, and MongoDB has been upgraded through each of the major releases. Additionally, in January the storage was swapped over to SSD.

Comment by Ramon Fernandez Marina [ 24/Feb/15 ]

ricei, we're going to need more information to understand what's going on. My hunch is that, even if your findAndModify() queries are using an index, they may be using a sub-optimal index in some situations. In addition to server logs from around the time when this happens (spanning at least a few seconds before and after), can you please send us the output of the following commands?

db.collection.getPlanCache().listQueryShapes()
db.collection.getPlanCache().getPlansByQuery({query-from-findAndModify-here})
db.collection.getIndexes();
db.collection.find({query-from-findAndModify-here}).explain(true);
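
Filled in against the collection and query shape from the logs further up in this ticket (illustrative only; substitute the actual findAndModify query), these would look something like:

db.morethan160.getPlanCache().listQueryShapes()
db.morethan160.getPlanCache().getPlansByQuery({ yppie: "pQlY" })
db.morethan160.getIndexes()
db.morethan160.find({ yppie: "pQlY" }).explain(true)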
