[SERVER-4086] OpenCursors spikes and needs primary reset to recover Created: 17/Oct/11  Updated: 24/Oct/11  Resolved: 24/Oct/11

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Brent Pitman Assignee: Kristina Chodorow (Inactive)
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux, 1.8.3, aws:m2.4xlarge, filesystem on tmpfs


Attachments: PNG File mongodb_cursors.png    
Issue Links:
Related
Operating System: ALL
Participants:

 Description   

Something happens and our open cursors start climbing. We alert internally when it hits a number much higher than normal (55). To recover, we restart the primary. Stopping the primary sometimes fails and requires kill -9. We regularly lose data when we stop the primary (we're ok with some data loss, but the loss seems extreme - orders of magnitude more than shutting down a mysql master and moving to a replica). Note that this happens several times per week (it can happen 5-10 times in a 24 hour period). The last event started at ~12:50am PDT, 10/17/2011 (our alarm went off at ~2:00am and the master was restarted at ~2:20am).

See mms:
clsol:PRIMARY> rs.conf()
{
    "_id" : "clsol",
    "version" : 1,
    "members" : [
        { "_id" : 0, "host" : "ec2-50-17-247-64.compute-1.amazonaws.com:27017" },
        { "_id" : 1, "host" : "ec2-50-17-247-65.compute-1.amazonaws.com:27017" },
        { "_id" : 2, "host" : "ec2-50-17-247-66.compute-1.amazonaws.com:27017" }
    ]
}
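
For context, the open-cursor figure we alert on is the server's cursor counter; a minimal way to read it from the 1.8-era shell (a sketch only - this is not our exact alerting script, and the serverStatus field names are assumed from that release) is:

// run against the primary; prints the current open and timed-out cursor counts
// (serverStatus().cursors is where 1.8.x reports these counters)
var c = db.serverStatus().cursors;
print("open: " + c.totalOpen + ", timedOut: " + c.timedOut);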



 Comments   
Comment by Brent Pitman [ 24/Oct/11 ]

Please close this ticket and move discussion to Community Private SUPPORT-169.

Comment by Kristina Chodorow (Inactive) [ 24/Oct/11 ]

You can open a case in https://jira.mongodb.org/browse/SUPPORT and it'll only be accessible to you and 10gen.

Have you tried running db.currentOp() when the number of cursors gets high? If not, can you try that next time?
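
For example, a minimal snippet to capture just the long-running operations at that moment (a sketch; secs_running is the usual currentOp field name, adjust the threshold as needed):

// dump any operation that has been running for more than ~5 seconds
db.currentOp().inprog.forEach(function(op) {
    if (op.secs_running && op.secs_running > 5) printjson(op);
});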

Comment by Brent Pitman [ 24/Oct/11 ]

This appears to be a publicly viewable ticket, so we can't attach the log file. It's 18MB (compressed). How would you like us to get it to you?

Comment by Brent Pitman [ 24/Oct/11 ]

It happened again early this morning at nearly the same time as last week. Attached is a graph showing week-over-week open cursors. In both cases, we "recovered" by restarting the primary.

Comment by Spencer Brody (Inactive) [ 18/Oct/11 ]

Can you attach your mongod logs? They generally compress well from all the repeated text, so if you zip/tar them they should be quite small.

Comment by Brent Pitman [ 18/Oct/11 ]

We use the perl driver.

I'm not sure how to characterize the queries. The main table has 4 fields and ~20M entries. Two of them (node,ds) are used for most queries. There is an index, and most queries have zero or one match, but can have as many as three. We do two expensive queries to find all distinct nodes (~5000 of them, which can reuse the index) and all distinct slots (~50 of them, no index). We also do queries for all filenames for a slot (~250,000) and all ds for a node (~10,000). No system resources appear to be under stress. A rough shape of the workload is sketched below.
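
Roughly (collection name "metrics" and the example values are placeholders; the field names node, ds, slot are as described above):

// compound index used by most lookups
db.metrics.ensureIndex({ node: 1, ds: 1 });
// typical query: zero to three matches
db.metrics.find({ node: "n1", ds: "cpu" });
// expensive scans: ~5,000 distinct nodes (can reuse the index), ~50 distinct slots (no index)
db.metrics.distinct("node");
db.metrics.distinct("slot");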

Comment by Eliot Horowitz (Inactive) [ 18/Oct/11 ]

What kinds of queries are you doing?
What driver?

Comment by Brent Pitman [ 17/Oct/11 ]

I don't know what's causing the high cursors. We know that as it gets worse, we start timing out and retrying more (probably adding to the problem). But we don't know what starts it.

Comment by Brent Pitman [ 17/Oct/11 ]

Data loss depends on what else is going on. Our max insert rate into mongodb seems to be ~200k per min (I'd like to work with you to see if we can increase this by 1-2 orders of magnitude). With a 10 sec delay, we're looking at 30k+ lost (200k/min is ~3.3k/sec, so a 10 sec gap is ~33k records).

I'll look into the w=2 option. We're storing data related to metrics, so we're pretty sensitive to latency (if things are too slow, we can either fall behind or drop data - we choose the latter). I need to see what w=2 does to overall latency/throughput.

Comment by Eliot Horowitz (Inactive) [ 17/Oct/11 ]

How much data loss are you seeing?

One option to throttle is to have the code doing the writes occasionally do a w=2.
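
In shell terms that amounts to issuing getLastError with w=2 after a write, e.g. (collection name, document, and wtimeout value are illustrative; most drivers of that era expose the same getLastError options):

// insert, then block until at least 2 replica set members (primary + 1 secondary)
// have the write, or give up after 5 seconds
db.metrics.insert({ node: "n1", ds: "cpu", value: 42 });
db.runCommand({ getLastError: 1, w: 2, wtimeout: 5000 });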

Do you know what's causing the high cursors?

Comment by Brent Pitman [ 17/Oct/11 ]

Journaling is off by default, correct?

mongod 20187 1 3 Oct13 ? 03:54:54 /usr/bin/mongod -f /etc/mongod.conf

logpath = /logs/mongo/mongod.log
logappend=true
fork = true
dbpath = /dev/shm/mongod
replSet = clsol
rest = true
oplogSize = 4096
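
(Note: the config above has no journal setting; in 1.8.x journaling is off unless explicitly enabled, e.g. with the line below - though with the dbpath on tmpfs the data is memory-backed and won't survive a host restart either way.)

# not present above; journaling in 1.8.x is opt-in via --journal or this setting
journal = true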

The replication lag is variable. Our app is generally busy during the first half of each minute and much less so during the second half. During the first half, it's common to see a lag of 5-10 secs; it's usually 0-1 sec during the latter half. However, when the number of cursors gets really high (thousands), we see the replication lag exceed 15 minutes. Now that we alert when we reach 55 open cursors, though, we should never see replication get this far behind.
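
A quick way we could sample that lag from the shell (a sketch; it just compares member optimes from rs.status(), so it's only as granular as the last applied op):

// rough lag estimate per secondary, in seconds, relative to the primary's last op
var s = rs.status();
var primaryOptime = null;
s.members.forEach(function(m) { if (m.state == 1) primaryOptime = m.optimeDate; });
s.members.forEach(function(m) {
    if (m.state == 2) print(m.name + " lag: " + (primaryOptime - m.optimeDate) / 1000 + "s");
});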

Comment by Eliot Horowitz (Inactive) [ 17/Oct/11 ]

Are you running with journalling?

What's the normal secondary lag?
