[SERVER-3996] Backport fix for SERVER-3002 to v1.8 branch Created: 01/Oct/11 Updated: 02/Aug/18 Resolved: 20/Oct/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 1.8.2 |
| Fix Version/s: | 1.8.4 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Ben Becker | Assignee: | Gregory McKeon (Inactive) |
| Resolution: | Done | Votes: | 3 |
| Labels: | CursorCache, cursor, timeout | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Ubuntu 10.04 |
||
| Attachments: |
|
| Participants: |
| Description |
|
We seem to be encountering this issue in v1.8.2 and wanted to see about backporting the fix for The fix in the v2 (related to While that fix is perfectly valid and seems like it could easily be backported, I was also curious whether a C++11 feature like map::erase()'s return iterator is allowed in the current Mongo source. My thinking is we can replace the fix here: _cursors.erase( i ); with: i = _cursors.erase Though perhaps I'm missing a reason why it's better to start from the begin()ning... I assumed the scoped lock for _mutex would prevent any updates to the _cursors map, and it looks like the timeout check uses the 'now' value acquired before we start iterating. Tangentially, it looks like CursorCache::~CursorCache() doesn't acquire the _mutex lock before checking _cursors.size(), though I'm unsure if this is actually an issue. Anyway, the issue we're seeing just started recently, and has been observed on 5 out of 6 servers that run mongos (all nearly identical EC2 instances). Here's the portion of the log file that contains the error, and apologies in advance for the lack of verbose output (occurs only on our production servers, but I can enable verbose logging if it will help). Sat Oct 1 05:00:16 [cursorTimeout] killing old cursor 3588663744748048245 idle for: 600028ms If this segv appears unrelated to |
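For illustration, the idea in the description of continuing from the iterator returned by erase() might look like the sketch below. It assumes a C++11 compiler, where std::map::erase(iterator) returns the iterator following the erased element. The map layout (cursor id to last-used time in milliseconds) and the name eraseIdleCxx11 are hypothetical stand-ins, not the actual mongos CursorCache code.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical stand-in for the cursor map: cursor id -> last-used time (ms).
typedef std::map<long long, long long> IdleMap;

// Removes every entry idle for at least timeoutMillis, using a 'now' value
// captured before iteration starts. Requires C++11: erase() returns the
// iterator to the element after the erased one, so the loop never touches
// an invalidated iterator. Returns the number of entries removed.
std::size_t eraseIdleCxx11(IdleMap& cursors, long long now, long long timeoutMillis) {
    std::size_t removed = 0;
    for (IdleMap::iterator i = cursors.begin(); i != cursors.end(); ) {
        if (now - i->second >= timeoutMillis) {
            i = cursors.erase(i);  // continue from the next valid element
            ++removed;
        } else {
            ++i;
        }
    }
    return removed;
}
```

In a pre-C++11 build this form does not compile for std::map, because erase(iterator) returns void there.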
| Comments |
| Comment by Greg Studer [ 20/Oct/11 ] |
|
backported to 1.8.4 |
| Comment by auto [ 20/Oct/11 ] |
|
Author: gregs (gregstuder) <greg@10gen.com> Message: backport of fix for cursor timeout iteration, from jira discussion |
| Comment by Ben Becker [ 17/Oct/11 ] |
|
Carlos, I'm afraid I don't have much insight to offer at this point, but if time permits I'll dig further into the cursor log entries and report back (or open a new issue). |
| Comment by Carlos Rodriguez [ 17/Oct/11 ] |
|
Ben, I tried applying the patch in your fork to 1.8 nightly, but after switching to the patched mongos, a few hours later saw a slew of "too many attempts to update config" exceptions from the php driver, which is at 1.2.6 (could be |
| Comment by Ben Becker [ 17/Oct/11 ] |
|
Just to follow up: All servers have been up and running with the supplied v1.8.2 patch for over a week without incident. The optimization for the v2 branch has not been tested yet. |
| Comment by Ben Becker [ 07/Oct/11 ] |
|
Submitted github pull requests for the patch to be applied to the v1.8 branch. Submitted a secondary pull request to replace the original patch for All testing has been successful with the supplied patch so far, but no testing has been done with the same patch against v2. |
| Comment by Ben Becker [ 06/Oct/11 ] |
|
Just to update: I've pushed this to two more servers, and the only server that has crashed with the segv in CursorCache::doTimeouts is the unpatched one. |
| Comment by Ben Becker [ 04/Oct/11 ] |
|
The supplied patch has held up for 3 days now with no trouble so far. It's been running alongside other servers without the patch, and thus far 3 out of 5 non-patched mongos instances have crashed in CursorCache::doTimeouts, and the patched server has been running fine (no noticeable mem leaks, db connections seem stable, etc). BTW, it would be great to sneak in this patch (or a backport of Best, |
| Comment by Carlos Rodriguez [ 04/Oct/11 ] |
|
I think I'm running into the same bug, which periodically crashes our mongos servers when under high activity. Mon Oct 3 07:33:40 [cursorTimeout] killing old cursor 5371030602413797610 idle for: 600102ms Upgrading to 2.0 is not an option for us at this point; we tried it and there were disastrous performance problems. Would love to get this in the 1.8 branch. @Ben, let me know how the patch is faring so far. I may have to use it if 1.8.4 doesn't land soon. Thanks |
| Comment by Ben Becker [ 02/Oct/11 ] |
|
Please disregard the C++11 comment earlier; it is easier to change the for-loop to a while-loop and post-increment the iterator in the call to erase(). Currently testing a custom build from the 1.8.2 tag with the attached patch applied. I will leave this running alongside vanilla v1.8.2 servers to verify this fixes the issue reported.
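As a rough sketch of the while-loop approach described in this comment (not the attached patch itself), the post-increment idiom for a pre-C++11 std::map could look like this. FakeCursor, CursorMap, and eraseIdle are hypothetical names chosen for the example; the real CursorCache holds different types and does its work under the _mutex lock.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical stand-in for a cursor entry: only tracks its last-used time.
struct FakeCursor {
    long long lastUsedMillis;
};

typedef std::map<long long, FakeCursor> CursorMap;

// Removes every entry idle for at least timeoutMillis and returns the count.
// Works without C++11: 'i++' advances the iterator first and hands the old
// position to erase(), so 'i' never refers to the erased element.
std::size_t eraseIdle(CursorMap& cursors, long long now, long long timeoutMillis) {
    std::size_t removed = 0;
    CursorMap::iterator i = cursors.begin();
    while (i != cursors.end()) {
        if (now - i->second.lastUsedMillis >= timeoutMillis) {
            cursors.erase(i++);  // erase the old position, keep the advanced one
            ++removed;
        } else {
            ++i;
        }
    }
    return removed;
}
```

Erasing from a std::map invalidates only iterators to the erased element, which is why the already-advanced iterator stays usable and the loop does not need to restart from begin().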