[SERVER-3996] Backport fix for SERVER-3002 to v1.8 branch Created: 01/Oct/11  Updated: 02/Aug/18  Resolved: 20/Oct/11

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 1.8.2
Fix Version/s: 1.8.4

Type: Task Priority: Major - P3
Reporter: Ben Becker Assignee: Gregory McKeon (Inactive)
Resolution: Done Votes: 3
Labels: CursorCache, cursor, timeout
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 10.04
EC2 large instance
mongos v1.8.2 (from 10gen apt source)
php driver v1.0.11
mongos is connected to 2 shards along with 7 other instances of mongos on other hosts.


Attachments: File mongo_doTimeouts_fix.diff    
Participants:

 Description   

We seem to be encountering this issue in v1.8.2 and wanted to see about backporting the fix for SERVER-3002 to the v1.8 branch. Looking at the v1.8 branch, I do not see a similar fix for calling the increment operator on an erased iterator: https://github.com/mongodb/mongo/blob/v1.8/s/cursors.cpp#L268.

The fix in the v2.0 branch (from SERVER-3002) starts iterating again from the beginning: https://github.com/mongodb/mongo/blob/v2.0/s/cursors.cpp#L277

While that fix is perfectly valid and seems like it could easily be backported, I was also curious whether a C++11 feature like map::erase()'s returned iterator is allowed in the current Mongo source. My thinking is that we could replace the fix here:

_cursors.erase( i );
i = _cursors.begin(); // possible 2nd entry will get skipped, will get on next pass

with:

i = _cursors.erase( i );

Though perhaps I'm missing another reason why it's better to start from begin()... I assumed the scoped lock on _mutex would prevent any updates to the _cursors map, and it looks like the timeout check uses the 'now' value acquired before we start iterating. Tangentially, it looks like CursorCache::~CursorCache() doesn't acquire the _mutex lock before checking _cursors.size(), though I'm unsure whether this is actually an issue.
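
For illustration, here is a minimal, self-contained sketch of the C++11 pattern being proposed, using a placeholder map type and timeout value rather than the actual CursorCache code (map::erase() returning the following iterator requires C++11 or a library extension):

#include <map>

typedef std::map<long long, long long> CursorMap; // placeholder: cursorId -> last access time (ms)

const long long TIMEOUT = 600000; // placeholder timeout, ms

void timeoutCursors( CursorMap& cursors , long long now ) {
    for ( CursorMap::iterator i = cursors.begin(); i != cursors.end(); /* no ++i here */ ) {
        if ( now - i->second > TIMEOUT )
            i = cursors.erase( i ); // C++11: erase() returns the iterator after the erased element
        else
            ++i;
    }
}

This avoids both incrementing an invalidated iterator (the v1.8 crash) and rescanning from begin() after every erase.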

Anyway, the issue we're seeing started only recently, and has been observed on 5 out of 6 servers that run mongos (all nearly identical EC2 instances). Here's the portion of the log file that contains the error; apologies in advance for the lack of verbose output (it occurs only on our production servers, but I can enable verbose logging if that will help).

Sat Oct 1 05:00:16 [cursorTimeout] killing old cursor 3588663744748048245 idle for: 600028ms
Received signal 11
Backtrace: 0x52f8f5 0x7f65d137aaf0 0x7f65d1bd6533 0x6711c5 0x526a03 0x50454b 0x505e04 0x6a50a0 0x7f65d1e7e9ca 0x7f65d142d70d
/usr/bin/mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x52f8f5]
/lib/libc.so.6(+0x33af0)[0x7f65d137aaf0]
/usr/lib/libstdc++.so.6(_ZSt18_Rb_tree_incrementPSt18_Rb_tree_node_base+0x13)[0x7f65d1bd6533]
/usr/bin/mongos(_ZN5mongo11CursorCache10doTimeoutsEv+0x75)[0x6711c5]
/usr/bin/mongos(_ZN5mongo4task4Task3runEv+0x33)[0x526a03]
/usr/bin/mongos(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0x12b)[0x50454b]
/usr/bin/mongos(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS7_9JobStatusEEEEENS2_5list2INS2_5valueIPS7_EENSD_ISA_EEEEEEE3runEv+0x74)[0x505e04]
/usr/bin/mongos(thread_proxy+0x80)[0x6a50a0]
/lib/libpthread.so.0(+0x69ca)[0x7f65d1e7e9ca]
/lib/libc.so.6(clone+0x6d)[0x7f65d142d70d]
===
Received signal 11
Backtrace: 0x52f8f5 0x7f65d137aaf0 0x532ea0 0x577654 0x577c71 0x630a9e 0x6361c8 0x66841c 0x67d187 0x580b7c 0x6a50a0 0x7f65d1e7e9ca 0x7f65d142d70d
/usr/bin/mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x52f8f5]
/lib/libc.so.6(+0x33af0)[0x7f65d137aaf0]
/usr/bin/mongos(_ZN5mongo16DBConnectionPool11onHandedOutEPNS_12DBClientBaseE+0x20)[0x532ea0]
/usr/bin/mongos(_ZN5mongo15ShardConnection5_initEv+0x1b4)[0x577654]
/usr/bin/mongos(_ZN5mongo15ShardConnectionC1ERKNS_5ShardERKSs+0xa1)[0x577c71]
/usr/bin/mongos(_ZN5mongo8Strategy7doQueryERNS_7RequestERKNS_5ShardE+0x4e)[0x630a9e]
/usr/bin/mongos(_ZN5mongo14SingleStrategy7queryOpERNS_7RequestE+0x4d8)[0x6361c8]
/usr/bin/mongos(_ZN5mongo7Request7processEi+0x29c)[0x66841c]
/usr/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x77)[0x67d187]
Sat Oct 1 05:00:16 CursorCache at shutdown - sharded: 543 passthrough: 0

If this segv appears unrelated to SERVER-3002, please let me know and I'll report more information. Thank you for your time!



 Comments   
Comment by Greg Studer [ 20/Oct/11 ]

backported to 1.8.4

Comment by auto [ 20/Oct/11 ]

Author:

gregstuder (gregs) <greg@10gen.com>

Message: backport of fix for cursor timeout iteration, from jira discussion SERVER-3996
Branch: v1.8
https://github.com/mongodb/mongo/commit/85902e0e826ba543f9ea6021e0d370fcc6dd6854

Comment by Ben Becker [ 17/Oct/11 ]

Carlos,
Interesting, I have not seen those two errors you got after applying the patch... For me, mongos was running for months before some instances started hitting this segfault (on average, 1 in 7 servers hit the segv each day). I no longer get the segfault, but all other log output appears the same as it did before the patch, including some odd cursor-related messages, so I'm starting to suspect this bug is just one symptom of a larger problem.

I'm afraid I don't have much insight to offer at this point, but if time permits I'll dig further into the cursor log entries and report back (or open a new issue).

Comment by Carlos Rodriguez [ 17/Oct/11 ]

Ben, I tried applying the patch in your fork to the 1.8 nightly, but a few hours after switching to the patched mongos I saw a slew of "too many attempts to update config" exceptions from the php driver, which is at 1.2.6 (could be SERVER-3739?). I then tried patching 1.8.3, but upon deploying got flooded with "can't send query" exceptions from the driver. I have not seen either of those errors with the standard 1.8.3 mongos. I'm still getting "killing old cursor" in the mongos log about 10 times a second, with or without the patch. I'm really not sure what to do, other than try upgrading the cluster to 2.0 again (my first attempt had to be reverted due to poor query performance). Seems like every time I get one bug fixed, two more pop up :-/

Comment by Ben Becker [ 17/Oct/11 ]

Just to follow up: All servers have been up and running with the supplied v1.8.2 patch for over a week without incident. The optimization for the v2 branch has not been tested yet.

Comment by Ben Becker [ 07/Oct/11 ]

Submitted a GitHub pull request for the patch to be applied to the v1.8 branch.

Submitted a secondary pull request to replace the original patch for SERVER-3002 with a fix that doesn't require restarting from the beginning on erase().

All testing has been successful with the supplied patch so far, but no testing has been done with the same patch against v2.

Comment by Ben Becker [ 06/Oct/11 ]

Just to update: I've pushed this to two more servers, and the only server that has crashed with the segv in CursorCache::doTimeouts is the unpatched one.

Comment by Ben Becker [ 04/Oct/11 ]

The supplied patch has held up for 3 days now with no trouble so far. It's been running alongside other servers without the patch, and thus far 3 out of 5 non-patched mongos instances have crashed in CursorCache::doTimeouts, and the patched server has been running fine (no noticeable mem leaks, db connections seem stable, etc).

BTW, it would be great to sneak this patch (or a backport of SERVER-3002) into 1.8.4!

Best,
Ben

Comment by Carlos Rodriguez [ 04/Oct/11 ]

I think I'm running into the same bug, which periodically crashes our mongos servers when under high activity.

Mon Oct 3 07:33:40 [cursorTimeout] killing old cursor 5371030602413797610 idle for: 600102ms
Mon Oct 3 07:33:40 [cursorTimeout] killing old cursor 5935414358782317860 idle for: 600322ms
Received signal 11
Backtrace: 0x52f595 0x7f974b81ec20 0x7f974c0739c3 0x673de5 0x5266a3 0x5041eb 0x505aa4 0x6a7f70 0x7f974c314971 0x7f974b8d192d
/usr/bin/mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x52f595]
/lib/libc.so.6(+0x33c20)[0x7f974b81ec20]
/usr/lib/libstdc++.so.6(_ZSt18_Rb_tree_incrementPSt18_Rb_tree_node_base+0x13)[0x7f974c0739c3]
/usr/bin/mongos(_ZN5mongo11CursorCache10doTimeoutsEv+0x75)[0x673de5]
/usr/bin/mongos(_ZN5mongo4task4Task3runEv+0x33)[0x5266a3]
/usr/bin/mongos(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0x12b)[0x5041eb]
/usr/bin/mongos(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS7_9JobStatusEEEEENS2_5list2INS2_5valueIPS7_EENSD_ISA_EEEEEEE3runEv+0x74)[0x505aa4]
/usr/bin/mongos(thread_proxy+0x80)[0x6a7f70]
/lib/libpthread.so.0(+0x7971)[0x7f974c314971]
/lib/libc.so.6(clone+0x6d)[0x7f974b8d192d]
===
Mon Oct 3 07:33:41 CursorCache at shutdown - sharded: 196 passthrough: 7

Upgrading to 2.0 is not an option for us at this point; we tried it and ran into disastrous performance problems. Would love to get this fix into the 1.8 branch.

@Ben, let me know how the patch is faring so far. I may have to use it if 1.8.4 doesn't land soon. Thanks

Comment by Ben Becker [ 02/Oct/11 ]

Please disregard the C++11 comment earlier; it's easier to change the for-loop to a while-loop and post-increment the iterator in the call to erase(). I'm currently testing a custom build from the 1.8.2 tag with the attached patch applied, and will leave it running alongside vanilla v1.8.2 servers to verify that it fixes the reported issue.
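
For reference, a sketch of the while-loop / post-increment idiom described above (this is only an illustration with a placeholder map type and timeout, not the attached patch itself): passing i++ to erase() advances the iterator before the old element is invalidated, and it works pre-C++11 because it does not rely on erase() returning an iterator.

#include <map>

typedef std::map<long long, long long> CursorMap; // placeholder: cursorId -> last access time (ms)

const long long TIMEOUT = 600000; // placeholder timeout, ms

void timeoutCursors( CursorMap& cursors , long long now ) {
    CursorMap::iterator i = cursors.begin();
    while ( i != cursors.end() ) {
        if ( now - i->second > TIMEOUT )
            cursors.erase( i++ ); // i moves to the next element before the old one is erased
        else
            ++i;
    }
}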
