Core Server / SERVER-3996

Backport fix for SERVER-3002 to v1.8 branch

    • Type: Task
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 1.8.4
    • Affects Version/s: 1.8.2
    • Component/s: Sharding
    • Environment:
      Ubuntu 10.04
      EC2 large instance
      mongos v1.8.2 (from 10gen apt source)
      php driver v1.0.11
      mongos is connected to 2 shards along with 7 other instances of mongos on other hosts.

      We seem to be encountering this issue in v1.8.2 and wanted to see about backporting the fix for SERVER-3002 to the v1.8 branch. Looking at the v1.8 branch, I do not see a similar fix for calling the increment operator on an erased iterator: https://github.com/mongodb/mongo/blob/v1.8/s/cursors.cpp#L268.
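
      To illustrate the failure mode, here is a minimal standalone sketch of the v1.8-style loop (the map, value type, and timeout check are placeholders, not the actual CursorCache members): erasing the element an iterator points to invalidates that iterator, so the loop's subsequent ++i is undefined behavior, which matches the _Rb_tree_increment frame in the backtrace below.

      // Minimal sketch of the problematic v1.8-style sweep; names and types are illustrative.
      #include <map>

      typedef std::map<long long, long long> CursorMap; // cursor id -> last-use time (hypothetical)

      void sweepBroken( CursorMap& cursors , long long now , long long timeout ) {
          for ( CursorMap::iterator i = cursors.begin(); i != cursors.end(); ++i ) {
              if ( now - i->second < timeout )
                  continue;
              cursors.erase( i ); // invalidates i ...
          }                       // ... yet the for-loop still executes ++i (undefined behavior)
      }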

      The fix in v2.0 (related to SERVER-3002) starts iterating again from the beginning: https://github.com/mongodb/mongo/blob/v2.0/s/cursors.cpp#L277

      While that fix is perfectly valid and seems like it could easily be backported, I was also curious whether a C++11 feature like map::erase()'s returned iterator is allowed in the current Mongo source. My thinking is that we could replace the fix here:

      _cursors.erase( i );
      i = _cursors.begin(); // possible 2nd entry will get skipped, will get on next pass

      with:

      i = _cursors.erase( i );

      Though perhaps I'm missing another reason why it's better to start from begin()... I assumed the scoped lock on _mutex would prevent any updates to the _cursors map, and it looks like the timeout check uses the 'now' value acquired before we start iterating. Tangentially, it looks like CursorCache::~CursorCache() doesn't acquire the _mutex lock before checking _cursors.size(), though I'm unsure whether this is actually an issue.
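
      To make the two options concrete, here is a small self-contained sketch of both safe variants (again with illustrative names and a placeholder timeout check rather than the real CursorCache internals). The second version relies on C++11, where the associative containers' erase() returns the iterator following the removed element; the C++03 std::map::erase( iterator ) returns void.

      // Sketch of both safe variants; names and types are illustrative, not actual CursorCache code.
      #include <map>

      typedef std::map<long long, long long> CursorMap; // cursor id -> last-use time (hypothetical)

      // v2.0-style fix: restart from begin() after each erase.
      void sweepRestart( CursorMap& cursors , long long now , long long timeout ) {
          for ( CursorMap::iterator i = cursors.begin(); i != cursors.end(); ++i ) {
              if ( now - i->second < timeout )
                  continue;
              cursors.erase( i );
              i = cursors.begin(); // valid again, though ++i will skip one entry until the next pass
              if ( i == cursors.end() )
                  break;           // avoid incrementing end() when the map becomes empty
          }
      }

      // C++11-style fix: advance via the iterator returned by erase(), incrementing only otherwise.
      void sweepEraseReturn( CursorMap& cursors , long long now , long long timeout ) {
          for ( CursorMap::iterator i = cursors.begin(); i != cursors.end(); ) {
              if ( now - i->second < timeout )
                  ++i;
              else
                  i = cursors.erase( i );
          }
      }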

      Anyway, the issue we're seeing just started recently, and has been observed on 5 out of 6 servers that run mongos (all nearly identical EC2 instances). Here's the portion of the log file that contains the error; apologies in advance for the lack of verbose output (it occurs only on our production servers, but I can enable verbose logging if it will help).

      Sat Oct 1 05:00:16 [cursorTimeout] killing old cursor 3588663744748048245 idle for: 600028ms
      Received signal 11
      Backtrace: 0x52f8f5 0x7f65d137aaf0 0x7f65d1bd6533 0x6711c5 0x526a03 0x50454b 0x505e04 0x6a50a0 0x7f65d1e7e9ca 0x7f65d142d70d
      /usr/bin/mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x52f8f5]
      /lib/libc.so.6(+0x33af0)[0x7f65d137aaf0]
      /usr/lib/libstdc++.so.6(_ZSt18_Rb_tree_incrementPSt18_Rb_tree_node_base+0x13)[0x7f65d1bd6533]
      /usr/bin/mongos(_ZN5mongo11CursorCache10doTimeoutsEv+0x75)[0x6711c5]
      /usr/bin/mongos(_ZN5mongo4task4Task3runEv+0x33)[0x526a03]
      /usr/bin/mongos(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0x12b)[0x50454b]
      /usr/bin/mongos(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS7_9JobStatusEEEEENS2_5list2INS2_5valueIPS7_EENSD_ISA_EEEEEEE3runEv+0x74)[0x505e04]
      /usr/bin/mongos(thread_proxy+0x80)[0x6a50a0]
      /lib/libpthread.so.0(+0x69ca)[0x7f65d1e7e9ca]
      /lib/libc.so.6(clone+0x6d)[0x7f65d142d70d]
      ===
      Received signal 11
      Backtrace: 0x52f8f5 0x7f65d137aaf0 0x532ea0 0x577654 0x577c71 0x630a9e 0x6361c8 0x66841c 0x67d187 0x580b7c 0x6a50a0 0x7f65d1e7e9ca 0x7f65d142d70d
      /usr/bin/mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x52f8f5]
      /lib/libc.so.6(+0x33af0)[0x7f65d137aaf0]
      /usr/bin/mongos(_ZN5mongo16DBConnectionPool11onHandedOutEPNS_12DBClientBaseE+0x20)[0x532ea0]
      /usr/bin/mongos(_ZN5mongo15ShardConnection5_initEv+0x1b4)[0x577654]
      /usr/bin/mongos(_ZN5mongo15ShardConnectionC1ERKNS_5ShardERKSs+0xa1)[0x577c71]
      /usr/bin/mongos(_ZN5mongo8Strategy7doQueryERNS_7RequestERKNS_5ShardE+0x4e)[0x630a9e]
      /usr/bin/mongos(_ZN5mongo14SingleStrategy7queryOpERNS_7RequestE+0x4d8)[0x6361c8]
      /usr/bin/mongos(_ZN5mongo7Request7processEi+0x29c)[0x66841c]
      /usr/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x77)[0x67d187]
      Sat Oct 1 05:00:16 CursorCache at shutdown - sharded: 543 passthrough: 0

      If this segv appears unrelated to SERVER-3002, please let me know and I'll report more information. Thank you for your time!

            Assignee: Gregory McKeon
            Reporter: Ben Becker
            Votes: 3
            Watchers: 2