[SERVER-5302] Mongos Process Dying Signal 11
Created: 14/Mar/12  Updated: 15/Aug/12  Resolved: 20/Mar/12

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Critical - P2
Reporter: a Rob
Assignee: Randolph Tan
Resolution: Duplicate
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

EC2 Linux 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:34:28 EST 2008 x86_64 x86_64 x86_64 GNU/Linux, running paster servers, pylons web framework, nginx as a reverse proxy


Attachments: Text File mongo2.log    
Issue Links:
Depends
depends on SERVER-5110 ReplicaSetMonitor::check not thread s... Closed
Operating System: ALL
Participants:

 Description   

The mongos process repeatedly dies from an unknown cause, usually when the system is under load. The log files show a SIGSEGV (signal 11).

Possibly related to SERVER-4699, though we're already running 2.0.3.



 Comments   
Comment by Ian Whalen (Inactive) [ 30/Jul/12 ]

If that's the case, please open a new server ticket with a description and the relevant details.

Comment by Travis Reeder [ 30/Jul/12 ]

Ok, thanks Ian. We're having bad problems with mongos failing right now under heavy load with 2.0.6.

Comment by Ian Whalen (Inactive) [ 30/Jul/12 ]

@travis, it did. You can check SERVER-5110 for the relevant commits on GitHub.

Comment by Travis Reeder [ 30/Jul/12 ]

Did this fix make it into 2.0.5 / 2.0.6?

Comment by Eliot Horowitz (Inactive) [ 20/Mar/12 ]

The actual bug is SERVER-5110.
Working on a fix for 2.0.5.

Comment by Andrew Levy [ 20/Mar/12 ]

I just want to emphasize how critical this issue is – we can't maintain our infrastructure with our mongos instances dying. It puts more stress on our other application servers which eventually die as the load becomes too much to handle. Do you have any estimate on a fix? Thanks!

Comment by Randolph Tan [ 19/Mar/12 ]

It looks like this is related to https://jira.mongodb.org/browse/SERVER-5110.

Comment by a Rob [ 19/Mar/12 ]

yes

Comment by Randolph Tan [ 19/Mar/12 ]

Sorry for being unclear. What I meant to ask was whether you were also seeing "got not master" in the other crashes.

Comment by a Rob [ 19/Mar/12 ]

Yes, right before the signal 11, there is a:
Sat Mar 3 04:34:45 [conn459471] got not master for: shard6

We've experienced this crash many times.

We're also not using a binary we built ourselves. We just renamed it to mongos32.
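
The pattern described above (a "got not master" line shortly before each signal 11) can be checked across a whole log with a short script. This is a minimal sketch and not part of the original report: the function name, the 20-line window, and the "signal 11" crash marker text are assumptions, since the exact wording of the crash line in mongo2.log is not shown in this ticket.

```python
def crashes_preceded_by(lines, marker="got not master",
                        crash="signal 11", window=20):
    """For each line containing `crash`, report whether `marker`
    appears in the `window` lines immediately before it.

    Returns a list of (line_index, marker_seen) tuples.
    Matching is case-insensitive; both strings are assumptions
    about the log wording and can be overridden.
    """
    marker, crash = marker.lower(), crash.lower()
    results = []
    for i, line in enumerate(lines):
        if crash in line.lower():
            recent = lines[max(0, i - window):i]
            results.append((i, any(marker in r.lower() for r in recent)))
    return results


# First line is quoted from the comment above; the second is a
# hypothetical crash line used only for illustration.
sample = [
    "Sat Mar 3 04:34:45 [conn459471] got not master for: shard6",
    "Received signal 11",
]
print(crashes_preceded_by(sample))
```

Running it over every mongos log would show whether all crashes, or only some, follow a "got not master" message.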

Comment by Randolph Tan [ 19/Mar/12 ]

Hi,

Have you seen this kind of crash more than once? If so, do you also see "got not master" in the logs just before the crash?

Comment by a Rob [ 19/Mar/12 ]

# addr2line -fC -e mongos32 0x225420 0x8366ae2 0x840682b 0x8406fe7 0x841aa13 0x8223df9 0x367762 0x44dd7e
    ??
    ??:0
    std::_Rb_tree<std::string, std::pair<std::string const, mongo::DBConfig::CollectionInfo>, std::_Select1st<std::pair<std::string const, mongo::DBConfig::CollectionInfo> >, std::less<std::string>, std::allocator<std::pair<std::string const, mongo::DBConfig::CollectionInfo> > >::erase(std::string const&)
    ??:0
    mongo::dbgrid_pub_cmds::CountCmd::run(std::string const&, mongo::BSONObj&, int, std::string&, mongo::BSONObjBuilder&, bool)
    ??:0
    mongo::dbgrid_pub_cmds::CountCmd::run(std::string const&, mongo::BSONObj&, int, std::string&, mongo::BSONObjBuilder&, bool)
    ??:0
    mongo::NoAdminAccess::getAdminUser(std::string const&) const
    ??:0
    mongo::Command::logIfSlow(mongo::Timer const&, std::string const&)
    ??:0
    ??
    ??:0
    ??
    ??:0
# addr2line -fC -e mongos32 0x225420 0x81340aa 0x813df94 0x813eb9c 0x8144766 0x8387e63 0x83bb211 0x840717e 0x841aa23 0x8223df9 0x367762 0x44dd7e
    ??
    ??:0
    mongo::FieldRangeSet::pattern(mongo::BSONObj const&) const
    ??:0
    mongo::FieldRangeVector::FieldRangeVector(mongo::FieldRangeSet const&, mongo::IndexSpec const&, int)
    ??:0
    mongo::FieldRangeVector::FieldRangeVector(mongo::FieldRangeSet const&, mongo::IndexSpec const&, int)
    ??:0
    mongo::FieldRange::FieldRange(mongo::BSONElement const&, bool, bool, bool)
    ??:0
    mongo::ChunkManager::findChunk(mongo::BSONObj const&) const
    ??:0
    mongo::Chunk::getShard() const
    ??:0
    mongo::dbgrid_pub_cmds::CountCmd::run(std::string const&, mongo::BSONObj&, int, std::string&, mongo::BSONObjBuilder&, bool)
    ??:0
    mongo::NoAdminAccess::getAdminUser(std::string const&) const
    ??:0
    mongo::Command::logIfSlow(mongo::Timer const&, std::string const&)
    ??:0
    ??
    ??:0
    ??
    ??:0

Comment by Randolph Tan [ 17/Mar/12 ]

It looks like you built your own mongos binary. Is that correct? Can you try running the following against that binary?

addr2line -fC -e mongos 0x225420 0x8366ae2 0x840682b 0x8406fe7 0x841aa13 0x8223df9 0x367762 0x44dd7e
addr2line -fC -e mongos 0x225420 0x81340aa 0x813df94 0x813eb9c 0x8144766 0x8387e63 0x83bb211 0x840717e 0x841aa23 0x8223df9 0x367762 0x44dd7e

Thanks!
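
For readers reproducing this step: the addresses in the commands above are the stack frame addresses from the crash backtrace. The following is a minimal sketch of building the same addr2line invocation from a log line, assuming the log contains a line of space-separated hex addresses. The "Backtrace:" prefix in the sample line and the helper name are hypothetical; only the addr2line flags are taken from the commands above.

```python
import re
import shlex

HEX_ADDR = re.compile(r"0x[0-9a-fA-F]+")


def addr2line_command(backtrace_line, binary="mongos"):
    """Build an addr2line argv list from a line of hex frame addresses."""
    addrs = HEX_ADDR.findall(backtrace_line)
    if not addrs:
        raise ValueError("no hex addresses found in line")
    # -f prints function names, -C demangles C++ symbols,
    # -e names the executable whose addresses are being resolved.
    return ["addr2line", "-fC", "-e", binary] + addrs


# Hypothetical log line; the addresses are the first three from this ticket.
line = "Backtrace: 0x225420 0x8366ae2 0x840682b"
print(shlex.join(addr2line_command(line, binary="mongos32")))
```

The binary passed with -e has to be the exact build that crashed, otherwise the addresses resolve to the wrong symbols.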

Comment by a Rob [ 15/Mar/12 ]

Sorry, we're running both: the 32-bit binary on a 32-bit system and the 64-bit binary on a 64-bit system. The attached logs are from a 32-bit system:

System:
Linux 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36 EST 2008 i686 i686 i386 GNU/Linux

Binary:
Thu Mar 15 12:09:08 ./mongos32 db version v2.0.3, pdfile version 4.5 starting (--help for usage)
Thu Mar 15 12:09:08 git version: 05bb8aa793660af8fce7e36b510ad48c27439697
Thu Mar 15 12:09:08 build info: Linux domU-12-31-39-01-70-B4 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36 EST 2008 i686 BOOST_LIB_VERSION=1_41
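
Whether a given binary is a 32-bit or 64-bit build can also be confirmed independently of the host kernel by reading its ELF header, where the fifth byte (EI_CLASS) is 1 for 32-bit and 2 for 64-bit. This is a minimal sketch; the function names are illustrative and the path in the usage comment is just the file name mentioned in this ticket.

```python
def elf_class(header: bytes) -> str:
    """Classify an ELF file from its first five bytes."""
    if len(header) < 5 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # Byte 4 is EI_CLASS: 1 = ELFCLASS32, 2 = ELFCLASS64.
    return {1: "32-bit", 2: "64-bit"}.get(header[4], "unknown")


def binary_class(path: str) -> str:
    """Read the start of the file at `path` and classify it."""
    with open(path, "rb") as f:
        return elf_class(f.read(5))


# Usage, assuming the binary is in the current directory:
#   print(binary_class("./mongos32"))
```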

Comment by Randolph Tan [ 15/Mar/12 ]

Hi, can you provide us with the exact OS and the version of mongos you were using for the attached log? Specifically, are you using the rc releases? You listed the environment as 64-bit Linux, but it appears that you are using the 32-bit binary. Is that correct?
