[SERVER-3755] mongos died unexpectedly Created: 03/Sep/11  Updated: 11/Jul/16  Resolved: 22/Nov/11

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 1.8.3
Fix Version/s: 2.1.0

Type: Bug Priority: Major - P3
Reporter: Theo Hultberg Assignee: Greg Studer
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-3754 Seeing lots of setShardVersion failed Closed
is duplicated by SERVER-4850 SetShardVersion failed: client versio... Closed
Related
related to SERVER-3753 "db config reload failed" Closed
Operating System: ALL
Participants:

 Description   

Without any warnings or errors, two out of four mongos processes in our cluster suddenly died; all I can see in the mongos logs is "Received signal 6".

Sat Sep  3 07:22:44 [conn5] ns: fragments_20110831.exposure_fragments Strategy::doQuery attempt: 0
Sat Sep  3 07:22:44 [conn5] ns: fragments_20110831.exposure_fragments Strategy::doQuery attempt: 1
Sat Sep  3 07:22:45 [conn5] ns: fragments_20110831.exposure_fragments Strategy::doQuery attempt: 2
Sat Sep  3 07:22:47 [conn5] ns: fragments_20110831.exposure_fragments Strategy::doQuery attempt: 3
Received signal 6
Backtrace: 0x52f595 0x7f7e017e0af0 0x7f7e017e0a75 0x7f7e017e45c0 0x7f7e017d9941 0x69eb1c 0x5041eb 0x505aa4 0x6a7f70 0x
/opt/mongodb-1.8.3/bin/mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x52f595]
/lib/libc.so.6(+0x33af0)[0x7f7e017e0af0]
/lib/libc.so.6(gsignal+0x35)[0x7f7e017e0a75]
/lib/libc.so.6(abort+0x180)[0x7f7e017e45c0]
/lib/libc.so.6(__assert_fail+0xf1)[0x7f7e017d9941]
/opt/mongodb-1.8.3/bin/mongos(_ZN5mongo17WriteBackListener3runEv+0x162c)[0x69eb1c]
/opt/mongodb-1.8.3/bin/mongos(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0x12b)[0x5041eb]
/opt/mongodb-1.8.3/bin/mongos(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10
/opt/mongodb-1.8.3/bin/mongos(thread_proxy+0x80)[0x6a7f70]
/lib/libpthread.so.0(+0x69ca)[0x7f7e022e49ca]
/lib/libc.so.6(clone+0x6d)[0x7f7e0189370d]
===
Received signal 6
Backtrace: 0x52f595 0x7f7e017e0af0 0x7f7e017e0a75 0x7f7e017e45c0 0x7f7e017d9941 0x530655 0x5c1ca3 0x581601 0x7f7e022e4

Sat Sep  3 07:21:43 [conn6] creating ChunkManager ns: _id: "complete_20110902.exposures" took: 8ms sequenceNumber: 15
Sat Sep  3 07:21:43 [conn6] creating ChunkManager ns: _id: "complete_20110902.pageviews" took: 2ms sequenceNumber: 16
Received signal 6
Backtrace: 0x52f595 0x7f851b1f0af0 0x7f851b1f0a75 0x7f851b1f45c0 0x7f851b1e9941 0x69eb1c 0x5041eb 0x505aa4 0x6a7f70 0x
/opt/mongodb-1.8.3/bin/mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x52f595]
/lib/libc.so.6(+0x33af0)[0x7f851b1f0af0]
/lib/libc.so.6(gsignal+0x35)[0x7f851b1f0a75]
/lib/libc.so.6(abort+0x180)[0x7f851b1f45c0]
/lib/libc.so.6(__assert_fail+0xf1)[0x7f851b1e9941]
/opt/mongodb-1.8.3/bin/mongos(_ZN5mongo17WriteBackListener3runEv+0x162c)[0x69eb1c]
/opt/mongodb-1.8.3/bin/mongos(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0x12b)[0x5041eb]
/opt/mongodb-1.8.3/bin/mongos(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10
/opt/mongodb-1.8.3/bin/mongos(thread_proxy+0x80)[0x6a7f70]
/lib/libpthread.so.0(+0x69ca)[0x7f851bcf49ca]
/lib/libc.so.6(clone+0x6d)[0x7f851b2a370d]
===
Received signal 11
Backtrace: 0x52f595 0x7f851b1f0af0 0x532b40 0x57affc 0x5778f5 0x577da8 0x633240 0x63d6c9 0x63fec0 0x66b032 0x67fda7 0x
/opt/mongodb-1.8.3/bin/mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x52f595]
/lib/libc.so.6(+0x33af0)[0x7f851b1f0af0]
/opt/mongodb-1.8.3/bin/mongos(_ZN5mongo16DBConnectionPool11onHandedOutEPNS_12DBClientBaseE+0x20)[0x532b40]
/opt/mongodb-1.8.3/bin/mongos(_ZN5mongo17ClientConnections3getERKSsS2_b+0x1ac)[0x57affc]
/opt/mongodb-1.8.3/bin/mongos(_ZN5mongo15ShardConnection5_initEb+0x65)[0x5778f5]
Received signal 11
Backtrace: /opt/mongodb-1.8.3/bin/mongos(_ZN5mongo15ShardConnectionC1ERKNS_5ShardERKSsb+0xa8)[0x577da8]
/opt/mongodb-1.8.3/bin/mongos(_ZN5mongo8Strategy6insertERKNS_5ShardEPKcRKNS_7BSONObjE+0x60)[0x633240]
/opt/mongodb-1.8.3/bin/mongos(_ZN5mongo13ShardStrategy7_insertERNS_7RequestERNS_9DbMessageEN5boost10shared_ptrINS_12Ch
/opt/mongodb-1.8.3/bin/mongos(_ZN5mongo13ShardStrategy7writeOpEiRNS_7RequestE+0x260)[0x63fec0]
/opt/mongodb-1.8.3/bin/mongos(_ZN5mongo7Request7processEi+0x172)[0x66b032]
/opt/mongodb-1.8.3/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9L
/opt/mongodb-1.8.3/bin/mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x34d)[0x5815ed]
/lib/libpthread.so.0(+0x69ca)[0x7f851bcf49ca]
/lib/libc.so.6(clone+0x6d)[0x7f851b2a370d]
===
0x52f595 0x7f851b1f0af0



 Comments   
Comment by Greg Studer [ 22/Nov/11 ]

Lots of fixes for collection change issues now.

Comment by Eliot Horowitz (Inactive) [ 03/Sep/11 ]

I'm not sure a flushRouterConfig would solve everything; it just might change it once again.

I would really, really recommend what I said before: either not dropping via mongos, or using 2.0.0-rc1.

Comment by Theo Hultberg [ 03/Sep/11 ]

About the use case: we do ad analytics, with somewhere around 10K inserts/s into a cluster of three shards on EC2. The data is not important after 24 hours, so to avoid filling up disks we want to throw it out. This can't be done without fragmenting the database, so we create partitions for each day and drop old partitions. The partitions are set up for sharding, and all chunks are created up front. The balancer is off, because otherwise the slaves wouldn't have a chance to keep up: the primaries can't deliver data to both secondaries and other shards (and until last week we used High-Memory Quadruple Extra Large EC2 instances; even they couldn't keep up).
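A pre-split, balancer-off setup like the one described above can be sketched in the mongo shell. All database, collection, and boundary names here are hypothetical examples, and since the sh.* helper functions didn't exist in the 1.8 era this uses the raw commands:

```
// Run against a mongos. Names below are illustrative, not from this cluster.
// Enable sharding for the new day's partition and shard the collection:
db.adminCommand({ enableSharding: "fragments_20110903" });
db.adminCommand({ shardCollection: "fragments_20110903.exposure_fragments",
                  key: { _id: 1 } });

// Create all chunks up front by splitting at pre-computed boundaries,
// so no splits or migrations happen under the write load:
db.adminCommand({ split: "fragments_20110903.exposure_fragments",
                  middle: { _id: "boundary-value-1" } });

// Disable the balancer by setting the flag in the config database
// (the pre-2.0 equivalent of the later sh.stopBalancer() helper):
use config
db.settings.update({ _id: "balancer" }, { $set: { stopped: true } }, true);
```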

The partitioning creates and drops databases, but we keep track of which databases exist and make sure not to try to write to any that don't. The problem is SERVER-1726, which makes it look like previously dropped databases are still around. Perhaps dropping the databases through each shard directly could work, but wouldn't mongos still report the database as existing? Would a flushRouterConfig take care of that?
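As a sketch of that alternative, and under the assumption that flushRouterConfig clears only the mongos routing cache (the question above about the config metadata remains open), dropping through each shard directly might look like the following; the shard hostnames are hypothetical, patterned on the ones in the logs:

```
# Drop the day's database on each shard's primary directly, bypassing mongos:
mongo richcolldb01:27017/fragments_20110831 --eval "db.dropDatabase()"
mongo richcolldb02:27017/fragments_20110831 --eval "db.dropDatabase()"

# Then ask each mongos to discard its cached routing metadata:
mongo <mongos-host>:27017/admin --eval "db.runCommand({ flushRouterConfig: 1 })"
```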

It's been a long way to this setup (including a health check with Kyle; some of the ideas came out of that), and we thought we finally had something that would work when we ran straight into the can't-drop-databases-because-everything-dies bugs described here.

Come to MongoUK September 19th and listen to my talk; I'll explain it all.

Comment by Eliot Horowitz (Inactive) [ 03/Sep/11 ]

Very sorry you ran into issues.
I think the new ones were mostly just different ways the root issue was handled.
The main issue is that dropDatabase wasn't working correctly.
It manifested differently in every version, so upgrading never really solved anything, just made it appear different.

The quickest thing to do is what I mentioned in another ticket: not doing dropDatabase via mongos, dropping via each shard directly, and never re-using a db.

Though I do want to get you onto 2.0.0 soon. There might be an issue, but overall it should be much more stable.

Comment by Theo Hultberg [ 03/Sep/11 ]

I'm very wary of upgrading just to see if it solves the problem. We've been through 1.8.0, 1.8.1, 1.8.2, some RCs in between, and now we're using 1.8.3. All have had critical sharding bugs, and upgrading has solved some but also brought new ones.

Just the other day I reported SERVER-3739, for which the suggested fix was to update to 1.8.3; now that we're at 1.8.3 the suggested fix is upgrading to 2.0.0-RC1. I'm sure 2.0 fixes a lot of problems, but considering what we've been through over the last few months, I'm sure there will also be a number of new showstoppers.

On the other hand, 1.8.3 isn't viable either, so I'm seriously considering not using sharding at all at this point. The alternative is doing application-side sharding or a simple consistent-hashing solution. In fact, we've already turned off the balancer and create all chunks beforehand, so it's not too far from what we already do.

I'm sorry to whine. I really, really want to use Mongo, and I like it a lot, minus the sharding bugs. You're a star for answering bug reports early on a Saturday.

Comment by Eliot Horowitz (Inactive) [ 03/Sep/11 ]

I would highly, highly recommend trying 2.0.0-rc1; all your problems are related to dropDatabase, and it should address all of them.

Comment by Eliot Horowitz (Inactive) [ 03/Sep/11 ]

This is definitely caused by dropping the database and re-using.

Can you describe the use case a bit more?

Comment by David Tollmyr [ 03/Sep/11 ]

We got several messages like "too many attempts to update config, failing".
Tried restarting config servers, no change. Restarted mongos after that, no change.
Finally bounced the primaries of the shard replica sets and now everything seems to work again (for now).

Comment by Theo Hultberg [ 03/Sep/11 ]

Now every time I restart my application at least one mongos dies. I found this in one of the mongos logs since the last restart:

Sat Sep  3 08:15:48 [conn5] warning: adding shard sub-connection richcolldb02 (parent richcollshard1/richcolldb02,rich
0x57b422 0x5778f5 0x577afd 0x66fede 0x640ea5 0x57e97c 0x638b52 0x66b15c 0x67fda7 0x5815ed 0x7f04ef3d69ca 0x7f04ee98570
 /opt/mongodb-1.8.3/bin/mongos(_ZN5mongo17ClientConnections3getERKSsS2_b+0x5d2) [0x57b422]
 /opt/mongodb-1.8.3/bin/mongos(_ZN5mongo15ShardConnection5_initEb+0x65) [0x5778f5]
 /opt/mongodb-1.8.3/bin/mongos(_ZN5mongo15ShardConnectionC1ERKSsS2_b+0x7d) [0x577afd]
 /opt/mongodb-1.8.3/bin/mongos(_ZN5mongo10ClientInfo12getLastErrorERKNS_7BSONObjERNS_14BSONObjBuilderEb+0x249e) [0x66f
 /opt/mongodb-1.8.3/bin/mongos(_ZN5mongo11dbgrid_cmds23CmdShardingGetLastError3runERKSsRNS_7BSONObjERSsRNS_14BSONObjBu
 /opt/mongodb-1.8.3/bin/mongos(_ZN5mongo7Command20runAgainstRegisteredEPKcRNS_7BSONObjERNS_14BSONObjBuilderE+0x67c) [0
 /opt/mongodb-1.8.3/bin/mongos(_ZN5mongo14SingleStrategy7queryOpERNS_7RequestE+0x262) [0x638b52]
 /opt/mongodb-1.8.3/bin/mongos(_ZN5mongo7Request7processEi+0x29c) [0x66b15c]
 /opt/mongodb-1.8.3/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9
 /opt/mongodb-1.8.3/bin/mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x34d) [0x5815ed]
 /lib/libpthread.so.0(+0x69ca) [0x7f04ef3d69ca]
 /lib/libc.so.6(clone+0x6d) [0x7f04ee98570d]

but it doesn't look like this is the cause of the death; things happen after that.

Generated at Thu Feb 08 03:03:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.