[SERVER-3755] mongos died unexpectedly Created: 03/Sep/11 Updated: 11/Jul/16 Resolved: 22/Nov/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability |
| Affects Version/s: | 1.8.3 |
| Fix Version/s: | 2.1.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Theo Hultberg | Assignee: | Greg Studer |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
Without any warnings or errors, two out of four mongos processes in our cluster suddenly died. All I can see in the mongos logs is "Received signal 6".
|
| Comments |
| Comment by Greg Studer [ 22/Nov/11 ] |
|
There are lots of fixes for collection change issues now. |
| Comment by Eliot Horowitz (Inactive) [ 03/Sep/11 ] |
|
I'm not sure a flushRouterConfig would solve everything; it might just change things once again. I would really, really recommend what I said before: either not dropping via mongos, or using 2.0.0-rc1. |
| Comment by Theo Hultberg [ 03/Sep/11 ] |
|
About the use case: we do ad analytics, and do somewhere around 10K inserts/s into a cluster of three shards on EC2. The data is not important after 24 hours, so to avoid filling up disks we want to throw it out. This can't be done without fragmenting the database, so we create partitions for each day and drop old partitions. The partitions are set up for sharding, and all chunks are created up front. The balancer is off because otherwise the slaves wouldn't have a chance to keep up: the primaries can't deliver data to both the secondaries and other shards (until last week we used high memory quadruple extra large EC2 instances, and even they couldn't keep up). The partitioning creates and drops databases, but we keep track of which databases exist and make sure not to try to write to any that don't.
The problem is
It's been a long way to this setup (including a health check with Kyle; some of the ideas came out of that), and we thought we finally had something that would work when we ran straight into the can't-drop-databases-because-everything-dies bugs described here.
Come to MongoUK September 19th and listen to my talk; I'll explain it all. |
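For readers following along, the per-day partitioning described above could be sketched roughly as below. This is only an illustration under assumptions: the `events_` prefix, the `YYYYMMDD` suffix, and the function names are hypothetical and do not come from the reporter's application. It shows how date-based database names make it easy to decide which partitions are past the 24-hour window and to refuse writes to databases that are not known to exist.

```python
from datetime import date, datetime, timedelta

# Hypothetical naming scheme: one database per day, e.g. "events_20110903".
PREFIX = "events_"


def partition_name(day: date) -> str:
    """Build the database name for the given day."""
    return f"{PREFIX}{day:%Y%m%d}"


def partition_day(name: str) -> date:
    """Recover the day from a partition database name."""
    return datetime.strptime(name[len(PREFIX):], "%Y%m%d").date()


def expired_partitions(existing, today: date, keep_days: int = 1):
    """Return the partitions that lie entirely outside the retention window.

    With keep_days=1, yesterday's partition is kept because part of its
    data is still less than 24 hours old.
    """
    cutoff = today - timedelta(days=keep_days)
    return sorted(name for name in existing if partition_day(name) < cutoff)


def can_write(existing, day: date) -> bool:
    """Only write to partitions that are known to exist."""
    return partition_name(day) in existing
```

Because each name embeds the date, a dropped partition's name is never needed again, which fits the advice elsewhere in this ticket not to re-use a database.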
| Comment by Eliot Horowitz (Inactive) [ 03/Sep/11 ] |
|
Very sorry you ran into issues. The quickest thing to do is what I mentioned in another ticket: not doing dropDatabase via mongos, dropping via the shards directly, and never re-using a db. Though I do want to get you onto 2.0.0 soon. There might be an issue, but overall it should be much more stable. |
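A rough sketch of that workaround, under stated assumptions, might look like the following. The `RetiredNames` class and `drop_on_shards` function are hypothetical helpers, not part of MongoDB or any driver. `shard_clients` is assumed to be a list of objects exposing `drop_database(name)`, for example driver clients connected directly to each shard rather than to mongos. The sketch does not touch any cluster metadata; whether that is needed is outside its scope.

```python
class RetiredNames:
    """Remembers every database name that was dropped so it is never re-used."""

    def __init__(self):
        self._retired = set()

    def retire(self, name):
        self._retired.add(name)

    def check_unused(self, name):
        """Raise if the name belonged to a database that was already dropped."""
        if name in self._retired:
            raise ValueError(f"database name {name!r} was dropped and must not be re-used")


def drop_on_shards(shard_clients, db_name, retired):
    """Drop db_name on every shard directly, then mark the name as retired.

    Assumption: each client talks to one shard directly, not through mongos.
    """
    for client in shard_clients:
        client.drop_database(db_name)
    retired.retire(db_name)
```

An application would call `retired.check_unused(name)` before creating a new database, so that a retired name is rejected instead of silently re-created.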
| Comment by Theo Hultberg [ 03/Sep/11 ] |
|
I'm very wary of upgrading just to see if it solves the problem. We've been through 1.8.0, 1.8.1, 1.8.2, some RCs in between, and now we're using 1.8.3. All have had critical sharding bugs, and upgrading has solved some but also brought new ones. Just the other day I reported
On the other hand, 1.8.3 isn't viable either, so I'm seriously considering not using sharding at all at this point. The alternative is doing application-side sharding or a simple consistent hashing solution. In fact we've already turned off the balancer and create all chunks beforehand, so it's not too far from what we already do.
I'm sorry to whine. I really, really want to use Mongo, and I like it a lot, minus the sharding bugs. You're a star for answering bug reports early on a Saturday. |
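For context, the "simple consistent hashing solution" mentioned above could look something like this minimal sketch. It is an illustration only, not the reporter's code: the `HashRing` class, the shard names, and the choice of SHA-256 with 64 virtual points per node are all assumptions. Each key maps to the first virtual point clockwise from its hash, so removing a node only moves the keys that lived on that node.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring with virtual points (hypothetical helper)."""

    def __init__(self, nodes, replicas=64):
        nodes = list(nodes)
        if not nodes:
            raise ValueError("at least one node is required")
        # Place `replicas` virtual points per node on the ring, sorted by hash.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        """Return the node owning the first point after the key's hash."""
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._ring[idx][1]
```

An application would then pick a connection with something like `HashRing(["shard-a", "shard-b", "shard-c"]).node_for(document_key)` and write to that server directly, taking over the routing that mongos would otherwise do.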
| Comment by Eliot Horowitz (Inactive) [ 03/Sep/11 ] |
|
I would highly, highly recommend trying 2.0.0-rc1: all your problems are related to dropDatabase, and it should address all of them. |
| Comment by Eliot Horowitz (Inactive) [ 03/Sep/11 ] |
|
This is definitely caused by dropping the database and re-using it. Can you describe the use case a bit more? |
| Comment by David Tollmyr [ 03/Sep/11 ] |
|
We got several messages like "too many attempts to update config, failing". |
| Comment by Theo Hultberg [ 03/Sep/11 ] |
|
Now every time I restart my application at least one mongos dies. I found this in one of the mongos logs since the last restart:
but it doesn't look like it's the cause of the death; things happen after that. |