[SERVER-5433] Stale config and unable to move chunks Created: 28/Mar/12 Updated: 15/Aug/12 Resolved: 04/Apr/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.0.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Preetham Derangula | Assignee: | Greg Studer |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | Linux |
| Participants: |
| Description |
|
History |
| Comments |
| Comment by Greg Studer [ 04/Apr/12 ] | |
|
Closing for now, feel free to open new question tickets or use the groups for new issues. | |
| Comment by Greg Studer [ 03/Apr/12 ] | |
|
Generally mongoses are placed on the app servers - this distributes the (small) mongos load evenly as you add new app servers. Generally shards are optimized for low latency and high I/O, and react poorly to competing for resources with other processes, so ideally they would be standalone as well. Config servers need very low latency but don't store much data, so they should be on centrally located machines with high bandwidth. | |
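For reference, a minimal sketch of that layout expressed as startup commands (the hostnames, ports, and paths below are placeholders, not taken from this ticket):

    # shard servers on dedicated machines with fast disks
    mongod --shardsvr --port 40005 --dbpath /data/shard5 --logpath /var/log/mongodb/shard5.log --fork
    # three config servers on well-connected, centrally located machines
    mongod --configsvr --port 27019 --dbpath /data/configdb --logpath /var/log/mongodb/config.log --fork
    # one mongos per app server, pointed at all three config servers
    mongos --configdb cfg1.example.com:27019,cfg2.example.com:27019,cfg3.example.com:27019 --port 27017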
| Comment by Preetham Derangula [ 02/Apr/12 ] | |
|
>"Config servers only hold metadata - and they shouldn't be handling migrations (unless your config server is also a shard - not recommended). How exactly is your cluster set up, and on what hardware? You'll need to make sure that your ulimits are correctly set for fds and processes on every server, and the config servers (you should use three) should ideally be on dedicated machines. All the mongoses will need to be able to communicate with all the config servers and shards, and all shards will need to be able to communicate with each other and the config servers" | |
| Comment by Greg Studer [ 30/Mar/12 ] | |
|
> If all the config files are deleted and the whole shard network is started, would it function normally? | |
| Comment by Preetham Derangula [ 30/Mar/12 ] | |
|
Thanks for the advice! | |
| Comment by Greg Studer [ 30/Mar/12 ] | |
This error is pretty clear - you're out of disk space on one of the machines, and therefore writes, and indirectly the balancer, have been stopped. The balancer has probably been stopped for a while for other reasons, which is why data is building up differently between different shards.

db.printShardingStatus() failing with a socket exception still indicates there are connectivity problems between machines. This may be why balancing stopped and your data built up in the first place.

To get back to normal, you need to increase the disk space on the "full" machine and track down the connectivity problem (the mongos/mongod logs can help you here). Then you need to let the data in the sharded collection balance. Be aware that the balancing rate can/will be slower than the rate of inserts if you're pushing lots of data into MongoDB, so you'll need to monitor your system to determine if this is occurring. MMS (mms.10gen.com) is useful for these kinds of tasks. | |
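A minimal sketch of those checks, run from a mongos shell (only standard config-database metadata is assumed; nothing here is specific to this cluster):

    // is the balancer switched off?
    use config
    db.settings.find({ _id: "balancer" })   // { stopped: true } means balancing is disabled
    db.locks.find({ _id: "balancer" })      // shows whether a balancing round currently holds the lock
    // how chunks are currently distributed across shards
    db.printShardingStatus()
    // then, on each shard's mongod, db.stats() per database reports dataSize/storageSize/fileSize,
    // which is where the "full" machine should stand out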
| Comment by Preetham Derangula [ 30/Mar/12 ] | |
|
Now I have checked the disk space allotted on all the machines. One of the machines seems to have a lot of data on it compared with the others, and I see a problem related to that in the log:

    errmsg: "moveChunk failed to engage TO-shard in the data transfer: db assertion failure", ok: 0.0 } from: shard6 to: shard5 chunk: { id: "audit.tsauditentry-UID"23850964410"", lastmod: Timestamp 83000|1, ns: "audit.tsauditentry", min: { UID: "23850964410" }, max: { UID: "23851209616" }, shard: "shard6" }

db.stats output: [illegible in this export]

Thanks a lot | |
| Comment by Greg Studer [ 30/Mar/12 ] | |
|
Also, while 2.1.0 is a dev build and not recommended for production, the latest stable mongodb is 2.0.4, so I'd use that for new systems. | |
| Comment by Greg Studer [ 30/Mar/12 ] | |
|
> We had to turn off configs on other machines as the router was throwing stale config exceptions and we wanted to get rid of the config machine that's being complained about.

That's not what the stale config exceptions mean - they're normal in most cases (though they shouldn't be showing up in your app) and refer to the shards, not to the config servers. Basically they mean that the mongos needs to refresh its config info, which happens from time to time.

> Question: Is it possible to thrash config and start from scratch without losing the data? Is that a right direction?

I think the next step here is to restart all your mongos and mongod processes, and then run db.printShardingStatus() from the mongos. I don't know what hardware you're running on either, but if the mongod shards and config server are competing for resources on a limited system this could cause strange issues. | |
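As a side note, and not something prescribed in this thread: if a mongos appears to be holding on to stale metadata, the flushRouterConfig admin command should make it drop its cached config info and re-read it from the config servers, roughly:

    // run against the mongos, not a shard
    use admin
    db.runCommand({ flushRouterConfig: 1 })
    db.printShardingStatus()   // re-reads the metadata from the config servers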
| Comment by Preetham Derangula [ 30/Mar/12 ] | |
|
We initially had a shard, router, and config running on the same machine, all of them on Linux boxes - each machine is a VM. After seeing some problems on the router, which I explained earlier in the issue, we turned off the configs on the router machine. Now, the router, shard, and config run on one machine and the other instances have only shards running. We had to turn off configs on the other machines as the router was throwing stale config exceptions and we wanted to get rid of the config machine that's being complained about. The following is the ulimit -a output, and it's similar on each machine. | |
| Comment by Greg Studer [ 30/Mar/12 ] | |
|
Also, mongodb won't prevent you from running out of space even with balancing on - it will try to spread load across the cluster, but it won't be perfect and you'll need to ensure you have enough room for the data. | |
| Comment by Greg Studer [ 30/Mar/12 ] | |
|
I don't think upgrading will solve your problem; it seems like there's a configuration issue.

> After this one of the config servers was holding lot of data in the moveChunk folder and has exhausted all the space

Config servers only hold metadata - and they shouldn't be handling migrations (unless your config server is also a shard - not recommended). How exactly is your cluster set up, and on what hardware?

You'll need to make sure that your ulimits are correctly set for fds and processes on every server, and the config servers (you should use three) should ideally be on dedicated machines. All the mongoses will need to be able to communicate with all the config servers and shards, and all shards will need to be able to communicate with each other and the config servers. | |
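One hedged way to verify that connectivity, run from each machine in the cluster (the hostnames and ports below are placeholders, not from this ticket):

    for host in cfg1:27019 cfg2:27019 cfg3:27019 shard5-host:40005 shard6-host:40006; do
        mongo --host "${host%%:*}" --port "${host##*:}" --eval 'printjson(db.adminCommand({ ping: 1 }))'
    done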
| Comment by Preetham Derangula [ 30/Mar/12 ] | |
|
All the servers have a limit of 65,000+, so that shouldn't be an issue | |
| Comment by Scott Hernandez (Inactive) [ 30/Mar/12 ] | |
|
Please check each server and make sure your ulimit -n is at least a few thousand. That is most likely the issue. You will need to restart each instance for the changes to apply to the process. | |
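For example, on Linux (the user name and value below are illustrative, not taken from this ticket):

    ulimit -n                                          # open-file limit for the current shell
    grep 'open files' /proc/$(pidof mongod)/limits     # limit of a running mongod (assumes one mongod per machine)
    # to raise it persistently, add entries like these to /etc/security/limits.conf and restart the processes:
    #   mongodb  soft  nofile  64000
    #   mongodb  hard  nofile  64000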
| Comment by Preetham Derangula [ 30/Mar/12 ] | |
|
Some more history of the problem. I have 7 sharded collections and one of the collections is receiving data randomly. This collection is fairly big (65GB) compared with the other collections. This happens once I restart the router and shards. After some time I get this error, "out of file descriptors", and none of the collections will receive any writes.

    Fri Mar 30 04:45:31 [Balancer] chose [shard6] to [shard5] { id: "audit.tsauditentry-UID"23850964410"", lastmod: Timestamp 83000|1, ns: "audit.tsauditentry", min: { UID: "23850964410" }, max: { UID: "23851209616" }, shard: "shard6" }
    ... max: { UID: "23851209616" }) shard6:LRCHB00364:40006 -> shard5:LRCHB00363:40005
    ... errmsg: "moveChunk failed to engage TO-shard in the data transfer: db assertion failure", ok: 0.0 } from: shard6 to: shard5 chunk: { id: "audit.tsauditentry-UID"23850964410"", lastmod: Timestamp 83000|1, ns: "audit.tsauditentry", min: { UID: "23850964410" }, max: { UID: "23851209616" }, shard: "shard6" }
    (the same moveChunk failure repeats for the same chunk) | |
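To see how the chunks of that collection are spread, a quick check from a mongos shell (only the namespace comes from the log above; the rest is standard config metadata):

    use config
    db.shards.find().forEach(function(s) {
        var n = db.chunks.count({ ns: "audit.tsauditentry", shard: s._id });
        print(s._id + ": " + n + " chunks");
    });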
| Comment by Eliot Horowitz (Inactive) [ 30/Mar/12 ] | |
|
Can you attach the logs or at least a sample of the errors? |