[SERVER-6749] What exactly causes "can't find index in storeCurrentLocs" Created: 10/Aug/12 Updated: 10/Dec/14 Resolved: 20/Aug/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.0.7 |
| Fix Version/s: | None |
| Type: | Question | Priority: | Major - P3 |
| Reporter: | Remon van Vliet | Assignee: | Spencer Brody (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Participants: | |||||
| Description |
|
Every now and then we run into errors like these:

Fri Aug 10 11:29:38 [Balancer] moveChunk result: { errmsg: "can't find index in storeCurrentLocs", ok: 0.0 }
Fri Aug 10 11:29:38 [Balancer] balancer move failed: { errmsg: "can't find index in storeCurrentLocs", ok: 0.0 }

On top of that, my config.mongos collection is slowly filling up with documents, so those are apparently not cleaned up (should they be?). For example, I see 71 documents there while only 20 mongos are currently connected. Any idea what can cause these sorts of issues? Can killing mongos processes cause this, for example? |
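A minimal shell sketch for checking this, assuming the `ping` timestamp that each mongos periodically writes to its document in config.mongos; entries whose `ping` has stopped advancing most likely belong to mongos processes that were killed rather than cleanly shut down (the one-day cutoff is an arbitrary example):

```javascript
// Run against a mongos.
use config
var cutoff = new Date(Date.now() - 24 * 60 * 60 * 1000);
// Documents that have not pinged recently are probably stale leftovers.
db.mongos.find({ ping: { $lt: cutoff } }).forEach(printjson);
db.mongos.count();   // compare with the number of mongos actually running
```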
| Comments |
| Comment by Kevin J. Rice [ 21/Jul/14 ] | |||||||||||||||||||||
|
Just had this happen. Noticed that chunk migration had stopped for some number of days. Last week, for various reasons, I had to kill all mongod processes, some of which, it turned out, were rebuilding indexes at the time. This led to a huge hassle: I had to run repairDatabase on those shards/replica sets. I also noticed that some indexes were missing after the repair and had to apply them again with ensureIndex(). Fast forward to this week: I kept seeing no balancing and these silly messages like the above "can't find index in storeCurrentLocs." SOLVED: reIndex didn't work, but I noticed that its output listed the indexes on each shard. Some of the shards were missing an index, which turned out to be the shard key index. I have now issued an ensureIndex() command for that shard key index, and it has been rebuilt. I am not seeing the error anymore. Note that I had to run repairDatabase after this because I had another error:
TO DO WORK for MONGO: When this error occurs, additional text in the error message, something like, "This could be caused by missing indexes on some shards." would be helpful. | |||||||||||||||||||||
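For anyone hitting the same thing, a rough sketch of the check/fix described above, run from a shell connected directly to each shard's primary. The database/collection names and the { _h: 1 } shard key are taken from the other comments on this ticket and may differ in your cluster:

```javascript
// List the indexes this shard actually has and compare across shards.
var coll = db.getSiblingDB("user").getCollection("user.users");
coll.getIndexes().forEach(printjson);

// If the shard key index is missing on this shard, recreate it;
// moveChunk cannot run without it ("can't find index in storeCurrentLocs").
coll.ensureIndex({ _h: 1 });
```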
| Comment by Remon van Vliet [ 15/Jul/13 ] | |||||||||||||||||||||
|
Today we had the same error, which led to a catastrophic failure for one of our customers. Version 2.0.8. I will open a new issue once I have collected all logs and such, but I wanted to be able to refer to this one and vice versa. Basically, the scenario in this case is as follows:
I'm aware this is an atypical scenario given the frequency of capacity changes, but we're not doing anything beyond invoking admin commands. We suspect that either the cluster is left in a broken state after the downscale or the upscale itself provokes the errors ("can't find index in storeCurrentLocs"). Either way, the cluster is not able to move chunks to other shards, but it did enough to produce inconsistencies in the data somehow (our user account count went from ~617k to ~890k without any users being added, indicating some sort of duplication). In any case, I'll follow up with an issue in Community Private. | |||||||||||||||||||||
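One rough way to see where the extra documents ended up is to compare the count seen through mongos with the per-shard counts; the hosts and namespace below are illustrative. A sum well above the expected number of users points at documents that were copied by moveChunk but never cleaned up on the donor shard:

```javascript
// Hypothetical shard hosts; run from a plain mongo shell.
var shardHosts = ["shard0000.example.internal:27018", "shard0006.example.internal:27018"];
var sum = 0;
shardHosts.forEach(function (host) {
    var c = new Mongo(host).getDB("user").getCollection("user.users").count();
    print(host + ": " + c);
    sum += c;
});
print("sum of per-shard counts: " + sum);
// Compare against the count seen through mongos:
// db.getSiblingDB("user").getCollection("user.users").count()
```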
| Comment by Remon van Vliet [ 15/Aug/12 ] | |||||||||||||||||||||
|
Hi Spencer. Unfortunately the logs are gone for that period. It's on a test environment that doesn't back up the logs. I'll make sure we store all the mongos logs when it occurs again. For now I suppose you can close this issue since I cannot provide further information. I'll report an actual issue next time with more information. Thanks for the support | |||||||||||||||||||||
| Comment by Spencer Brody (Inactive) [ 15/Aug/12 ] | |||||||||||||||||||||
|
Hi Remon - any update on getting that log data so we can continue investigating what went wrong? I'd like logs from the mongos that was seeing the problem, as well as from nodes that were the primaries at the time for each shard's replica set. As stated before, I'd like to get the time period covering the full time that you were seeing the problem, as well as several hours before it started and after it was resolved. | |||||||||||||||||||||
| Comment by Remon van Vliet [ 13/Aug/12 ] | |||||||||||||||||||||
|
I might be able to find the log data. It happens rather randomly. I understand we ideally shouldn't edit metadata but we need a fast solution and DNS tends to not be that. This seemed simplest but we'll look into alternatives. Still this seems like a state issue that shouldn't be allowed to happen. | |||||||||||||||||||||
| Comment by Spencer Brody (Inactive) [ 10/Aug/12 ] | |||||||||||||||||||||
|
Do you know when exactly this started happening? Do you have logs from that time period? We'd like to have several hours worth of logs from before and after the problem appeared to help our debugging. We strongly discourage editing config data directly. Instead of using IP addresses in your configuration directly, it would be better to use elastic hostnames/DNS hostnames that can be updated when the IP addresses of the shards change, without having to modify the config data. | |||||||||||||||||||||
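A sketch of what that looks like when adding shards, with made-up hostnames. Once shards are registered under stable DNS names, re-pointing the records after new EC2 instances come up is enough, and config.shards never needs to be edited:

```javascript
// Register the shard by a stable DNS name instead of an EC2 IP address.
db.getSiblingDB("admin").runCommand({
    addShard: "shard0006/shard0006-a.mycluster.internal:27018"
});
// After a scale-down/up cycle, update the DNS records to point at the new
// instances; the host strings stored in config.shards remain valid.
```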
| Comment by Remon van Vliet [ 10/Aug/12 ] | |||||||||||||||||||||
|
Updating the config.shards collection so that the chunks are assigned to the correct (or wrong, I guess) shard kickstarts the balancing again. So basically the most important question is how we end up in this state in the first place. | |||||||||||||||||||||
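For reference, a small sketch for seeing how the config metadata assigns chunks to shards before and after such an edit; the namespace is the one from the changelog entries further down:

```javascript
use config
var chunksPerShard = {};
db.chunks.find({ ns: "user.user.users" }).forEach(function (chunk) {
    chunksPerShard[chunk.shard] = (chunksPerShard[chunk.shard] || 0) + 1;
});
printjson(chunksPerShard);   // e.g. { "shard0000" : 1234, "shard0006" : 0 }
```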
| Comment by Remon van Vliet [ 10/Aug/12 ] | |||||||||||||||||||||
|
And more info: it seems that the data is on the wrong shard. If we connect to shard0006 directly we find 26 million users there, but the chunks collection has all those chunks on shard0000. The only somewhat sketchy thing we do is overwrite the IP addresses in the config.shards collection with the new IPs of the shard machines (we have a rather unique use case where we bring up large clusters for a TV show and bring them down again; when they come back up, EC2 assigns new IP addresses, hence this "hack". It could be solved with naming). This is done when all mongos processes are down. The IP addresses are correct when this error occurs. | |||||||||||||||||||||
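Roughly what that workaround amounts to, with an illustrative shard id and IP address; as noted in the comment above, editing config metadata directly is strongly discouraged:

```javascript
// Run against the config database while every mongos is shut down,
// so no router is still using the old host string.
use config
db.shards.update(
    { _id: "shard0006" },                   // shard whose machine got a new EC2 IP
    { $set: { host: "10.0.1.42:27018" } }   // illustrative new address
)
```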
| Comment by Remon van Vliet [ 10/Aug/12 ] | |||||||||||||||||||||
|
Related aborts in config.changelog:

{ "_id" : "xxxxxx.eu-west-1.compute.internal-2012-08-10T09:10:34-1", "server" : "xxxxxx.eu-west-1.compute.internal", "clientAddr" : "x.x.x.x:60276", "time" : ISODate("2012-08-10T09:10:34.741Z"), "what" : "moveChunk.from", "ns" : "user.user.users", "details" : { "min" : { "_h" : 417566254 }, "max" : { "_h" : 447392415 }, "step1" : 800, "step2" : 604, "note" : "aborted" } }

..., "max" : { "_h" : 447392415 }, "step1" : 0, "step2" : 341, "note" : "aborted" } } | |||||||||||||||||||||
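Entries like these can be pulled out of the changelog with a query along these lines:

```javascript
use config
db.changelog.find(
    { what: "moveChunk.from", "details.note": "aborted" }
).sort({ time: -1 }).forEach(printjson);
```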
| Comment by Remon van Vliet [ 10/Aug/12 ] | |||||||||||||||||||||
|
To add to this, I think the issue is related to a stale lock (either cause or consequence): the collection that is no longer balancing shows up in the config.locks collection, and the same migrate keeps looping between state 0 and state 2 but, as shown in the error, never succeeds:

{ "_id" : "user.user.users", "process" : "xxxxxx.eu-west-1.compute.internal:27018:1344589833:30611607", "state" : 2, "ts" : ObjectId("5024dab6618939edbeef7e0e"), "when" : ISODate("2012-08-10T09:56:06.492Z"), "who" : "xxxxxx.eu-west-1.compute.internal:27018:1344589833:30611607:conn36:820470076", "why" : "migrate- { _h: 417566254 }" } |
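A quick way to list the distributed locks that are currently taken; in these lock documents, state 0 means free and state 2 means held, as in the document above:

```javascript
use config
// Locks that are not free; a long-lived entry for a single namespace,
// combined with repeated aborted migrations, suggests a stale/stuck lock.
db.locks.find({ state: { $ne: 0 } }).forEach(printjson);
```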