[SERVER-6749] What exactly causes "can't find index in storeCurrentLocs" Created: 10/Aug/12  Updated: 10/Dec/14  Resolved: 20/Aug/13

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.0.7
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Remon van Vliet Assignee: Spencer Brody (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
Participants:

 Description   

Every now and then we run into errors like these:

Fri Aug 10 11:29:38 [Balancer] moveChunk result: { errmsg: "can't find index in storeCurrentLocs", ok: 0.0 }
Fri Aug 10 11:29:38 [Balancer] balancer move failed: { errmsg: "can't find index in storeCurrentLocs", ok: 0.0 }

On top of that, my config.mongos collection is slowly filling up with documents, so those are apparently not cleaned up (should they be?). For example, I currently see 71 documents there while only 20 mongos are connected.
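
A rough way to see which of those registered entries are still active (this assumes the standard "ping" timestamp on config.mongos documents; the one-hour cutoff is arbitrary):

mongos> use config
mongos> db.mongos.count()                                   // total registered mongos entries
mongos> var cutoff = new Date(Date.now() - 60 * 60 * 1000)  // one hour ago
mongos> db.mongos.find({ ping: { $lt: cutoff } }).count()   // entries that have stopped pinging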

Any idea what can cause these sorts of issues? Can killing mongos processes cause this for example?



 Comments   
Comment by Kevin J. Rice [ 21/Jul/14 ]

Just had this happen. Noticed that chunk migration had stopped for some number of days. Last week, for various reasons, I had to kill all mongod processes, some of which, it turned out, were rebuilding indexes at the time. This led to a huge hassle; I had to run repairDatabase on those shards / replica sets.

Also noticed that some indexes were missing after this repair. Had to re-apply them with ensureIndex().

Flash forward to this week: I kept seeing no balancing and these silly messages like the above, "can't find index in storeCurrentLocs".

SOLVED: reIndex didn't work, but I noticed that its output listed the indexes on each shard. Some of the shards were missing an index, which turned out to be the shard key index. I have now issued an ensureIndex() for that shard key index on those shards, and it has been rebuilt. Not seeing the error anymore.
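
Roughly the commands involved, run against each shard's primary (database and collection names are placeholders for my setup, and the shard key field is hypothetical):

// Check whether the shard key index is present on this shard:
use myDBnameHere
db.myCollectionHere.getIndexes()

// If the shard key index (here { myShardKey: 1 }) is missing, recreate it:
db.myCollectionHere.ensureIndex({ myShardKey: 1 })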

Note that I had to run db.repairDatabase() after this because I had another error:

2014-07-21T10:58:20.941-0500 [conn43697] myDBnameHere:myCollectionHere  0xf812a6 0xf2f092 0xf18f47 0xf190cc 0x7b9a0b 0xb86276 0xb8cd17 0x99efb0 0x9a1769 0x9a22da 0x9a2a75 0x9a5d80 0x9b4174 0x9b5753 0x9b645e 0xc4da55 0xac964e 0xace300 0x7bae2f 0xf3f429 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x26) [0xf812a6]
 /usr/bin/mongod(_ZN5mongo10logContextEPKc+0x1a2) [0xf2f092]
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x137) [0xf18f47]
 /usr/bin/mongod() [0xf190cc]
 /usr/bin/mongod(_ZNK5mongo7BSONObj14_assertInvalidEv+0x2cb) [0x7b9a0b]
 /usr/bin/mongod(_ZN5mongo6updateERKNS_13UpdateRequestEPNS_7OpDebugEPNS_12UpdateDriverEPNS_14CanonicalQueryE+0x1e86) [0xb86276]
 /usr/bin/mongod(_ZN5mongo14UpdateExecutor7executeEv+0x67) [0xb8cd17]
 /usr/bin/mongod() [0x99efb0]
 /usr/bin/mongod(_ZN5mongo18WriteBatchExecutor10execUpdateERKNS_12BatchItemRefEPNS_7BSONObjEPPNS_16WriteErrorDetailE+0x69) [0x9a1769]
 /usr/bin/mongod(_ZN5mongo18WriteBatchExecutor11bulkExecuteERKNS_21BatchedCommandRequestEPSt6vectorIPNS_19BatchedUpsertDetailESaIS6_EEPS4_IPNS_16WriteErrorDetailESaISB_EE+0x24a) [0x9a22da]
 /usr/bin/mongod(_ZN5mongo18WriteBatchExecutor12executeBatchERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseE+0x615) [0x9a2a75]
 /usr/bin/mongod(_ZN5mongo8WriteCmd3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x410) [0x9a5d80]
 /usr/bin/mongod(_ZN5mongo12_execCommandEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x34) [0x9b4174]
 /usr/bin/mongod(_ZN5mongo7Command11execCommandEPS0_RNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xad3) [0x9b5753]
 /usr/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x2be) [0x9b645e]
 /usr/bin/mongod(_ZN5mongo11newRunQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0xa55) [0xc4da55]
 /usr/bin/mongod() [0xac964e]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x4b0) [0xace300]
 /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xef) [0x7bae2f]
 /usr/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x5a9) [0xf3f429]

TO DO WORK for MONGO: When this error occurs, additional text in the error message, something like "This could be caused by missing indexes on some shards.", would be helpful.

Comment by Remon van Vliet [ 15/Jul/13 ]

Today we had the same error, which led to a catastrophic failure for one of our customers. Version 2.0.8. I will open a new issue once I have collected all logs and such, but I wanted to be able to refer to this one and vice versa. Basically, the scenario in this case is as follows:

  • Every other day we scale a cluster up from 1 to many shards (10 in this case) at noon, and back down to 1 shard in the early morning.
  • For the upscale we invoke X (9 in this case) addShard commands concurrently and allow the balancer to redistribute chunks
  • For the downscale we invoke X (9 in this case) removeShard commands sequentially (one at a time, to avoid another apparent mongos/config state issue; see the sketch after this list)
  • Version is 2.0.8
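
A simplified sketch of the downscale loop (shard names are placeholders; our real tooling drives this from outside the shell, but the shape is the same):

// Drain and remove shards one at a time: removeShard has to be repeated
// until the shard reports state "completed" before moving on to the next.
var shardsToRemove = ["shard0001", "shard0002" /* , ... */];
shardsToRemove.forEach(function (name) {
    var res;
    do {
        res = db.getSiblingDB("admin").runCommand({ removeShard: name });
        printjson(res);
        if (res.ok && res.state != "completed") {
            sleep(30 * 1000);  // wait before polling the drain status again
        }
    } while (res.ok && res.state != "completed");
});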

I'm aware this is an atypical scenario given the frequency of capacity changes, but we're not doing anything beyond invoking admin commands. We suspect that either the cluster is left in a broken state after the downscale, or the upscale itself provokes the errors ("can't find index in storeCurrentLocs"). Either way, it is not able to move chunks to other shards, but it did enough to produce inconsistencies in the data somehow (our user account count went from ~617k to ~890k without any users being added, indicating some sort of duplication).

In any case, I'll follow up with an issue in Community Private.

Comment by Remon van Vliet [ 15/Aug/12 ]

Hi Spencer. Unfortunately the logs are gone for that period. It's on a test environment that doesn't back up the logs. I'll make sure we store all the mongos logs when it occurs again. For now I suppose you can close this issue since I cannot provide further information. I'll report an actual issue next time with more information. Thanks for the support

Comment by Spencer Brody (Inactive) [ 15/Aug/12 ]

Hi Remon - any update on getting that log data so we can continue investigating what went wrong? I'd like logs from the mongos that was seeing the problem, as well as from nodes that were the primaries at the time for each shard's replica set. As stated before, I'd like to get the time period covering the full time that you were seeing the problem, as well as several hours before it started and after it was resolved.

Comment by Remon van Vliet [ 13/Aug/12 ]

I might be able to find the log data. It happens rather randomly. I understand we ideally shouldn't edit metadata but we need a fast solution and DNS tends to not be that. This seemed simplest but we'll look into alternatives. Still this seems like a state issue that shouldn't be allowed to happen.

Comment by Spencer Brody (Inactive) [ 10/Aug/12 ]

Do you know when exactly this started happening? Do you have logs from that time period? We'd like to have several hours worth of logs from before and after the problem appeared to help our debugging.

We strongly discourage editing config data directly. Instead of using IP addresses in your configuration directly, it would be better to use elastic hostnames/DNS hostnames that can be updated when the IP addresses of the shards change, without having to modify the config data.
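
For example (the hostname is a placeholder), registering a shard under a DNS name you control means only the DNS record has to be repointed when EC2 hands out a new IP:

// Add the shard by a stable DNS name rather than its current EC2 IP
// (for a replica-set shard the string would be "rsName/host:port"):
db.getSiblingDB("admin").runCommand({ addShard: "shard0006.example.internal:27018" })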

Comment by Remon van Vliet [ 10/Aug/12 ]

Updating the config.shards collection so that the chunks are assigned to the correct (or wrong, I guess) shard kickstarts the balancing again. So basically the most important question is how we end up in this state in the first place.

Comment by Remon van Vliet [ 10/Aug/12 ]

More info: it seems that the data is on the wrong shard. If we connect to shard0006 directly we find 26 million users there, but the chunks collection has all those chunks on shard0000. The only somewhat sketchy thing we do is overwrite the IP addresses in the config.shards collection with the new IPs of the shard machines (we have a rather unique use case where we bring up large clusters for a TV show and bring them down again; when they come up, EC2 assigns new IP addresses, hence this "hack". It could be solved with naming). This is done when all mongos processes are down. The IP addresses are correct when this error occurs.
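
A rough sketch of what that comparison and the "hack" amount to (the IP address is a placeholder):

// Where the cluster metadata says the chunks live (via mongos or a config server):
db.getSiblingDB("config").chunks.find({ ns : "user.user.users" }, { shard : 1, min : 1, max : 1 })

// Connected to shard0006 directly, what it actually holds:
db.getSiblingDB("user").getCollection("user.users").count()

// The "hack": repoint a shard entry at the machine's new EC2 IP,
// done against the config servers while every mongos process is down:
db.getSiblingDB("config").shards.update({ _id : "shard0006" }, { $set : { host : "10.0.0.42:27018" } })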

Comment by Remon van Vliet [ 10/Aug/12 ]

Related aborts in config.changelog:

{ "_id" : "xxxxxx.eu-west-1.compute.internal-2012-08-10T09:10:34-1", "server" : "xxxxxx.eu-west-1.compute.internal", "clientAddr" : "x.x.x.x:60276", "time" : ISODate("2012-08-10T09:10:34.741Z"), "what" : "moveChunk.from", "ns" : "user.user.users", "details" : { "min" : { "_h" : 417566254 }, "max" : { "_h" : 447392415 }, "step1" : 800, "step2" : 604, "note" : "aborted" } }
{ "_id" : "xxxxxx.eu-west-1.compute.internal-2012-08-10T09:10:37-3", "server" : "xxxxxx.eu-west-1.compute.internal", "clientAddr" : "x.x.x.x:46468", "time" : ISODate("2012-08-10T09:10:37.067Z"), "what" : "moveChunk.from", "ns" : "user.user.users", "details" : { "min" : { "_h" : 417566254 }, "max" : { "_h" : 447392415 }, "step1" : 0, "step2" : 341, "note" : "aborted" } }

Comment by Remon van Vliet [ 10/Aug/12 ]

To add to this, I think the issue is related to a stale lock (either cause or consequence):

The collection that is no longer balancing appears in the config.locks collection, and the same migrate keeps looping between state 0 and state 2 but, as shown in the error, never succeeds:

{ "_id" : "user.user.users", "process" : "xxxxxx.eu-west-1.compute.internal:27018:1344589833:30611607", "state" : 2, "ts" : ObjectId("5024dab6618939edbeef7e0e"), "when" : ISODate("2012-08-10T09:56:06.492Z"), "who" : "xxxxxx.eu-west-1.compute.internal:27018:1344589833:30611607:conn36:820470076", "why" : "migrate-

{ _h: 417566254 }

" }
mongos> db.locks.find({_id:"user.user.users"})
{ "_id" : "user.user.users", "process" : "xxxxxx.eu-west-1.compute.internal:27018:1344589833:30611607", "state" : 0, "ts" : ObjectId("5024dab6618939edbeef7e0e"), "when" : ISODate("2012-08-10T09:56:06.492Z"), "who" : "xxxxxx.eu-west-1.compute.internal:27018:1344589833:30611607:conn36:820470076", "why" : "migrate-

{ _h: 417566254 }

" }
