Type: Bug
Resolution: Done
Priority: Critical - P2
Affects Version/s: 2.2.3
Component/s: Sharding
We have a shard that is nearly full while other, more recently added shards are nearly empty. I was wondering why one particularly large sharded collection is so much larger on two shards than on the others:
c.mongoose.command('collStats', 'centaur')
Out[51]:
{u'ns': u'mongoose.centaur',
 u'sharded': True,
 u'shards': {u'rs0': {u'size': 152076069392.0, u'storageSize': 154752093200.0},
             u'rs1': {u'size': 114195152272.0, u'storageSize': 116004875472.0},
             u'rs2': {u'size': 13189608896.0, u'storageSize': 13481529296.0},
             u'rs3': {u'size': 11661895416.0, u'storageSize': 13594558416.0},
             u'rs4': {u'size': 7642675920.0, u'storageSize': 7859703808.0}},
 u'size': 298765401896.0,
 u'storageSize': 305692760192.0,
 u'totalIndexSize': 434202832}
Data is roughly 10x larger on rs0 and rs1 than on rs2, rs3 and rs4.
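To see whether this comes down to chunk placement rather than chunk size, one thing worth checking is the number of chunks per shard in the config database. A minimal sketch, run against a mongos (the shard names are taken from the collStats output above):

mongos> use config
mongos> ["rs0", "rs1", "rs2", "rs3", "rs4"].forEach(function (s) {
...   // count the chunks of this collection that currently live on each shard
...   print(s + ": " + db.chunks.count({ ns: "mongoose.centaur", shard: s }));
... })

db.printShardingStatus() (or sh.status()) shows the same per-shard chunk counts, but the query above keeps the output to just this collection.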
Looking at the changelog in the config database:
mongos> db.changelog.find({ns: "mongoose.centaur"}).sort({time: -1}).pretty()
{
    "_id" : "dlonapahls211.maninvestments.com-2013-04-04T09:30:32-8",
    "server" : "dlonapahls211.maninvestments.com",
    "clientAddr" : "10.220.54.3:47785",
    "time" : ISODate("2013-04-04T09:30:32.119Z"),
    "what" : "split",
    "ns" : "mongoose.centaur",
    "details" : {
The most recent changelog entry for mongoose.centaur is from 36 hours ago.
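The changelog can also be filtered down to migration events, which makes it easier to see when the balancer last actually moved a chunk of this collection. A sketch, assuming the usual moveChunk.* values in the "what" field:

mongos> use config
mongos> db.changelog.find({
...   ns: "mongoose.centaur",
...   what: /^moveChunk/        // moveChunk.start / moveChunk.commit / moveChunk.from / moveChunk.to
... }).sort({ time: -1 }).limit(5).pretty()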
If I look in config.locks, I see:
mongos> db.locks.find({_id: "mongoose.centaur"}).pretty()
{
    "_id" : "mongoose.centaur",
    "process" : "dlonapahls211.maninvestments.com:27118:1365067417:1678333556",
    "state" : 0,
    "ts" : ObjectId("515d48375b3b8c1d49d2b605"),
    "when" : ISODate("2013-04-04T09:30:31.958Z"),
    "who" : "dlonapahls211.maninvestments.com:27118:1365067417:1678333556:conn575:189782212",
    "why" : "split-{ symbol: \"thingy\" }"
}
There are many 'old' entries in this collection - some from February and March - all in state 0. It's not clear to me what state 0 means, or whether it matters...
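As far as I can tell, state 0 in config.locks means the lock is not currently held (2 means held, 1 is a transient "being acquired" state), so old state-0 documents would just be leftovers from past splits and migrations. A sketch of what seems more relevant - whether anything still holds a lock right now:

mongos> use config
mongos> db.locks.find({ state: { $ne: 0 } }).pretty()   // locks currently held or being acquired
mongos> db.locks.find({ _id: "balancer" }).pretty()     // the balancer's own lock document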
In dlonapahls211's mongos log, around 2013-04-04T09:30:31.958 I see:
Thu Apr 4 09:29:10 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' acquired, ts : 515d39d60c29edf10f5bf0e2
Thu Apr 4 09:29:10 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' unlocked.
Thu Apr 4 09:29:17 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' acquired, ts : 515d39dc0c29edf10f5bf0e3
Thu Apr 4 09:29:17 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' unlocked.
Thu Apr 4 09:29:46 [LockPinger] cluster cn53:27117,cn54:27117,cn55:27117 pinged successfully at Thu Apr 4 09:29:46 2013 by distributed lock pinger 'cn53:27117,cn54:27117,cn55:27117/dlonapahls211.maninvestments.com:27119:1365027221:1804289383', sleeping for 30000ms
Thu Apr 4 09:30:11 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' acquired, ts : 515d3a130c29edf10f5bf0e4
Thu Apr 4 09:30:11 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' unlocked.
Thu Apr 4 09:30:30 [Balancer] could not acquire lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' (another update won)
Thu Apr 4 09:30:30 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' was not acquired.
Thu Apr 4 09:30:42 [Balancer] could not acquire lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' (another update won)
Thu Apr 4 09:30:42 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' was not acquired.
Thu Apr 4 09:31:54 [Balancer] warning: distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383 did not propagate properly. :: caused by :: 8017 update not consistent ns: config.locks query: { _id: "balancer", state: 0, ts: ObjectId('515d3a7ab07f340a2aafd3bf') } update: { $set: { state: 1, who: "dlonapahls211.maninvestments.com:27119:1365027221:1804289383:Balancer:846930886", process: "dlonapahls211.maninvestments.com:27119:1365027221:1804289383", when: new Date(1365064314764), why: "doing balance round", ts: ObjectId('515d3a7a0c29edf10f5bf0e7') } } gle1: { updatedExisting: true, n: 1, connectionId: 774, waited: 34, err: null, ok: 1.0 } gle2: { updatedExisting: false, n: 0, connectionId: 813, waited: 34, err: null, ok: 1.0 }
Thu Apr 4 09:31:54 [Balancer] lock update lost, lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' not propagated.
Thu Apr 4 09:31:54 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' was not acquired.
Thu Apr 4 09:32:17 [LockPinger] trying to delete 1 old lock entries for process dlonapahls211.maninvestments.com:27119:1365027221:1804289383
Thu Apr 4 09:32:17 [LockPinger] handled late remove of old distributed lock with ts 515d3a7a0c29edf10f5bf0e7
Thu Apr 4 09:32:36 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' acquired, ts : 515d3aa40c29edf10f5bf0e8
Thu Apr 4 09:32:37 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' unlocked.
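Given the "update not consistent" warning above, it might also be worth comparing the balancer lock document on each config server directly (cn53/cn54/cn55:27117 are the config servers named in the LockPinger line). A rough sketch, assuming the shell can reach each config server directly and no auth is in the way:

mongos> ["cn53:27117", "cn54:27117", "cn55:27117"].forEach(function (h) {
...   // read the balancer lock document straight from each config server
...   printjson(new Mongo(h).getDB("config").locks.findOne({ _id: "balancer" }));
... })

If the three documents disagree, that would at least be consistent with the lock update not propagating to all config servers.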
Any ideas on what the issue might be? How do I go about getting mongoose.centaur to be balanced again?
Is there any way to see which collections Mongo thinks are eligible for balancing - i.e. how can I tell whether this collection will eventually balance or never will?
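For anyone hitting the same question: my understanding is that the balancer skips a collection if balancing is disabled either cluster-wide or for that specific collection, so these checks should at least show whether mongoose.centaur is eligible (a sketch):

mongos> use config
mongos> db.settings.find({ _id: "balancer" }).pretty()             // { stopped: true } would mean balancing is off cluster-wide
mongos> db.collections.find({ _id: "mongoose.centaur" }).pretty()  // noBalance: true would exclude just this collection
mongos> sh.getBalancerState()                                      // shell helper: true if the balancer is enabled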