SERVER-9262: Mongo distributed locks aren't released => balancing stops working


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Fixed
    • Affects Version/s: 2.2.3
    • Fix Version/s: None
    • Component/s: Sharding
    • Labels: None
    • Operating System: ALL

      Description

      We have a shard that is nearly full, while other, more recently added shards are nearly empty. I was wondering why one particularly large sharded collection is so much larger on two shards than on the others:

      c.mongoose.command('collStats', 'centaur')
      Out[51]: 
      {u'ns': u'mongoose.centaur',
       u'sharded': True,
       u'shards': {u'rs0': {
          u'size': 152076069392.0,
          u'storageSize': 154752093200.0},
         u'rs1': {
          u'size': 114195152272.0,
          u'storageSize': 116004875472.0},
         u'rs2': {
          u'size': 13189608896.0,
          u'storageSize': 13481529296.0},
         u'rs3': {
          u'size': 11661895416.0,
          u'storageSize': 13594558416.0},
         u'rs4': {
          u'size': 7642675920.0,
          u'storageSize': 7859703808.0}},
       u'size': 298765401896.0,
       u'storageSize': 305692760192.0,
       u'totalIndexSize': 434202832}

      The data on rs0/rs1 is roughly 10x the size of the data on rs2/rs3/rs4.
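      For reference, a quick shell sketch (run against the mongos; these are just the collStats numbers above converted to GiB) makes the skew obvious:

      mongos> var stats = db.getSiblingDB("mongoose").runCommand({collStats: "centaur"})
      mongos> for (var s in stats.shards) { print(s + ": " + (stats.shards[s].size / Math.pow(1024, 3)).toFixed(1) + " GiB") }
      rs0: 141.6 GiB
      rs1: 106.4 GiB
      rs2: 12.3 GiB
      rs3: 10.9 GiB
      rs4: 7.1 GiB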

      Looking at the changelog in the config DB:

      mongos> db.changelog.find({ns: "mongoose.centaur"}).sort({time: -1}).pretty()
      {
              "_id" : "dlonapahls211.maninvestments.com-2013-04-04T09:30:32-8",
              "server" : "dlonapahls211.maninvestments.com",
              "clientAddr" : "10.220.54.3:47785",
              "time" : ISODate("2013-04-04T09:30:32.119Z"),
              "what" : "split",
              "ns" : "mongoose.centaur",
              "details" : {

      The last changelog entry for mongoose.centaur is from 36 hours ago.
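      (To separate splits from actual migrations, the changelog can be filtered on the event type - migrations show up as moveChunk.* entries - e.g.:

      mongos> db.changelog.find({ns: "mongoose.centaur", what: /^moveChunk/}).sort({time: -1}).limit(1).pretty()

      which should show the last time a chunk of this collection actually moved.)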

      If I look in config.locks, I see:

      mongos> db.locks.find({_id: "mongoose.centaur"}).pretty()
      {
              "_id" : "mongoose.centaur",
              "process" : "dlonapahls211.maninvestments.com:27118:1365067417:1678333556",
              "state" : 0,
              "ts" : ObjectId("515d48375b3b8c1d49d2b605"),
              "when" : ISODate("2013-04-04T09:30:31.958Z"),
              "who" : "dlonapahls211.maninvestments.com:27118:1365067417:1678333556:conn575:189782212",
              "why" : "split-{ symbol: \"thingy\" }"
      }

      There are many 'old' entries in this collection - some from February and March - all in state 0. It's not clear to me what state 0 means, and whether this matters...
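      If I read the lock code right, state 0 just means "unlocked" (1 = being acquired, 2 = held), so stale state-0 documents should be harmless leftovers. Something like this should show whether anything is genuinely held or wedged:

      mongos> db.getSiblingDB("config").locks.find({state: {$gt: 0}}).pretty()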

      In dlonapahls211's mongos log, around 2013-04-04T09:30:31.958 I see:

      Thu Apr  4 09:29:10 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' acquired, ts : 515d39d60c29edf10f5bf0e2
      Thu Apr  4 09:29:10 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' unlocked. 
      Thu Apr  4 09:29:17 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' acquired, ts : 515d39dc0c29edf10f5bf0e3
      Thu Apr  4 09:29:17 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' unlocked. 
      Thu Apr  4 09:29:46 [LockPinger] cluster cn53:27117,cn54:27117,cn55:27117 pinged successfully at Thu Apr  4 09:29:46 2013 by distributed lock pinger 'cn53:27117,cn54:27117,cn55:27117/dlonapahls211.maninvestments.com:27119:1365027221:1804289383', sleeping for 30000ms
      Thu Apr  4 09:30:11 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' acquired, ts : 515d3a130c29edf10f5bf0e4
      Thu Apr  4 09:30:11 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' unlocked. 
      Thu Apr  4 09:30:30 [Balancer] could not acquire lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' (another update won)
      Thu Apr  4 09:30:30 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' was not acquired.
      Thu Apr  4 09:30:42 [Balancer] could not acquire lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' (another update won)
      Thu Apr  4 09:30:42 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' was not acquired.
      Thu Apr  4 09:31:54 [Balancer] warning: distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383 did not propagate properly. :: caused by :: 8017 update not consistent  ns: config.locks query: { _id: "balancer", state: 0, ts: ObjectId('515d3a7ab07f340a2aafd3bf') } update: { $set: { state: 1, who: "dlonapahls211.maninvestments.com:27119:1365027221:1804289383:Balancer:846930886", process: "dlonapahls211.maninvestments.com:27119:1365027221:1804289383", when: new Date(1365064314764), why: "doing balance round", ts: ObjectId('515d3a7a0c29edf10f5bf0e7') } } gle1: { updatedExisting: true, n: 1, connectionId: 774, waited: 34, err: null, ok: 1.0 } gle2: { updatedExisting: false, n: 0, connectionId: 813, waited: 34, err: null, ok: 1.0 }
      Thu Apr  4 09:31:54 [Balancer] lock update lost, lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' not propagated.
      Thu Apr  4 09:31:54 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' was not acquired.
      Thu Apr  4 09:32:17 [LockPinger] trying to delete 1 old lock entries for process dlonapahls211.maninvestments.com:27119:1365027221:1804289383
      Thu Apr  4 09:32:17 [LockPinger] handled late remove of old distributed lock with ts 515d3a7a0c29edf10f5bf0e7
      Thu Apr  4 09:32:36 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' acquired, ts : 515d3aa40c29edf10f5bf0e8
      Thu Apr  4 09:32:37 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' unlocked. 
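      If I'm reading the 09:31:54 warning right, taking the lock is a compare-and-swap on config.locks that has to apply consistently on all of the config servers. Reconstructed from the query/update shown in that warning, the balancer effectively issued:

      mongos> db.getSiblingDB("config").locks.update(
      ...     { _id: "balancer", state: 0, ts: ObjectId("515d3a7ab07f340a2aafd3bf") },  // only take the lock if it is still free
      ...     { $set: { state: 1,
      ...               who: "dlonapahls211.maninvestments.com:27119:1365027221:1804289383:Balancer:846930886",
      ...               process: "dlonapahls211.maninvestments.com:27119:1365027221:1804289383",
      ...               when: new Date(1365064314764),
      ...               why: "doing balance round",
      ...               ts: ObjectId("515d3a7a0c29edf10f5bf0e7") } })

      The gle1/gle2 results show one config server applying the update (n: 1) while another didn't (n: 0), after which the LockPinger's "late remove" rolled the half-taken lock back.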

      Any ideas what the issue might be? How do I go about getting mongoose.centaur balanced again?

      Is there any way to see which collections Mongo considers eligible for balancing - i.e. how can I tell whether a collection will eventually be balanced versus never be balanced?
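      (My understanding is that a collection can be excluded from balancing by a noBalance flag on its config.collections document, and the balancer as a whole by a stopped flag on config.settings, so these seem like the first things to check:

      mongos> db.getSiblingDB("config").collections.find({_id: "mongoose.centaur"}).pretty()   // noBalance: true would exclude it
      mongos> db.getSiblingDB("config").settings.find({_id: "balancer"}).pretty()              // stopped: true disables all balancing
      mongos> sh.getBalancerState()

      If neither flag is set, the collection should be eligible whenever its chunk counts differ enough between shards.)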

    People

    • Assignee: Unassigned
    • Reporter: James Blackburn (jblackburn)