SERVER-9262: Mongo distributed locks aren't released => balancing stops working


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Fixed
    • Affects Version/s: 2.2.3
    • Fix Version/s: None
    • Component/s: Sharding
    • Labels: None
    • Operating System: ALL

      Description

      We have a shard that is nearly full, while other, more recently added shards are nearly empty. I was wondering why one particularly large sharded collection is so much larger on two shards than on the others:

      c.mongoose.command('collStats', 'centaur')
      Out[51]: 
      {u'ns': u'mongoose.centaur',
       u'sharded': True,
       u'shards': {u'rs0': {
          u'size': 152076069392.0,
          u'storageSize': 154752093200.0},
         u'rs1': {
          u'size': 114195152272.0,
          u'storageSize': 116004875472.0},
         u'rs2': {
          u'size': 13189608896.0,
          u'storageSize': 13481529296.0},
         u'rs3': {
          u'size': 11661895416.0,
          u'storageSize': 13594558416.0},
         u'rs4': {
          u'size': 7642675920.0,
          u'storageSize': 7859703808.0}},
       u'size': 298765401896.0,
       u'storageSize': 305692760192.0,
       u'totalIndexSize': 434202832}

      The data on rs0/rs1 is roughly 10x the size of the data on rs2/rs3/rs4.
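      For reference, a quick shell sketch (run against the mongos; these are just the collStats numbers above converted to GiB) makes the skew obvious:

      mongos> var stats = db.getSiblingDB("mongoose").runCommand({collStats: "centaur"})
      mongos> for (var s in stats.shards) { print(s + ": " + (stats.shards[s].size / Math.pow(1024, 3)).toFixed(1) + " GiB") }
      rs0: 141.6 GiB
      rs1: 106.4 GiB
      rs2: 12.3 GiB
      rs3: 10.9 GiB
      rs4: 7.1 GiB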

      Looking at the changelog in the config DB:

      mongos> db.changelog.find({ns: "mongoose.centaur"}).sort({time: -1}).pretty()
      {
              "_id" : "dlonapahls211.maninvestments.com-2013-04-04T09:30:32-8",
              "server" : "dlonapahls211.maninvestments.com",
              "clientAddr" : "10.220.54.3:47785",
              "time" : ISODate("2013-04-04T09:30:32.119Z"),
              "what" : "split",
              "ns" : "mongoose.centaur",
              "details" : {

      The last changelog entry for mongoose.centaur is from 36 hours ago.
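      (To separate splits from actual migrations, the changelog can be filtered on the event type - migrations show up as moveChunk.* entries - e.g.:

      mongos> db.changelog.find({ns: "mongoose.centaur", what: /^moveChunk/}).sort({time: -1}).limit(1).pretty()

      which should show the last time a chunk of this collection actually moved.)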

      If I look in config.locks, I see:

      mongos> db.locks.find({_id: "mongoose.centaur"}).pretty()
      {
              "_id" : "mongoose.centaur",
              "process" : "dlonapahls211.maninvestments.com:27118:1365067417:1678333556",
              "state" : 0,
              "ts" : ObjectId("515d48375b3b8c1d49d2b605"),
              "when" : ISODate("2013-04-04T09:30:31.958Z"),
              "who" : "dlonapahls211.maninvestments.com:27118:1365067417:1678333556:conn575:189782212",
              "why" : "split-{ symbol: \"thingy\" }"
      }

      There are many 'old' entries in this collection - some from February and March - all in state 0. It's not clear to me what state 0 means, and whether this matters...
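      If I read the lock code right, state 0 just means "unlocked" (1 = being acquired, 2 = held), so stale state-0 documents should be harmless leftovers. Something like this should show whether anything is genuinely held or wedged:

      mongos> db.getSiblingDB("config").locks.find({state: {$gt: 0}}).pretty()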

      In dlonapahls211's mongos log, around 2013-04-04T09:30:31.958 I see:

      Thu Apr  4 09:29:10 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' acquired, ts : 515d39d60c29edf10f5bf0e2
      Thu Apr  4 09:29:10 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' unlocked. 
      Thu Apr  4 09:29:17 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' acquired, ts : 515d39dc0c29edf10f5bf0e3
      Thu Apr  4 09:29:17 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' unlocked. 
      Thu Apr  4 09:29:46 [LockPinger] cluster cn53:27117,cn54:27117,cn55:27117 pinged successfully at Thu Apr  4 09:29:46 2013 by distributed lock pinger 'cn53:27117,cn54:27117,cn55:27117/dlonapahls211.maninvestments.com:27119:1365027221:1804289383', sleeping for 30000ms
      Thu Apr  4 09:30:11 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' acquired, ts : 515d3a130c29edf10f5bf0e4
      Thu Apr  4 09:30:11 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' unlocked. 
      Thu Apr  4 09:30:30 [Balancer] could not acquire lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' (another update won)
      Thu Apr  4 09:30:30 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' was not acquired.
      Thu Apr  4 09:30:42 [Balancer] could not acquire lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' (another update won)
      Thu Apr  4 09:30:42 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' was not acquired.
      Thu Apr  4 09:31:54 [Balancer] warning: distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383 did not propagate properly. :: caused by :: 8017 update not consistent  ns: config.locks query: { _id: "balancer", state: 0, ts: ObjectId('515d3a7ab07f340a2aafd3bf') } update: { $set: { state: 1, who: "dlonapahls211.maninvestments.com:27119:1365027221:1804289383:Balancer:846930886", process: "dlonapahls211.maninvestments.com:27119:1365027221:1804289383", when: new Date(1365064314764), why: "doing balance round", ts: ObjectId('515d3a7a0c29edf10f5bf0e7') } } gle1: { updatedExisting: true, n: 1, connectionId: 774, waited: 34, err: null, ok: 1.0 } gle2: { updatedExisting: false, n: 0, connectionId: 813, waited: 34, err: null, ok: 1.0 }
      Thu Apr  4 09:31:54 [Balancer] lock update lost, lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' not propagated.
      Thu Apr  4 09:31:54 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' was not acquired.
      Thu Apr  4 09:32:17 [LockPinger] trying to delete 1 old lock entries for process dlonapahls211.maninvestments.com:27119:1365027221:1804289383
      Thu Apr  4 09:32:17 [LockPinger] handled late remove of old distributed lock with ts 515d3a7a0c29edf10f5bf0e7
      Thu Apr  4 09:32:36 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' acquired, ts : 515d3aa40c29edf10f5bf0e8
      Thu Apr  4 09:32:37 [Balancer] distributed lock 'balancer/dlonapahls211.maninvestments.com:27119:1365027221:1804289383' unlocked. 
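      If I'm reading the 09:31:54 warning right, taking the lock is a compare-and-swap on config.locks that has to apply consistently on all of the config servers. Reconstructed from the query/update shown in that warning, the balancer effectively issued:

      mongos> db.getSiblingDB("config").locks.update(
      ...     { _id: "balancer", state: 0, ts: ObjectId("515d3a7ab07f340a2aafd3bf") },  // only take the lock if it is still free
      ...     { $set: { state: 1,
      ...               who: "dlonapahls211.maninvestments.com:27119:1365027221:1804289383:Balancer:846930886",
      ...               process: "dlonapahls211.maninvestments.com:27119:1365027221:1804289383",
      ...               when: new Date(1365064314764),
      ...               why: "doing balance round",
      ...               ts: ObjectId("515d3a7a0c29edf10f5bf0e7") } })

      The gle1/gle2 results show one config server applying the update (n: 1) while another didn't (n: 0), after which the LockPinger's "late remove" rolled the half-taken lock back.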

      Any ideas what the issue might be? How do I go about getting mongoose.centaur balanced again?

      Is there any way to see which collections Mongo considers eligible for balancing - i.e. how can I tell whether a collection will eventually be balanced versus never be balanced?
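      (My understanding is that a collection can be excluded from balancing by a noBalance flag on its config.collections document, and the balancer as a whole by a stopped flag on config.settings, so these seem like the first things to check:

      mongos> db.getSiblingDB("config").collections.find({_id: "mongoose.centaur"}).pretty()   // noBalance: true would exclude it
      mongos> db.getSiblingDB("config").settings.find({_id: "balancer"}).pretty()              // stopped: true disables all balancing
      mongos> sh.getBalancerState()

      If neither flag is set, the collection should be eligible whenever its chunk counts differ enough between shards.)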

    People

    • Assignee: Unassigned
    • Reporter: James Blackburn (jblackburn)