Core Server / SERVER-10662

Sharding stopped working on a collection

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Major - P3
    • Affects Version/s: 2.2.3
    • Component/s: Sharding
    • Environment:
      Linux debian 6.0.5

      Hello,

      We experienced a problem on our MongoDB sharded cluster.
      We use MongoDB 2.2.3 on Debian Linux 6.0.5.

      One of the 3 config servers failed recently.
      The server came back online a few hours later.
      In the mongos logfile we could observe:

      Thu Aug 29 10:50:00 [CheckConfigServers] ERROR: config servers not in sync! config servers 172.16.16.1:27019 and 172.16.18.1:27019 differ
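
      (For reference, a rough way to see which config server actually diverged is to compare a hash of the config database on each of the three config servers; these are generic mongo shell commands, not a transcript of what we ran:)

      // Run against each config server directly (port 27019) and compare the
      // resulting hashes; the server whose hashes differ is the one out of sync.
      db.getSiblingDB("config").runCommand({ dbHash: 1 })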

      To recover from this state we did:

      1) Disabled the balancer (a quick balancer check is sketched after this list) with:
      sh.setBalancerState(false)

      2) Stopped the mongodb-conf daemon on the failed server (server1) and on a second server (server2); we left the third server and its config server running.

      3) Rsynced the configdb data from server2 to server1.

      4) Restarted the mongodb-conf daemon on server2: OK

      5) Restarted the mongodb-conf daemon on server1: OK

      6) Re-enabled the balancer with:

      sh.setBalancerState(true)
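
      (For reference, the quick balancer check mentioned in step 1, run from a mongos shell; generic commands, not an exact transcript of what we ran:)

      // sh.getBalancerState() reports whether the balancer is enabled,
      // sh.isBalancerRunning() reports whether a balancing round is still in flight,
      // and the config.locks document shows whether the balancer lock is held (state 0 = free).
      sh.getBalancerState()
      sh.isBalancerRunning()
      db.getSiblingDB("config").locks.findOne({ _id: "balancer" })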

      Everything seemed OK, but now we can see this issue in the logs:

      [Balancer] caught exception while doing balance: not sharded:rawlogs.raw_log

      The collections seem to be present, but sharding is not OK:

      mongos> db.collections.find()
      { "_id" : "rawlogs.raw_log", "lastmod" : ISODate("1970-01-16T19:08:22.332Z"), "dropped" : false, "key" : { "_id" : 1 }, "unique" : false, "lastmodEpoch" : ObjectId("515a78325c52d82fad24aa03") }
      { "_id" : "rawlogs.raw_log_ghost", "lastmod" : ISODate("1970-01-16T22:04:08.874Z"), "dropped" : false, "key" : { "_id" : 1 }, "unique" : false, "lastmodEpoch" : ObjectId("51fbaf295c52d82fad24eccb") }
      mongos>

      mongos> db.raw_log.stats()
      {
      "sharded" : false,
      "primary" : "shard1",
      "ns" : "rawlogs.raw_log",
      "count" : 2380607210,
      "size" : 1044708269072,
      "avgObjSize" : 438.84109259334724,
      "storageSize" : NumberLong("1116861681584"),
      "numExtents" : 541,
      "nindexes" : 1,
      "lastExtentSize" : 2146426864,
      "paddingFactor" : 1,
      "systemFlags" : 1,
      "userFlags" : 0,
      "totalIndexSize" : 80006256176,
      "indexSizes" :

      { "_id_" : 80006256176 }

      ,
      "ok" : 1
      }
      mongos>

      But apparently the sharding configuration is still set (as it was before):

      mongos> sh.status()
      --- Sharding Status ---
        sharding version: { "_id" : 1, "version" : 3 }
        shards:
          { "_id" : "shard1", "host" : "shard1/172.16.19.1:27018,172.16.19.2:27018" }
          { "_id" : "shard2", "host" : "shard2/172.16.19.3:27018,172.16.19.4:27018" }
          { "_id" : "shard3", "host" : "shard3/172.16.19.5:27018,172.16.19.6:27018" }
          { "_id" : "shard4", "host" : "shard4/172.16.19.7:27018,172.16.19.8:27018" }
          { "_id" : "shard5", "host" : "shard5/172.16.19.10:27018,172.16.19.9:27018" }
        databases:
          { "_id" : "admin", "partitioned" : false, "primary" : "config" }
          { "_id" : "rawlogs", "partitioned" : true, "primary" : "shard1" }
            rawlogs.raw_log chunks:
              shard1  24615
              shard2  8279
              shard3  3314
              shard4  3498
              shard5  10263
            too many chunks to print, use verbose if you want to force print
            rawlogs.raw_log_ghost chunks:
              shard1  368
              shard3  277
              shard2  277
              shard4  414
              shard5  1162
            too many chunks to print, use verbose if you want to force print
          { "_id" : "tempstats", "partitioned" : false, "primary" : "shard1" }
          { "_id" : "test", "partitioned" : false, "primary" : "shard3" }
          { "_id" : "stats", "partitioned" : false, "primary" : "shard5" }
          { "_id" : "rawlog", "partitioned" : false, "primary" : "shard5" }

      mongos>
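
      (The chunk documents that sh.status() is counting can also be inspected directly in the config database through the mongos; roughly, not an exact transcript:)

      // The count should match the per-shard totals printed by sh.status(),
      // and each chunk document carries a lastmodEpoch field.
      var cfg = db.getSiblingDB("config")
      cfg.chunks.count({ ns: "rawlogs.raw_log" })
      cfg.chunks.findOne({ ns: "rawlogs.raw_log" })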

      We also tried to repeat the procedure (recovery of config server 1 from config server 2) with the mongos stopped.
      It didn't help; when we restarted the mongos and re-enabled the balancer we could see this in the logs:

      Fri Aug 30 10:11:36 [Balancer] warning: got invalid chunk version 1|0||521f0c0563b2cfc94d8fad9b in document { _id: "rawlogs.raw_log-_id_MinKey", lastmod: Timestamp 1000|0, lastmodEpoch: ObjectId('521f0c0563b2cfc94d8fad9b'), ns: "rawlogs.raw_log", min: { _id: MinKey }, max: { _id: BinData }, shard: "shard1" } when trying to load differing chunks at version 0|0||515a78325c52d82fad24aa03
      Fri Aug 30 10:11:36 [Balancer] warning: major change in chunk information found when reloading rawlogs.raw_log, previous version was 0|0||515a78325c52d82fad24aa03
      Fri Aug 30 10:11:36 [Balancer] ChunkManager: time to load chunks for rawlogs.raw_log: 48ms sequenceNumber: 2 version: 0|0||000000000000000000000000 based on: (empty)
      Fri Aug 30 10:11:36 [Balancer] warning: no chunks found for collection rawlogs.raw_log, assuming unsharded
      Fri Aug 30 10:11:36 [Balancer] ChunkManager: time to load chunks for rawlogs.raw_log_ghost: 31ms sequenceNumber: 3 version: 707|393||51fbaf295c52d82fad24eccb based on: (empty)
      Fri Aug 30 10:11:36 [Balancer] distributed lock 'balancer/mycompt.local:27021:1377850265:1804289383' unlocked.
      Fri Aug 30 10:11:36 [Balancer] scoped connection to 172.16.16.1:27019,172.16.18.1:27019,172.16.18.2:27019 not being returned to the pool
      Fri Aug 30 10:11:36 [Balancer] caught exception while doing balance: not sharded:rawlogs.raw_log
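
      (These warnings suggest the chunk documents now carry epoch 521f0c0563b2cfc94d8fad9b while the config.collections entry for rawlogs.raw_log still has epoch 515a78325c52d82fad24aa03; a rough way to confirm the mismatch from a mongos shell:)

      // Compare the epoch recorded for the collection with the epoch carried by its
      // chunk documents; when they differ, the mongos rejects the chunks and logs
      // "no chunks found for collection ... assuming unsharded".
      var cfg = db.getSiblingDB("config")
      cfg.collections.findOne({ _id: "rawlogs.raw_log" }).lastmodEpoch
      cfg.chunks.findOne({ ns: "rawlogs.raw_log" }).lastmodEpoch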

      We don't want sharding to start again 'from scratch';
      we'd like to restore continuity with the state before config server 1 failed.

      We've already tried to refresh the mongos with:

      db.adminCommand({ flushRouterConfig: 1 })

      without any better result.
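
      (As far as we understand, flushRouterConfig only clears the sharding metadata cache of the mongos that receives it; it does not change anything in the config database, so with several mongos processes it would have to be sent to each one:)

      // Sent to the admin database of each mongos; it only resets that router's
      // cached metadata and does not repair the config metadata itself.
      db.adminCommand({ flushRouterConfig: 1 })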

      Unfortunately we haven't preserved the contents of the crashed config server that we replaced.

      Any idea how we can resume sharding, please?

            Assignee:
            Unassigned
            Reporter:
            Anthony Pastor (anthony@stickyads.tv)
            Votes:
            0
            Watchers:
            3
