  1. Core Server
  2. SERVER-29810

Mongos stops refreshing chunks and attempts impossible splits


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Duplicate
    • Affects Version/s: 3.4.4
    • Fix Version/s: None
    • Component/s: Sharding
    • Labels:
      None
    • Operating System:
      ALL

      Description

      Hello,

      Our mongos instances regularly stop refreshing chunk metadata from the config servers for some collections. When such a mongos then tries to split a chunk that another mongos has already split, the split fails with "IncompatibleShardingMetadata: Unable to find chunk with the exact bounds".

      Our MongoDB cluster details:

      • Many shards + config replica set, each formed by 3 members (1 primary + 2 secondaries)
      • 2 mongos
      • Balancer is disabled
      • Package version 3.4.4, OS: Debian 8 Jessie
      • Servers: 6-core Xeon CPU, 64 GB RAM, ~3 TB SSD, ext4 file system
      • ~ 40 collections in 1 DB
      • Many writes and reads

      Typical scenario (shard, collection, and field names and values were replaced):

      From mongos A logs:

      2017-06-21T14:46:42.087+0200 I SHARDING [conn6] Refreshing chunks for collection stats.collectionName based on version 9743|18393||5320f5e96789f4d11460c4a0
      2017-06-21T14:46:42.129+0200 I SHARDING [CatalogCacheLoader-1] Refresh for collection stats.collectionName took 42 ms and found version 9743|18393||5320f5e96789f4d11460c4a0
      

      From mongos B logs:

      2017-06-21T14:51:05.844+0200 I SHARDING [conn3103094] autosplitted stats.collectionName chunk: shard: shardName, lastmod: 9743|18367||5320f5e96789f4d11460c4a0, [{ _id: { d: 20170621, a: 78, c: 909090, d: 12345678 } }, { _id: { d: 20170621, a: 4444, b: 111111111, c: 222222, d: 3333333 } }) into 3 parts (splitThreshold 67108864) (migrate suggested, but no migrations allowed)
      

      From mongos A logs:

      2017-06-21T14:55:21.233+0200 I SHARDING [conn379] Split chunk { splitChunk: "stats.collectionName", configdb: "csReplSet/172.16.18.28:27025,172.16.18.3:27025,172.16.18.30:27025", from: "shardName", keyPattern: { _id: 1.0 }, shardVersion: [ Timestamp 9743000|149465, ObjectId('5320f5e96789f4d11460c4a0') ], min: { _id: { d: 20170621, a: 4444, b: 111111111, c: 222222, d: 3333333 } }, max: { _id: { d: 20170621, a: 4444, b: 121212121, c: 343434, d: 5656565 } }, splitKeys: [ { _id: { d: 20170621, a: 4444, b: 555555555, c: 666666, d: 7777777 } }, { _id: { d: 20170621, a: 4444, b: 888888888, c: 999, d: 000000 } } ] } failed :: caused by :: IncompatibleShardingMetadata: *Unable to find chunk with the exact bounds* [{ _id: { d: 20170621, a: 4444, b: 111111111, c: 222222, d: 3333333 } }, { _id: { d: 20170621, a: 4444, b: 121212121, c: 343434, d: 5656565 } }) at collection version 9743|18399||5320f5e96789f4d11460c4a0
      

      We can see that between the chunk refresh and the split attempt on mongos A, mongos B had already split that chunk, so the split attempt failed.
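
      To make the staleness concrete, here is a minimal sketch that compares the version mongos A found at its last refresh with the collection version reported in the split failure. It assumes the version strings follow the "major|minor||epoch" layout seen in the log lines; the function names parseChunkVersion and compareChunkVersions are hypothetical helpers, not part of MongoDB.

```javascript
// Parse a chunk version string of the form "major|minor||epoch",
// as it appears in the log lines of this report (layout is an assumption).
function parseChunkVersion(text) {
  var match = /^(\d+)\|(\d+)\|\|([0-9a-f]{24})$/.exec(text);
  if (!match) {
    throw new Error("unrecognized chunk version: " + text);
  }
  return {
    major: parseInt(match[1], 10),
    minor: parseInt(match[2], 10),
    epoch: match[3]
  };
}

// Negative result: a is older than b. Zero: same version. Positive: a is newer.
// Versions with different epochs are treated as not comparable.
function compareChunkVersions(a, b) {
  if (a.epoch !== b.epoch) {
    throw new Error("epochs differ; versions are not comparable");
  }
  if (a.major !== b.major) {
    return a.major - b.major;
  }
  return a.minor - b.minor;
}

// Version found by the 14:46:42 refresh on mongos A.
var cached = parseChunkVersion("9743|18393||5320f5e96789f4d11460c4a0");
// Collection version reported by the failed split at 14:55:21.
var atFailure = parseChunkVersion("9743|18399||5320f5e96789f4d11460c4a0");

// True when the cached routing information is behind the collection version.
var isStale = compareChunkVersions(cached, atFailure) < 0;
```

      With the two versions from the logs above, the minor version moved from 18393 to 18399 while mongos A kept its older view, which matches the failed split.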

      The problem is that sometimes a mongos suddenly stops refreshing a collection and stays that way for a long time, until we restart it or force a refresh. In those cases, after a few days the mongos attempts bigger and bigger splits:

      2017-06-20T11:39:15.052+0200 I SHARDING [conn2766148] warning: log line attempted (53kB) over max size (10kB), printing beginning and end ... Split chunk { splitChunk: "stats.collectionName", configdb: "csReplSet/172.16.18.28:27025,172.16.18.3:27025,172.16.18.30:27025", from: "shardName", keyPattern: { _id: 1.0 }, shardVersion: [ Timestamp 3000|18274, ObjectId('5667717d46b7ddcd61ef5459') ], min: { _id: { d: 20170611, a: 111111, b: 2222222, c: 333, d: 333 } }, max: { _id: MaxKey }, splitKeys: [ ...... VERY LONG KEYS LIST ...... ] } failed :: caused by :: IncompatibleShardingMetadata: Unable to find chunk with the exact bounds [{ _id: { d: 20170611, a: 111111, b: 2222222, c: 333, d: 333 } }, { _id: MaxKey }) at collection version 3|19540||5667717d46b7ddcd61ef5459
      

      "_id.d" is the insert date, here 20170611, but as you can see the log entry is dated 2017-06-20. That is a gap of 9 days, i.e. 9 days of failed split attempts. During this period we found no chunk refresh in the logs for the affected collection. These big split attempts slow our shards down considerably (long splitVector queries on the primary members), which is very troublesome for us.
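
      The 9-day figure can be checked with a short calculation. This is only a sketch: it assumes "_id.d" is a yyyymmdd integer and uses the calendar date as written at the start of the log timestamp; the helper names are hypothetical.

```javascript
// Convert a yyyymmdd integer (the "_id.d" insert date) to a UTC Date.
function dateFromYyyymmdd(value) {
  var year = Math.floor(value / 10000);
  var month = Math.floor(value / 100) % 100;
  var day = value % 100;
  return new Date(Date.UTC(year, month - 1, day));
}

// Take the calendar date (first 10 characters, "YYYY-MM-DD") of a log timestamp.
function dateFromLogTimestamp(timestamp) {
  var parts = timestamp.slice(0, 10).split("-");
  return new Date(Date.UTC(
    parseInt(parts[0], 10),
    parseInt(parts[1], 10) - 1,
    parseInt(parts[2], 10)
  ));
}

// Whole days from start to end (both are midnight UTC dates).
function daysBetween(start, end) {
  return Math.round((end - start) / 86400000);
}

var gapInDays = daysBetween(
  dateFromYyyymmdd(20170611),
  dateFromLogTimestamp("2017-06-20T11:39:15.052+0200")
);
```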

      So we have to regularly run db.adminCommand("flushRouterConfig") on the mongos to force a refresh.
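
      A minimal sketch of that workaround as a reusable function, for illustration only. It assumes the argument behaves like the mongo shell's db object (a synchronous adminCommand that returns a result document with an ok field); the name forceRouterRefresh is hypothetical.

```javascript
// Ask a mongos to drop its cached routing table so that it reloads
// chunk metadata from the config servers on the next operation.
// "database" is assumed to behave like the mongo shell's `db` object.
function forceRouterRefresh(database) {
  var result = database.adminCommand({ flushRouterConfig: 1 });
  if (!result || result.ok !== 1) {
    throw new Error("flushRouterConfig failed: " + JSON.stringify(result));
  }
  return result;
}

// In a mongo shell connected to a mongos, this would be called as:
//   forceRouterRefresh(db);
```

      The command has to be run against each mongos separately, since each one keeps its own cache. How often to run it is left open here; it only works around the missing refresh and does not fix it.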

      Thank you in advance for your help.

      Best regards,
      Slawomir
