[SERVER-56116] Balancing failed when moving big collection Created: 15/Apr/21  Updated: 11/May/21  Resolved: 11/May/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Wernfried Domscheit Assignee: Eric Sedor
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

I would like to archive my data as described in "Tiered Hardware for Varying SLA or SLO".

My sharded cluster looks like this:

db.getSiblingDB("config").shards.find({}, { tags: 1 })
{ "_id" : "shard_01", "tags" : ["recent"] }
{ "_id" : "shard_02", "tags" : ["recent"] }
{ "_id" : "shard_03", "tags" : ["recent"] }
{ "_id" : "shard_04", "tags" : ["archive"] }
 
db.getSiblingDB("config").collections.find({ _id: "data.sessions.20210412.zoned" }, { key: 1 })
{
   "_id": "data.sessions.20210412.zoned",
   "key": { "tsi": 1.0, "si": 1.0 }
}
 
db.getSiblingDB("data").getCollection("sessions.20210412.zoned").getShardDistribution()
 
Shard shard_03 at shard_03/d-mipmdb-sh1-03.swi.srse.net:27018,d-mipmdb-sh2-03.swi.srse.net:27018
 data : 63.18GiB docs : 16202743 chunks : 2701
 estimated data per chunk : 23.95MiB
 estimated docs per chunk : 5998
 
Shard shard_02 at shard_02/d-mipmdb-sh1-02.swi.srse.net:27018,d-mipmdb-sh2-02.swi.srse.net:27018
 data : 55.6GiB docs : 14259066 chunks : 2367
 estimated data per chunk : 24.05MiB
 estimated docs per chunk : 6024
 
Shard shard_01 at shard_01/d-mipmdb-sh1-01.swi.srse.net:27018,d-mipmdb-sh2-01.swi.srse.net:27018
 data : 68.92GiB docs : 23896624 chunks : 3034
 estimated data per chunk : 23.26MiB
 estimated docs per chunk : 7876
 
Totals
 data : 187.72GiB docs : 54358433 chunks : 8102
 Shard shard_03 contains 33.66% data, 29.8% docs in cluster, avg obj size on shard : 4KiB
 Shard shard_02 contains 29.62% data, 26.23% docs in cluster, avg obj size on shard : 4KiB
 Shard shard_01 contains 36.71% data, 43.96% docs in cluster, avg obj size on shard : 3KiB

In order to trigger the migration I use:

sh.disableBalancing('data.sessions.20210412.zoned')
if (db.getSiblingDB("config").migrations.findOne({ ns: 'data.sessions.20210412.zoned' }) == null) {
   // Passing null as the zone removes any existing range for the collection,
   // then the full MinKey..MaxKey range is assigned to the 'archive' zone
   sh.updateZoneKeyRange('data.sessions.20210412.zoned', { "tsi": MinKey, "si": MinKey }, { "tsi": MaxKey, "si": MaxKey }, null)
   sh.updateZoneKeyRange('data.sessions.20210412.zoned', { "tsi": MinKey, "si": MinKey }, { "tsi": MaxKey, "si": MaxKey }, 'archive')
}
sh.enableBalancing('data.sessions.20210412.zoned')
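
For reference, the resulting zone assignment can be checked on the config server. This is a diagnostic query only, assuming the standard `config.tags` / `config.shards` layout:

```javascript
// Verify that the full-range zone assignment was recorded (mongo shell)
db.getSiblingDB("config").tags.find(
  { ns: "data.sessions.20210412.zoned" }
).forEach(printjson)
// There should be one document with tag "archive" spanning
// { tsi: MinKey, si: MinKey } .. { tsi: MaxKey, si: MaxKey }

// Confirm at least one shard is a member of that zone
db.getSiblingDB("config").shards.find({ tags: "archive" })
```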

I don't get any error and the migration starts. However, in my logs (on the config server) I get thousands or even millions of these warnings:

{
  "t": {
    "$date": "2021-04-15T14:56:28.984+02:00"
  },
  "s": "W",
  "c": "SHARDING",
  "id": 21892,
  "ctx": "Balancer",
  "msg": "Chunk violates zone, but no appropriate recipient found",
  "attr": {
    "chunk": "{ ns: \"data.sessions.20210412.zoned\", min: { tsi: \"194.230.147.157\", si: \"10.38.15.1\" }, max: { tsi: \"194.230.147.157\", si: \"10.40.230.198\" }, shard: \"shard_03\", lastmod: Timestamp(189, 28), lastmodEpoch: ObjectId('60780e581ad069faafa363ba'), jumbo: false }",
    "zone": "archive"
  }
}

The file system filled up to 100% and MongoDB stopped working.

How can this be? `MinKey` / `MaxKey` should cover all values.
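
The warning itself suggests the zone range is fine but no shard can receive the chunks. The following is a simplified, hypothetical sketch of the balancer's per-chunk recipient check (the names and the `canAcceptChunks` flag are assumptions, not actual server code); it illustrates why log id 21892 would repeat once per chunk per balancing round when the only "archive" shard cannot accept chunks (e.g. draining, or out of disk space):

```javascript
// Hypothetical sketch of the balancer's per-chunk recipient selection.
// A shard qualifies if it is tagged with the zone, is not the chunk's
// current shard, and is able to accept chunks.
function findRecipientForZone(shards, zone, currentShardId) {
  return shards.find(
    (s) => s.id !== currentShardId && s.tags.includes(zone) && s.canAcceptChunks
  ) || null;
}

const shards = [
  { id: "shard_01", tags: ["recent"], canAcceptChunks: true },
  { id: "shard_02", tags: ["recent"], canAcceptChunks: true },
  { id: "shard_03", tags: ["recent"], canAcceptChunks: true },
  // If shard_04 cannot accept chunks for any reason, every chunk that
  // violates the "archive" zone produces the same warning:
  { id: "shard_04", tags: ["archive"], canAcceptChunks: false },
];

// 8102 chunks (the collection's total above), all on "recent" shards
const chunks = Array.from({ length: 8102 }, (_, i) => ({
  id: i,
  shard: ["shard_01", "shard_02", "shard_03"][i % 3],
}));

let warnings = 0;
for (const chunk of chunks) {
  // All chunks are now tagged "archive" but live on "recent" shards
  if (!findRecipientForZone(shards, "archive", chunk.shard)) {
    warnings++; // "Chunk violates zone, but no appropriate recipient found"
  }
}
console.log(warnings); // one warning per chunk, per balancing round
```

Under this (assumed) model the zone range is irrelevant: the warning flood comes from the destination shard being unusable, which matches Eric's comment below about shard_04 as a destination.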

 Comments   
Comment by Eric Sedor [ 11/May/21 ]

Understood, wernfried.domscheit@sunrise.net. I'm going to close this ticket for now, but we are happy to reopen it or address a new one if you experience the problem again.

Thank you,
Eric

Comment by Wernfried Domscheit [ 05/May/21 ]

Today I tried the procedure again and it worked as expected, so I am not able to reproduce this error.

I don't know whether I should be happy about this or not, because I would not like to run into this issue suddenly in production.

I have no idea whether we should keep this ticket open.

Best Regards
Wernfried

 

Comment by Wernfried Domscheit [ 29/Apr/21 ]

I am currently busy with some other topics; I think I can provide the required information next week.

Comment by Eric Sedor [ 27/Apr/21 ]

Hi wernfried.domscheit@sunrise.net,

The error message you've found suggests that there isn't a problem with the range tagging, but rather a problem identifying shard_04 as a destination. It's not immediately clear that this is the result of a bug, but we can take an initial look.

Can you please provide a mongodump of the config database as well as the logs from the config server?

Comment by Wernfried Domscheit [ 21/Apr/21 ]

Just a note: when I tested this procedure with a smaller collection (i.e. 1 million documents), it all worked fine. But it fails with a bigger collection of 50M documents.

Best Regards
Wernfried

Generated at Thu Feb 08 05:38:23 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.