[SERVER-16582] Chunk Migration Failing Repeatedly on Initial Balancing Round Created: 18/Dec/14  Updated: 24/Jan/15  Resolved: 18/Dec/14

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.8.0-rc2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: William Cross Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive config_dump.zip     Text File mongos.log    
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:
  • At bash prompt:

    mlaunch init --sharded 3 --replicaset
    mongo

    Note: I'm using mlaunch version 1.1.6

  • At the mongo shell:

    for (i = 1; i <= 1000; i++) {
        x = [];
        for (j = 1; j <= 1000; j++) {
            x.push({ a : i, b : j, c : 1000 * i + j, _id : 1000 * i + j });
        }
        db.foo.insert(x);
    }
     
    db.foo.ensureIndex( { a : 1, b : 1 }, { name : "first" } )
    db.foo.ensureIndex( { b : 1 }, { name : "second" } )
    sh.enableSharding("test")
    sh.shardCollection("test.foo", { b : 1 } )

  • Wait a minute or two, and then run sh.status().
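
  • (Optional) Balancer activity can also be traced from the shell by querying the config.changelog collection, as in the sketch below (this assumes the shell is connected to the mongos started by mlaunch):

    // Sketch only: list the most recent chunk-migration events recorded
    // in the cluster metadata.
    db.getSiblingDB("config").changelog.find(
        { what : /moveChunk/ }
    ).sort({ time : -1 }).limit(5).pretty()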

Expected result: Data is in several chunks, and the load is balanced with no errors.

Actual result: Migration failures occur as the balancer starts its first round: "Failed with error 'chunk too big to move', from shard01 to shard03" and "Failed with error 'chunk too big to move', from shard01 to shard02". The chunks do, however, seem to eventually get where they need to go.

Here is the output of sh.status() partway through:

mongos> sh.status()
--- Sharding Status ---
  sharding version: {
	"_id" : 1,
	"minCompatibleVersion" : 5,
	"currentVersion" : 6,
	"clusterId" : ObjectId("5492159f53be077898567039")
}
  shards:
	{  "_id" : "shard01",  "host" : "shard01/cross-mb-air.local:27018,cross-mb-air.local:27019,cross-mb-air.local:27020" }
	{  "_id" : "shard02",  "host" : "shard02/cross-mb-air.local:27021,cross-mb-air.local:27022,cross-mb-air.local:27023" }
	{  "_id" : "shard03",  "host" : "shard03/cross-mb-air.local:27024,cross-mb-air.local:27025,cross-mb-air.local:27026" }
  balancer:
	Currently enabled:  yes
	Currently running:  yes
		Balancer lock taken at undefined by undefined
	Collections with active migrations:
		test.foo started at Wed Dec 17 2014 19:05:28 GMT-0500 (EST)
	Failed balancer rounds in last 5 attempts:  0
	Migration Results for the last 24 hours:
		3 : Success
		2 : Failed with error 'chunk too big to move', from shard01 to shard03
		1 : Failed with error 'chunk too big to move', from shard01 to shard02
  databases:
	{  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
	{  "_id" : "test",  "partitioned" : true,  "primary" : "shard01" }
		test.foo
			shard key: { "b" : 1 }
			chunks:
				shard01	4
				shard02	2
				shard03	1
			{ "b" : { "$minKey" : 1 } } -->> { "b" : 1 } on : shard02 Timestamp(2, 0)
			{ "b" : 1 } -->> { "b" : 150 } on : shard03 Timestamp(3, 0)
			{ "b" : 150 } -->> { "b" : 300 } on : shard02 Timestamp(4, 0)
			{ "b" : 300 } -->> { "b" : 450 } on : shard01 Timestamp(4, 2)
			{ "b" : 450 } -->> { "b" : 600 } on : shard01 Timestamp(4, 3)
			{ "b" : 600 } -->> { "b" : 899 } on : shard01 Timestamp(1, 2)
			{ "b" : 899 } -->> { "b" : { "$maxKey" : 1 } } on : shard01 Timestamp(1, 3)
 
mongos>
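
For reference, the size of the range that refuses to move can be estimated with the dataSize command. The sketch below uses the { "b" : 600 } -->> { "b" : 899 } chunk bounds from the output above and assumes a shell connected directly to the donor shard's primary (shard01 here):

    // Sketch only: estimate how much data lies in one chunk's key range.
    db.runCommand({
        dataSize : "test.foo",
        keyPattern : { b : 1 },
        min : { b : 600 },
        max : { b : 899 },
        estimate : true
    })

If the reported size exceeds the configured maximum chunk size (64 MB by default), the balancer will keep reporting "chunk too big to move" until the range has been split further.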

Participants:

Description

The balancer appears to get stuck during its initial balancing round.



Comments
Comment by William Cross [ 18/Dec/14 ]

scotthernandez, I think I misunderstood what was happening. When I opened the ticket, I had thought the balancer was abandoning the balancing process entirely (though it was taking a while). In any case, I am closing the ticket.

Comment by Scott Hernandez (Inactive) [ 18/Dec/14 ]

I don't know what you mean; can you explain a bit more?

When sharding an existing collection, all the chunks begin on a single shard and are then distributed over time. The balancer and the sharding commands (shardCollection, enableSharding, splitCollection, etc.) are independent processes, so as soon as the chunk metadata exists the balancer will start working from it.
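
If the goal is to avoid that race when sharding an already-populated collection, one common approach is to pause the balancer, shard and pre-split the collection, and then re-enable balancing. A rough sketch using the standard shell helpers (the split points below are arbitrary examples for this data set, not a recommendation):

    // Sketch only: keep the balancer out of the way during the initial splits.
    sh.stopBalancer()
    sh.enableSharding("test")
    sh.shardCollection("test.foo", { b : 1 })
    sh.splitAt("test.foo", { b : 250 })   // example split points for b in [1, 1000]
    sh.splitAt("test.foo", { b : 500 })
    sh.splitAt("test.foo", { b : 750 })
    sh.startBalancer()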

Comment by William Cross [ 18/Dec/14 ]

Files attached.

Yes, they were jumbo chunks, but why was a migration attempted while the collection was still getting its initial set of chunk splits? Shouldn't the balancer wait a couple of minutes before trying and failing?

I'm not seeing this behavior in 2.6, but maybe I just happened to notice it first in 2.8.

Comment by Scott Hernandez (Inactive) [ 18/Dec/14 ]

Please provide logs and a dump of the config database. At first glance this looks like totally normal behavior if there are jumbo chunks.
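
For reference, a dump of the config database can be taken with mongodump pointed at the mongos; a minimal sketch (the host and port below are the mlaunch defaults and may differ):

    mongodump --host localhost --port 27017 --db config --out config_dump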
