[SERVER-16805] Balancer should not migrate chunks to shards that cannot satisfy w:majority Created: 12/Jan/15  Updated: 21/Sep/23  Resolved: 04/Apr/23

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.6.6
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Kevin Pulo Assignee: [DO NOT USE] Backlog - Sharding EMEA
Resolution: Won't Do Votes: 2
Labels: PM229, balancer, sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Sharding EMEA

 Description   

If a shard is currently unable to satisfy w:majority writes, it should not be considered as a target for balancer moves. Such migrations are highly likely to stall at pre-commit for an extended period (up to 10 hours), causing orphans and wasting time during which no other migrations can occur.
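For context, a shard in this state cannot acknowledge any w:majority write at all. A minimal way to see this (a hypothetical probe, assuming a shard whose only surviving data-bearing member is its primary, as in the repro below, run directly against that primary):

// Hypothetical probe from a mongo shell connected to the degraded shard's primary.
// The insert applies locally, but the majority write concern cannot be satisfied,
// so the WriteResult reports a writeConcernError once wtimeout expires.
db.probe.insert(
    { _id: "majority-check" },
    { writeConcern: { w: "majority", wtimeout: 5000 } }
)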

While the shard could become healthy and able to support w:majority during the migration, it is usually better to wait until that is known before starting the migration. If the problematic shard is avoided, the balancer can instead balance chunks among the remaining shards, and those migrations will have a much better chance of completing successfully.

This applies only to chunk moves initiated by the balancer. If a user knows that the target shard will be able to catch a majority up within 10 hours, they should still be able to run a manual moveChunk. The balancer, by contrast, has no way of knowing this, so in the absence of that knowledge it should assume the state of the system won't change, i.e. that the migration won't succeed.
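For reference, such a manual move (using the namespace and shard name from the repro below) would be along the lines of:

// Run from a mongos; only sensible if the operator knows the recipient shard
// can catch a majority up before the migration needs to commit.
sh.moveChunk("test.test", { _id: MinKey }, "shard02")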

To repro this (2.6.6):

$ mlaunch init --replicaset --nodes 2 --arbiter --smallfiles --oplogSize 100 --sharded 2 --config 1 --mongos 1 --port 33333
launching: mongod on port 33334
launching: mongod on port 33335
launching: mongod on port 33336
launching: mongod on port 33337
launching: mongod on port 33338
launching: mongod on port 33339
launching: config server on port 33340
replica set 'shard01' initialized.
replica set 'shard02' initialized.
launching: mongos on port 33333
adding shards. can take up to 30 seconds...
$ mlaunch stop shard02 secondary
1 node stopped.
$ mlaunch list
 
PROCESS          PORT     STATUS     PID
 
mongos           33333    running    5919
 
config server    33340    running    5826
 
shard01
    primary      33334    running    5463
    secondary    33335    running    5483
    arbiter      33336    running    5530
 
shard02
    primary      33337    running    5582
    mongod       33338    down       -
    arbiter      33339    running    5757
 
$ mongo --port 33333
MongoDB shell version: 2.6.6
connecting to: 127.0.0.1:33333/test
> sh.enableSharding("test")
{ "ok" : 1 }
> sh.shardCollection("test.test", {_id:1})
{ "collectionsharded" : "test.test", "ok" : 1 }
> db.test.insert({_id:-1})
WriteResult({ "nInserted" : 1 })
> db.test.insert({_id:1})
WriteResult({ "nInserted" : 1 })
> sh.splitAt("test.test", {_id:0})
{ "ok" : 1 }
> sleep(60*1000)
null
> db.getSiblingDB("config").locks.find()
{ "_id" : "configUpgrade", "state" : 0, "who" : "genique:33333:1421043589:1804289383:mongosMain:846930886", "ts" : ObjectId("54b3678501891d25c2a0de29"), "process" : "genique:33333:1421043589:1804289383", "when" : ISODate("2015-01-12T06:19:49.271Z"), "why" : "upgrading config database to new format v5" }
{ "_id" : "balancer", "state" : 2, "who" : "genique:33333:1421043589:1804289383:Balancer:1681692777", "ts" : ObjectId("54b3743d01891d25c2a0e051"), "process" : "genique:33333:1421043589:1804289383", "when" : ISODate("2015-01-12T07:14:05.817Z"), "why" : "doing balance round" }
{ "_id" : "test.test", "state" : 2, "who" : "genique:33334:1421046843:292657564:conn8:1901975678", "ts" : ObjectId("54b3743d26b588f243d1e4a7"), "process" : "genique:33334:1421046843:292657564", "when" : ISODate("2015-01-12T07:14:05.830Z"), "why" : "migrate-{ _id: MinKey }" }

The logfile on the shard02 primary confirms this:

$ grep migrateThread data/shard02/rs1/mongod.log | head -10
2015-01-12T18:14:05.843+1100 [migrateThread] starting receiving-end of migration of chunk { _id: MinKey } -> { _id: 0.0 } for collection test.test from shard01/genique:33334,genique:33335 at epoch 54b370c601891d25c2a0dfbc
2015-01-12T18:14:05.843+1100 [migrateThread] starting new replica set monitor for replica set shard01 with seeds genique:33334,genique:33335
2015-01-12T18:14:05.845+1100 [migrateThread] allocating new ns file /home/kev/testing/balancer-majority/data/shard02/rs1/db/test.ns, filling with zeroes...
2015-01-12T18:14:05.966+1100 [migrateThread] build index on: test.test properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "test.test" }
2015-01-12T18:14:05.967+1100 [migrateThread]     added index to empty collection
2015-01-12T18:14:05.978+1100 [migrateThread] Helpers::removeRangeUnlocked time spent waiting for replication: 0ms
2015-01-12T18:15:05.979+1100 [migrateThread] warning: secondaryThrottle on, but doc insert timed out after 60 seconds, continuing
2015-01-12T18:15:05.980+1100 [migrateThread] Waiting for replication to catch up before entering critical section
2015-01-12T18:15:06.980+1100 [migrateThread] Waiting for replication to catch up before entering critical section
2015-01-12T18:15:07.980+1100 [migrateThread] Waiting for replication to catch up before entering critical section
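
The stalled migrateThread can also be observed live on the recipient primary (a rough check; the desc string is assumed from the log prefix above):

// Run on the shard02 primary (port 33337); filters db.currentOp() output down
// to the migration thread that is waiting for majority replication.
db.currentOp().inprog.filter(function (op) { return op.desc === "migrateThread"; })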

In 2.8.0-rc4 the only difference is a 1-hour timeout on the initial rangeDeleter, after which the migration receive fails and is retried (I also disabled secondaryThrottle in this test):

$ grep migrateThread data/shard02/rs1/mongod.log | head -20
2015-01-12T21:31:51.871+1100 I SHARDING [migrateThread] starting receiving-end of migration of chunk { _id: MinKey } -> { _id: 0.0 } for collection test.test from shard01/genique:23334,genique:23335 at epoch 54b3a02aaea4b148f49e0882
2015-01-12T21:31:51.871+1100 I NETWORK  [migrateThread] starting new replica set monitor for replica set shard01 with seeds genique:23334,genique:23335
2015-01-12T21:31:51.874+1100 I INDEX    [migrateThread] allocating new ns file /home/kev/testing/balancer-majority-2.8/data/shard02/rs1/db/test.ns, filling with zeroes...
2015-01-12T21:31:51.993+1100 I INDEX    [migrateThread] build index on: test.test properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "test.test" }
2015-01-12T21:31:51.993+1100 I INDEX    [migrateThread]          building index using bulk method
2015-01-12T21:31:51.993+1100 I INDEX    [migrateThread] build index done.  scanned 0 total records. 0 secs
2015-01-12T21:31:51.993+1100 I SHARDING [migrateThread] Deleter starting delete for: test.test from { _id: MinKey } -> { _id: 0.0 }, with opId: 749
2015-01-12T21:31:51.993+1100 I SHARDING [migrateThread] rangeDeleter deleted 0 documents for test.test from { _id: MinKey } -> { _id: 0.0 }
2015-01-12T22:31:51.995+1100 I SHARDING [migrateThread] rangeDeleter timed out after 0 seconds while waiting for deletions to be replicated to majority nodes
2015-01-12T22:31:51.995+1100 W SHARDING [migrateThread] Failed to queue delete for migrate abort: rangeDeleter timed out after 0 seconds while waiting for deletions to be replicated to majority nodes
2015-01-12T22:31:51.996+1100 I NETWORK  [migrateThread] scoped connection to shard01/genique:23334,genique:23335 not being returned to the pool
2015-01-12T22:31:51.996+1100 I SHARDING [migrateThread] about to log metadata event: { _id: "genique-2015-01-12T11:31:51-54b3b0a75e62b2a409dceac3", server: "genique", clientAddr: ":27017", time: new Date(1421062311996), what: "moveChunk.to", ns: "test.test", details: { min: { _id: MinKey }, max: { _id: 0.0 }, step 1 of 5: 122, note: "aborted" } }
2015-01-12T22:31:52.001+1100 W SHARDING [migrateThread] cannot remove pending chunk [{ _id: MinKey }, { _id: 0.0 }), this shard does not contain the chunk
2015-01-12T22:31:52.001+1100 W SHARDING [migrateThread] cannot remove pending chunk [{ _id: MinKey }, { _id: 0.0 }), this shard does not contain the chunk
2015-01-12T22:32:03.032+1100 I SHARDING [migrateThread] starting receiving-end of migration of chunk { _id: MinKey } -> { _id: 0.0 } for collection test.test from shard01/genique:23334,genique:23335 at epoch 54b3a02aaea4b148f49e0882
2015-01-12T22:32:03.033+1100 I SHARDING [migrateThread] Deleter starting delete for: test.test from { _id: MinKey } -> { _id: 0.0 }, with opId: 6925
2015-01-12T22:32:03.033+1100 I SHARDING [migrateThread] rangeDeleter deleted 0 documents for test.test from { _id: MinKey } -> { _id: 0.0 }
2015-01-12T22:32:03.034+1100 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
2015-01-12T22:32:04.034+1100 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
2015-01-12T22:32:05.034+1100 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
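
One conceivable shape for the balancer-side check, purely as a rough sketch and not a claim about how the balancer is implemented: before selecting a recipient shard, inspect its replica set status and skip it if a majority of members cannot currently acknowledge writes. Expressed as shell JavaScript against a shard's primary:

// Rough sketch only (hypothetical helper, not balancer code): returns false when
// fewer data-bearing members are healthy than the replica set's majority,
// i.e. when a w:majority write on that shard would currently stall.
function shardCanSatisfyMajority(shardPrimaryConn) {
    var status = shardPrimaryConn.getDB("admin").runCommand({ replSetGetStatus: 1 });
    if (!status.ok) {
        return false;
    }
    var majority = Math.floor(status.members.length / 2) + 1;
    var writable = status.members.filter(function (m) {
        // PRIMARY (1) and SECONDARY (2) are the states that can acknowledge writes;
        // arbiters and down members cannot.
        return m.health === 1 && (m.state === 1 || m.state === 2);
    }).length;
    return writable >= majority;
}

// In the repro above, shard02 has one healthy data-bearing member out of a
// 3-member set (majority = 2), so this would return false and the balancer
// could skip it as a migration target:
//   shardCanSatisfyMajority(new Mongo("genique:33337"))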



 Comments   
Comment by Connie Chen [ 18/Jan/23 ]

Taking out of PM-631 and placing in "Needs Scheduling" since we have closed PM-631 as Won't Do.
