Type: Improvement
Resolution: Won't Do
Priority: Major - P3
Affects Version/s: 2.6.6
Component/s: Sharding
Sharding EMEA
If a shard is currently unable to perform w:majority writes, it should not be considered a target for balancer moves. Such migrations are highly likely to stall at pre-commit for an extended period (up to 10 hours), creating orphaned documents and wasting time during which no other migrations can occur.
While the shard could become healthy and able to support w:majority during the migration, it would usually be better to wait until that is known before starting the migration. If the problematic shard is avoided, the balancer can instead balance chunks among the remaining shards, where migrations have a much better chance of completing successfully.
This applies only to chunk moves initiated by the balancer. If a user knows that the target shard will be able to catch up to a majority within 10 hours, they should still be able to run a manual moveChunk. The balancer, by contrast, has no way of knowing this, so in the absence of that knowledge it should assume the state of the system won't change, i.e. that the migration won't succeed.
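As a rough sketch of the check this would require (the helper below is hypothetical, not part of MongoDB): before selecting a shard as a migration target, the balancer would need to know whether the shard's replica set can currently acknowledge a w:majority write, i.e. whether a majority of its members are healthy, data-bearing, and PRIMARY or SECONDARY.

// Hypothetical mongo shell helper, for illustration only.
// Simplification: assumes every member votes; arbiters count towards the
// majority size but can never acknowledge writes.
function canAckMajority(shardConn) {
    var status = shardConn.getDB("admin").runCommand({ replSetGetStatus: 1 });
    if (!status.ok) return false;
    // Healthy data-bearing members currently PRIMARY (1) or SECONDARY (2).
    var ackable = status.members.filter(function (m) {
        return m.health === 1 && (m.state === 1 || m.state === 2);
    }).length;
    var majority = Math.floor(status.members.length / 2) + 1;
    return ackable >= majority;
}
// e.g. canAckMajority(new Mongo("genique:33337")) returns false in the
// repro below while shard02's only data-bearing secondary is down.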
To repro this (2.6.6):
$ mlaunch init --replicaset --nodes 2 --arbiter --smallfiles --oplogSize 100 --sharded 2 --config 1 --mongos 1 --port 33333
launching: mongod on port 33334
launching: mongod on port 33335
launching: mongod on port 33336
launching: mongod on port 33337
launching: mongod on port 33338
launching: mongod on port 33339
launching: config server on port 33340
replica set 'shard01' initialized.
replica set 'shard02' initialized.
launching: mongos on port 33333
adding shards. can take up to 30 seconds...
$ mlaunch stop shard02 secondary
1 node stopped.
$ mlaunch list
PROCESS PORT STATUS PID
mongos 33333 running 5919
config server 33340 running 5826
shard01
primary 33334 running 5463
secondary 33335 running 5483
arbiter 33336 running 5530
shard02
primary 33337 running 5582
mongod 33338 down -
arbiter 33339 running 5757
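As an extra sanity check (assumed, not part of the original repro), a w:majority write against the shard02 primary should now time out, since its only data-bearing secondary is down; the collection name is arbitrary:
$ mongo --port 33337
> db.getSiblingDB("test").majorityCheck.insert(
      { _id: 1 },
      { writeConcern: { w: "majority", wtimeout: 5000 } }
  )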
$ mongo --port 33333
MongoDB shell version: 2.6.6
connecting to: 127.0.0.1:33333/test
> sh.enableSharding("test")
{ "ok" : 1 }
> sh.shardCollection("test.test", {_id:1})
{ "collectionsharded" : "test.test", "ok" : 1 }
> db.test.insert({_id:-1})
WriteResult({ "nInserted" : 1 })
> db.test.insert({_id:1})
WriteResult({ "nInserted" : 1 })
> sh.splitAt("test.test", {_id:0})
{ "ok" : 1 }
> sleep(60*1000)
null
> db.getSiblingDB("config").locks.find()
{ "_id" : "configUpgrade", "state" : 0, "who" : "genique:33333:1421043589:1804289383:mongosMain:846930886", "ts" : ObjectId("54b3678501891d25c2a0de29"), "process" : "genique:33333:1421043589:1804289383", "when" : ISODate("2015-01-12T06:19:49.271Z"), "why" : "upgrading config database to new format v5" }
{ "_id" : "balancer", "state" : 2, "who" : "genique:33333:1421043589:1804289383:Balancer:1681692777", "ts" : ObjectId("54b3743d01891d25c2a0e051"), "process" : "genique:33333:1421043589:1804289383", "when" : ISODate("2015-01-12T07:14:05.817Z"), "why" : "doing balance round" }
{ "_id" : "test.test", "state" : 2, "who" : "genique:33334:1421046843:292657564:conn8:1901975678", "ts" : ObjectId("54b3743d26b588f243d1e4a7"), "process" : "genique:33334:1421046843:292657564", "when" : ISODate("2015-01-12T07:14:05.830Z"), "why" : "migrate-{ _id: MinKey }" }
The logfile on the shard02 primary confirms that the receiving end of the migration is stuck waiting for replication:
$ grep migrateThread data/shard02/rs1/mongod.log | head -10
2015-01-12T18:14:05.843+1100 [migrateThread] starting receiving-end of migration of chunk { _id: MinKey } -> { _id: 0.0 } for collection test.test from shard01/genique:33334,genique:33335 at epoch 54b370c601891d25c2a0dfbc
2015-01-12T18:14:05.843+1100 [migrateThread] starting new replica set monitor for replica set shard01 with seeds genique:33334,genique:33335
2015-01-12T18:14:05.845+1100 [migrateThread] allocating new ns file /home/kev/testing/balancer-majority/data/shard02/rs1/db/test.ns, filling with zeroes...
2015-01-12T18:14:05.966+1100 [migrateThread] build index on: test.test properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "test.test" }
2015-01-12T18:14:05.967+1100 [migrateThread] added index to empty collection
2015-01-12T18:14:05.978+1100 [migrateThread] Helpers::removeRangeUnlocked time spent waiting for replication: 0ms
2015-01-12T18:15:05.979+1100 [migrateThread] warning: secondaryThrottle on, but doc insert timed out after 60 seconds, continuing
2015-01-12T18:15:05.980+1100 [migrateThread] Waiting for replication to catch up before entering critical section
2015-01-12T18:15:06.980+1100 [migrateThread] Waiting for replication to catch up before entering critical section
2015-01-12T18:15:07.980+1100 [migrateThread] Waiting for replication to catch up before entering critical section
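A replica set status check on the shard02 primary (assumed follow-up; port taken from the repro above) shows why the catch-up can never complete while the only data-bearing secondary is down:
$ mongo --port 33337 --eval 'rs.status().members.forEach(function (m) { print(m.name + " " + m.stateStr + " (health: " + m.health + ")"); })'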
In 2.8.0-rc4 the only difference is a 1-hour timeout on the initial rangeDeleter, after which the receiving end of the migration fails and is retried (I also disabled secondaryThrottle in this test):
$ grep migrateThread data/shard02/rs1/mongod.log | head -20
2015-01-12T21:31:51.871+1100 I SHARDING [migrateThread] starting receiving-end of migration of chunk { _id: MinKey } -> { _id: 0.0 } for collection test.test from shard01/genique:23334,genique:23335 at epoch 54b3a02aaea4b148f49e0882
2015-01-12T21:31:51.871+1100 I NETWORK [migrateThread] starting new replica set monitor for replica set shard01 with seeds genique:23334,genique:23335
2015-01-12T21:31:51.874+1100 I INDEX [migrateThread] allocating new ns file /home/kev/testing/balancer-majority-2.8/data/shard02/rs1/db/test.ns, filling with zeroes...
2015-01-12T21:31:51.993+1100 I INDEX [migrateThread] build index on: test.test properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "test.test" }
2015-01-12T21:31:51.993+1100 I INDEX [migrateThread] building index using bulk method
2015-01-12T21:31:51.993+1100 I INDEX [migrateThread] build index done. scanned 0 total records. 0 secs
2015-01-12T21:31:51.993+1100 I SHARDING [migrateThread] Deleter starting delete for: test.test from { _id: MinKey } -> { _id: 0.0 }, with opId: 749
2015-01-12T21:31:51.993+1100 I SHARDING [migrateThread] rangeDeleter deleted 0 documents for test.test from { _id: MinKey } -> { _id: 0.0 }
2015-01-12T22:31:51.995+1100 I SHARDING [migrateThread] rangeDeleter timed out after 0 seconds while waiting for deletions to be replicated to majority nodes
2015-01-12T22:31:51.995+1100 W SHARDING [migrateThread] Failed to queue delete for migrate abort: rangeDeleter timed out after 0 seconds while waiting for deletions to be replicated to majority nodes
2015-01-12T22:31:51.996+1100 I NETWORK [migrateThread] scoped connection to shard01/genique:23334,genique:23335 not being returned to the pool
2015-01-12T22:31:51.996+1100 I SHARDING [migrateThread] about to log metadata event: { _id: "genique-2015-01-12T11:31:51-54b3b0a75e62b2a409dceac3", server: "genique", clientAddr: ":27017", time: new Date(1421062311996), what: "moveChunk.to", ns: "test.test", details: { min: { _id: MinKey }, max: { _id: 0.0 }, step 1 of 5: 122, note: "aborted" } }
2015-01-12T22:31:52.001+1100 W SHARDING [migrateThread] cannot remove pending chunk [{ _id: MinKey }, { _id: 0.0 }), this shard does not contain the chunk
2015-01-12T22:31:52.001+1100 W SHARDING [migrateThread] cannot remove pending chunk [{ _id: MinKey }, { _id: 0.0 }), this shard does not contain the chunk
2015-01-12T22:32:03.032+1100 I SHARDING [migrateThread] starting receiving-end of migration of chunk { _id: MinKey } -> { _id: 0.0 } for collection test.test from shard01/genique:23334,genique:23335 at epoch 54b3a02aaea4b148f49e0882
2015-01-12T22:32:03.033+1100 I SHARDING [migrateThread] Deleter starting delete for: test.test from { _id: MinKey } -> { _id: 0.0 }, with opId: 6925
2015-01-12T22:32:03.033+1100 I SHARDING [migrateThread] rangeDeleter deleted 0 documents for test.test from { _id: MinKey } -> { _id: 0.0 }
2015-01-12T22:32:03.034+1100 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
2015-01-12T22:32:04.034+1100 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
2015-01-12T22:32:05.034+1100 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
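For completeness: one way to disable secondaryThrottle for balancer migrations, as was done for the 2.8.0-rc4 test above (the exact method used there isn't shown, so this is an assumption), is the balancer document in config.settings, run against the mongos:
> db.getSiblingDB("config").settings.update(
      { _id: "balancer" },
      { $set: { _secondaryThrottle: false } },
      { upsert: true }
  )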