Core Server / SERVER-16805

Balancer should not migrate chunks to shards that cannot satisfy w:majority

    • Component: Sharding EMEA

      If a shard is currently unable to satisfy w:majority writes, it should not be considered as a target for balancer moves. Such migrations are highly likely to stall for an extended period (up to 10 hours) at pre-commit, causing orphaned documents and wasted time (during which no other migrations can occur).
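
      For illustration, "able to satisfy w:majority" could be approximated by counting the healthy data-bearing members of the shard's replica set. The following is a mongo shell sketch of the kind of check being proposed, not actual balancer code:

      function canSatisfyMajority(conn) {
          // Rough approximation, ignoring per-member vote counts: a majority of all
          // members must be reachable *and* data-bearing (arbiters vote but cannot
          // acknowledge writes, so they are excluded from the healthy count).
          var status = conn.getDB("admin").runCommand({ replSetGetStatus: 1 });
          if (!status.ok) return false;
          var majority = Math.floor(status.members.length / 2) + 1;
          var healthyDataBearing = status.members.filter(function (m) {
              return m.health !== 0 && (m.state === 1 || m.state === 2);  // PRIMARY or SECONDARY
          }).length;
          return healthyDataBearing >= majority;
      }
      // e.g. against the shard02 primary in the repro below (port 33337):
      //   canSatisfyMajority(new Mongo("localhost:33337"))   // => false (1 healthy, 2 needed)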

      While the shard could become healthy and able to support w:majority during the migration, it is usually better to wait until that is known before starting the migration. If the problematic shard is avoided, the balancer can instead balance chunks among the remaining shards, since those migrations will have a much better chance of completing successfully.
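
      In the meantime, one hypothetical operator-level workaround (not part of this ticket, and assuming tag-aware sharding is acceptable for the deployment) is to tag only the healthy shards and pin the collection's key range to that tag, so the balancer leaves the unhealthy shard alone until it recovers:

      // Run on a mongos; shard/collection names as in the repro below.
      sh.addShardTag("shard01", "healthy");
      sh.addTagRange("test.test", { _id: MinKey }, { _id: MaxKey }, "healthy");

      // Once shard02 is healthy again, drop the restriction:
      //   db.getSiblingDB("config").tags.remove({ ns: "test.test", tag: "healthy" });
      //   sh.removeShardTag("shard01", "healthy");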

      This applies only to chunk moves initiated by the balancer. If a user knows that the target shard will be able to catch a majority up within 10 hours, then they should still be able to run a manual moveChunk (see the example below). By contrast, the balancer has no way of knowing this, and in the absence of that knowledge it should assume the state of the system won't change, i.e. that the migration won't succeed.
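
      For example, such a manual move (shard name and namespace as in the repro below) would be something like:

      // On a mongos: explicitly move a chunk to shard02, with secondaryThrottle
      // so the donor waits for the recipient's secondaries while cloning documents.
      db.adminCommand({
          moveChunk: "test.test",
          find: { _id: -1 },
          to: "shard02",
          _secondaryThrottle: true
      });
      // or equivalently: sh.moveChunk("test.test", { _id: -1 }, "shard02")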

      To repro this (2.6.6):

      $ mlaunch init --replicaset --nodes 2 --arbiter --smallfiles --oplogSize 100 --sharded 2 --config 1 --mongos 1 --port 33333
      launching: mongod on port 33334
      launching: mongod on port 33335
      launching: mongod on port 33336
      launching: mongod on port 33337
      launching: mongod on port 33338
      launching: mongod on port 33339
      launching: config server on port 33340
      replica set 'shard01' initialized.
      replica set 'shard02' initialized.
      launching: mongos on port 33333
      adding shards. can take up to 30 seconds...
      $ mlaunch stop shard02 secondary
      1 node stopped.
      $ mlaunch list
      
      PROCESS          PORT     STATUS     PID
      
      mongos           33333    running    5919
      
      config server    33340    running    5826
      
      shard01
          primary      33334    running    5463
          secondary    33335    running    5483
          arbiter      33336    running    5530
      
      shard02
          primary      33337    running    5582
          mongod       33338    down       -
          arbiter      33339    running    5757
      
      $ mongo --port 33333
      MongoDB shell version: 2.6.6
      connecting to: 127.0.0.1:33333/test
      > sh.enableSharding("test")
      { "ok" : 1 }
      > sh.shardCollection("test.test", {_id:1})
      { "collectionsharded" : "test.test", "ok" : 1 }
      > db.test.insert({_id:-1})
      WriteResult({ "nInserted" : 1 })
      > db.test.insert({_id:1})
      WriteResult({ "nInserted" : 1 })
      > sh.splitAt("test.test", {_id:0})
      { "ok" : 1 }
      > sleep(60*1000)
      null
      > db.getSiblingDB("config").locks.find()
      { "_id" : "configUpgrade", "state" : 0, "who" : "genique:33333:1421043589:1804289383:mongosMain:846930886", "ts" : ObjectId("54b3678501891d25c2a0de29"), "process" : "genique:33333:1421043589:1804289383", "when" : ISODate("2015-01-12T06:19:49.271Z"), "why" : "upgrading config database to new format v5" }
      { "_id" : "balancer", "state" : 2, "who" : "genique:33333:1421043589:1804289383:Balancer:1681692777", "ts" : ObjectId("54b3743d01891d25c2a0e051"), "process" : "genique:33333:1421043589:1804289383", "when" : ISODate("2015-01-12T07:14:05.817Z"), "why" : "doing balance round" }
      { "_id" : "test.test", "state" : 2, "who" : "genique:33334:1421046843:292657564:conn8:1901975678", "ts" : ObjectId("54b3743d26b588f243d1e4a7"), "process" : "genique:33334:1421046843:292657564", "when" : ISODate("2015-01-12T07:14:05.830Z"), "why" : "migrate-{ _id: MinKey }" }
      

      The logfile on the shard02 primary confirms this:

      $ grep migrateThread data/shard02/rs1/mongod.log | head -10
      2015-01-12T18:14:05.843+1100 [migrateThread] starting receiving-end of migration of chunk { _id: MinKey } -> { _id: 0.0 } for collection test.test from shard01/genique:33334,genique:33335 at epoch 54b370c601891d25c2a0dfbc
      2015-01-12T18:14:05.843+1100 [migrateThread] starting new replica set monitor for replica set shard01 with seeds genique:33334,genique:33335
      2015-01-12T18:14:05.845+1100 [migrateThread] allocating new ns file /home/kev/testing/balancer-majority/data/shard02/rs1/db/test.ns, filling with zeroes...
      2015-01-12T18:14:05.966+1100 [migrateThread] build index on: test.test properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "test.test" }
      2015-01-12T18:14:05.967+1100 [migrateThread]     added index to empty collection
      2015-01-12T18:14:05.978+1100 [migrateThread] Helpers::removeRangeUnlocked time spent waiting for replication: 0ms
      2015-01-12T18:15:05.979+1100 [migrateThread] warning: secondaryThrottle on, but doc insert timed out after 60 seconds, continuing
      2015-01-12T18:15:05.980+1100 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-12T18:15:06.980+1100 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-12T18:15:07.980+1100 [migrateThread] Waiting for replication to catch up before entering critical section
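
      The repeated "Waiting for replication to catch up" lines line up with the fact that shard02 cannot currently acknowledge any w:majority write, which can be confirmed directly against its primary (port 33337 in this repro), e.g.:

      var shard02 = new Mongo("localhost:33337");
      // Expected to fail with a writeConcernError (replication timeout) while
      // the only data-bearing secondary is down.
      shard02.getDB("test").probe.insert(
          { probe: 1 },
          { writeConcern: { w: "majority", wtimeout: 5000 } }
      )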
      

      In 2.8.0-rc4 the only difference is a 1-hour timeout for the initial rangeDeleter, after which the migration receive fails and is retried (I also disabled secondaryThrottle in this test):

      $ grep migrateThread data/shard02/rs1/mongod.log | head -20
      2015-01-12T21:31:51.871+1100 I SHARDING [migrateThread] starting receiving-end of migration of chunk { _id: MinKey } -> { _id: 0.0 } for collection test.test from shard01/genique:23334,genique:23335 at epoch 54b3a02aaea4b148f49e0882
      2015-01-12T21:31:51.871+1100 I NETWORK  [migrateThread] starting new replica set monitor for replica set shard01 with seeds genique:23334,genique:23335
      2015-01-12T21:31:51.874+1100 I INDEX    [migrateThread] allocating new ns file /home/kev/testing/balancer-majority-2.8/data/shard02/rs1/db/test.ns, filling with zeroes...
      2015-01-12T21:31:51.993+1100 I INDEX    [migrateThread] build index on: test.test properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "test.test" }
      2015-01-12T21:31:51.993+1100 I INDEX    [migrateThread]          building index using bulk method
      2015-01-12T21:31:51.993+1100 I INDEX    [migrateThread] build index done.  scanned 0 total records. 0 secs
      2015-01-12T21:31:51.993+1100 I SHARDING [migrateThread] Deleter starting delete for: test.test from { _id: MinKey } -> { _id: 0.0 }, with opId: 749
      2015-01-12T21:31:51.993+1100 I SHARDING [migrateThread] rangeDeleter deleted 0 documents for test.test from { _id: MinKey } -> { _id: 0.0 }
      2015-01-12T22:31:51.995+1100 I SHARDING [migrateThread] rangeDeleter timed out after 0 seconds while waiting for deletions to be replicated to majority nodes
      2015-01-12T22:31:51.995+1100 W SHARDING [migrateThread] Failed to queue delete for migrate abort: rangeDeleter timed out after 0 seconds while waiting for deletions to be replicated to majority nodes
      2015-01-12T22:31:51.996+1100 I NETWORK  [migrateThread] scoped connection to shard01/genique:23334,genique:23335 not being returned to the pool
      2015-01-12T22:31:51.996+1100 I SHARDING [migrateThread] about to log metadata event: { _id: "genique-2015-01-12T11:31:51-54b3b0a75e62b2a409dceac3", server: "genique", clientAddr: ":27017", time: new Date(1421062311996), what: "moveChunk.to", ns: "test.test", details: { min: { _id: MinKey }, max: { _id: 0.0 }, step 1 of 5: 122, note: "aborted" } }
      2015-01-12T22:31:52.001+1100 W SHARDING [migrateThread] cannot remove pending chunk [{ _id: MinKey }, { _id: 0.0 }), this shard does not contain the chunk
      2015-01-12T22:31:52.001+1100 W SHARDING [migrateThread] cannot remove pending chunk [{ _id: MinKey }, { _id: 0.0 }), this shard does not contain the chunk
      2015-01-12T22:32:03.032+1100 I SHARDING [migrateThread] starting receiving-end of migration of chunk { _id: MinKey } -> { _id: 0.0 } for collection test.test from shard01/genique:23334,genique:23335 at epoch 54b3a02aaea4b148f49e0882
      2015-01-12T22:32:03.033+1100 I SHARDING [migrateThread] Deleter starting delete for: test.test from { _id: MinKey } -> { _id: 0.0 }, with opId: 6925
      2015-01-12T22:32:03.033+1100 I SHARDING [migrateThread] rangeDeleter deleted 0 documents for test.test from { _id: MinKey } -> { _id: 0.0 }
      2015-01-12T22:32:03.034+1100 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-12T22:32:04.034+1100 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-12T22:32:05.034+1100 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
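
      For reference, secondaryThrottle was disabled for that 2.8.0-rc4 run via the balancer settings document, i.e. something like:

      // On a mongos: disable secondaryThrottle for balancer migrations.
      db.getSiblingDB("config").settings.update(
          { _id: "balancer" },
          { $set: { _secondaryThrottle: false } },
          { upsert: true }
      )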
      

            Assignee: Backlog - Sharding EMEA (backlog-server-sharding-emea)
            Reporter: Kevin Pulo (kevin.pulo@mongodb.com)
            Votes: 2
            Watchers: 7
