Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-16837

Migrations should time out after 10 hrs waiting for replication

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 2.8.0-rc4
    • Fix Version/s: 3.0.0-rc6
    • Component/s: Internal Code, Sharding
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Steps To Reproduce:
      Hide
      • Start a 2 shard cluster with each shard being 2 data bearers and an arbiter.
      • Shutdown the secondary on the second shard.
      • db.getSiblingDB("config").settings.update( { "_id" : "balancer" }, { $set : { "_secondaryThrottle" : false } }, { upsert : true })
        sh.enableSharding("test")
        sh.shardCollection("test.test", {_id:1})
        db.test.insert({_id:-1})
        db.test.insert({_id:1})
        sh.splitAt("test.test", {_id:0})

      • Watch the logs and check the "when" field in the locks.

        db.getSiblingDB("config").locks.find()

      Show
      Start a 2 shard cluster with each shard being 2 data bearers and an arbiter. Shutdown the secondary on the second shard. db.getSiblingDB("config").settings.update( { "_id" : "balancer" }, { $set : { "_secondaryThrottle" : false } }, { upsert : true }) sh.enableSharding("test") sh.shardCollection("test.test", {_id:1}) db.test.insert({_id:-1}) db.test.insert({_id:1}) sh.splitAt("test.test", {_id:0}) Watch the logs and check the "when" field in the locks. db.getSiblingDB("config").locks.find()

      Description

      2.8.0-rc4 does not time out migrations after they have been waiting for the pre-commit w:majority for 10 hours. This is a regression from 2.6, where such migrations are correctly aborted after 10 hours.

      Full logs and config dumps are attached. Relevant bits + output are below:

      2.8.0-rc4

      2015-01-13T02:11:12.428+0000 I SHARDING [migrateThread] starting receiving-end of migration of chunk { _id: MinKey } -> { _id: 0.0 } for collection test.test from shard_0/kev1-0.kevincm1.8331.mongodbdns.com:28000,kev1-0.kevincm1.8331.mongodbdns.com:28001 at epoch 54b4704956127074c739b3f6
      2015-01-13T02:11:12.429+0000 I SHARDING [migrateThread] Deleter starting delete for: test.test from { _id: MinKey } -> { _id: 0.0 }, with opId: 31603
      2015-01-13T02:11:12.429+0000 I SHARDING [migrateThread] rangeDeleter deleted 0 documents for test.test from { _id: MinKey } -> { _id: 0.0 }
      2015-01-13T02:11:12.429+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T02:11:13.430+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T02:11:14.430+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      ...
      2015-01-13T12:11:04.816+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T12:11:05.816+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T12:11:05.816+0000 W SHARDING [migrateThread] migrate commit waiting for a majority of slaves for 'test.test' { _id: MinKey } -> { _id: 0.0 } waiting for: 54b47ec0:1
      2015-01-13T12:11:06.816+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T12:11:07.817+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T12:11:08.817+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T12:11:09.817+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T12:11:10.817+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T12:11:11.817+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T12:11:12.818+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T12:11:13.818+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T12:11:14.818+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T12:11:15.818+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T12:11:16.818+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T12:11:17.818+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T12:11:18.819+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T12:11:19.819+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T12:11:20.819+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T12:11:20.819+0000 W SHARDING [migrateThread] migrate commit waiting for a majority of slaves for 'test.test' { _id: MinKey } -> { _id: 0.0 } waiting for: 54b47ec0:1
      2015-01-13T12:11:21.819+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T12:11:22.819+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      ...
      2015-01-14T02:47:43.070+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-14T02:47:44.071+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-14T02:47:45.071+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-14T02:47:46.071+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-14T02:47:46.071+0000 W SHARDING [migrateThread] migrate commit waiting for a majority of slaves for 'test.test' { _id: MinKey } -> { _id: 0.0 } waiting for: 54b47ec0:1
      2015-01-14T02:47:47.071+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-14T02:47:48.071+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-14T02:47:49.071+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-14T02:47:50.072+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-14T02:47:51.072+0000 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section

      mongos> printjson(ISODate()); db.getSiblingDB("config").locks.find()
      ISODate("2015-01-13T04:11:44.821Z")
      { "_id" : "configUpgrade", "state" : 0, "who" : "kev1-0:27017:1421111141:1804289383:mongosMain:846930886", "ts" : ObjectId("54b46f6556127074c739b3bf"), "process" : "kev1-0:27017:1421111141:1804289383", "when" : ISODate("2015-01-13T01:05:41.265Z"), "why" : "upgrading config database to new format v6" }
      { "_id" : "balancer", "state" : 2, "who" : "kev1-0:27017:1421111141:1804289383:Balancer:1681692777", "ts" : ObjectId("54b47ec056127074c739b40c"), "process" : "kev1-0:27017:1421111141:1804289383", "when" : ISODate("2015-01-13T02:11:12.413Z"), "why" : "doing balance round" }
      { "_id" : "test.test", "state" : 2, "who" : "kev1-0:28000:1421111452:808384670:conn14:1257153344", "ts" : ObjectId("54b47ec073e56bbf64caa124"), "process" : "kev1-0:28000:1421111452:808384670", "when" : ISODate("2015-01-13T02:11:12.422Z"),
      "why" : "migrating chunk [{ : MinKey }, { : MaxKey }) in test.test" }
      mongos> printjson(ISODate()); db.getSiblingDB("config").locks.find()
      ISODate("2015-01-14T02:24:24.884Z")
      { "_id" : "configUpgrade", "state" : 0, "who" : "kev1-0:27017:1421111141:1804289383:mongosMain:846930886", "ts" : ObjectId("54b46f6556127074c739b3bf"), "process" : "kev1-0:27017:1421111141:1804289383", "when" : ISODate("2015-01-13T01:05:41.265Z"), "why" : "upgrading config database to new format v6" }
      { "_id" : "balancer", "state" : 2, "who" : "kev1-0:27017:1421111141:1804289383:Balancer:1681692777", "ts" : ObjectId("54b47ec056127074c739b40c"), "process" : "kev1-0:27017:1421111141:1804289383", "when" : ISODate("2015-01-13T02:11:12.413Z"), "why" : "doing balance round" }
      { "_id" : "test.test", "state" : 2, "who" : "kev1-0:28000:1421111452:808384670:conn14:1257153344", "ts" : ObjectId("54b47ec073e56bbf64caa124"), "process" : "kev1-0:28000:1421111452:808384670", "when" : ISODate("2015-01-13T02:11:12.422Z"),
      "why" : "migrating chunk [{ : MinKey }, { : MaxKey }) in test.test" }

      NOTE the when field for the balancer and collection locks hasn't changed in 24 hours.

      2.6.6

      2015-01-13T06:03:31.460+0000 [migrateThread] starting receiving-end of migration of chunk { _id: MinKey } -> { _id: 0.0 } for collection test.test from shard26_0/kev1-0.kevincm1.8331.mongodbdns.com:38000,kev1-0.kevincm1.8331.mongodbdns.com:38001 at epoch 54b4b3437d941a7cc774cb02
      2015-01-13T06:03:31.460+0000 [migrateThread] starting new replica set monitor for replica set shard26_0 with seeds kev1-0.kevincm1.8331.mongodbdns.com:38000,kev1-0.kevincm1.8331.mongodbdns.com:38001
      2015-01-13T06:03:31.462+0000 [migrateThread] allocating new ns file /data/balancer-majority-timeout-26_shard26_1_96/test.ns, filling with zeroes...
      2015-01-13T06:03:31.575+0000 [migrateThread] build index on: test.test properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "test.test" }
      2015-01-13T06:03:31.575+0000 [migrateThread]     added index to empty collection
      2015-01-13T06:03:31.576+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T06:03:32.576+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T06:03:33.577+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      ...
      2015-01-13T16:03:22.105+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T16:03:23.105+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T16:03:23.105+0000 [migrateThread] warning: migrate commit waiting for 2 slaves for 'test.test' { _id: MinKey } -> { _id: 0.0 } waiting for: 54b4b533:2
      2015-01-13T16:03:24.105+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T16:03:25.105+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T16:03:26.105+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T16:03:27.106+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T16:03:28.106+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T16:03:29.106+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T16:03:30.106+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T16:03:31.106+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T16:03:32.903+0000 [migrateThread] warning: migrate commit waiting for 2 slaves for 'test.test' { _id: MinKey } -> { _id: 0.0 } waiting for: 54b4b533:2
      2015-01-13T16:03:32.905+0000 [migrateThread] warning: migrate commit waiting for 2 slaves for 'test.test' { _id: MinKey } -> { _id: 0.0 } waiting for: 54b4b533:2
      2015-01-13T16:03:32.906+0000 [migrateThread] warning: migrate commit waiting for 2 slaves for 'test.test' { _id: MinKey } -> { _id: 0.0 } waiting for: 54b4b533:2
      ...
      2015-01-13T16:04:02.894+0000 [migrateThread] warning: migrate commit waiting for 2 slaves for 'test.test' { _id: MinKey } -> { _id: 0.0 } waiting for: 54b4b533:2
      2015-01-13T16:04:02.895+0000 [migrateThread] warning: migrate commit waiting for 2 slaves for 'test.test' { _id: MinKey } -> { _id: 0.0 } waiting for: 54b4b533:2
      2015-01-13T16:04:02.897+0000 [migrateThread] warning: migrate commit waiting for 2 slaves for 'test.test' { _id: MinKey } -> { _id: 0.0 } waiting for: 54b4b533:2
      2015-01-13T16:04:02.898+0000 [migrateThread] scoped connection to shard26_0/kev1-0.kevincm1.8331.mongodbdns.com:38000,kev1-0.kevincm1.8331.mongodbdns.com:38001 not being returned to the pool
      2015-01-13T16:04:02.898+0000 [migrateThread] about to log metadata event: { _id: "kev1-0-2015-01-13T16:04:02-54b541f29fcdf3ea989d3fae", server: "kev1-0", clientAddr: ":27017", time: new Date(1421165042898), what: "moveChunk.to", ns: "test.test", details: { min: { _id: MinKey }, max: { _id: 0.0 }, step 1 of 5: 115, step 2 of 5: 0, step 3 of 5: 0, step 4 of 5: 0, note: "aborted", errmsg: "timed out waiting for commit" } }
      2015-01-13T16:04:08.924+0000 [migrateThread] starting receiving-end of migration of chunk { _id: MinKey } -> { _id: 0.0 } for collection test.test from shard26_0/kev1-0.kevincm1.8331.mongodbdns.com:38000,kev1-0.kevincm1.8331.mongodbdns.com:38001 at epoch 54b4b3437d941a7cc774cb02
      2015-01-13T16:04:08.926+0000 [migrateThread] warning: moveChunkCmd deleted data already in chunk # objects: 1
      2015-01-13T16:04:08.926+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T16:04:09.926+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T16:04:09.927+0000 [migrateThread] warning: migrate commit waiting for 2 slaves for 'test.test' { _id: MinKey } -> { _id: 0.0 } waiting for: 54b541f8:3
      2015-01-13T16:04:10.927+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T16:04:11.927+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-13T16:04:12.927+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      ...
      2015-01-14T02:04:00.576+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-14T02:04:01.577+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-14T02:04:01.577+0000 [migrateThread] warning: migrate commit waiting for 2 slaves for 'test.test' { _id: MinKey } -> { _id: 0.0 } waiting for: 54b541f8:3
      2015-01-14T02:04:02.577+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-14T02:04:03.577+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-14T02:04:04.577+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-14T02:04:05.577+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-14T02:04:06.578+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-14T02:04:07.578+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-14T02:04:08.578+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-14T02:04:09.733+0000 [migrateThread] warning: migrate commit waiting for 2 slaves for 'test.test' { _id: MinKey } -> { _id: 0.0 } waiting for: 54b541f8:3
      2015-01-14T02:04:09.734+0000 [migrateThread] warning: migrate commit waiting for 2 slaves for 'test.test' { _id: MinKey } -> { _id: 0.0 } waiting for: 54b541f8:3
      2015-01-14T02:04:09.735+0000 [migrateThread] warning: migrate commit waiting for 2 slaves for 'test.test' { _id: MinKey } -> { _id: 0.0 } waiting for: 54b541f8:3
      ...
      2015-01-14T02:04:39.723+0000 [migrateThread] warning: migrate commit waiting for 2 slaves for 'test.test' { _id: MinKey } -> { _id: 0.0 } waiting for: 54b541f8:3
      2015-01-14T02:04:39.725+0000 [migrateThread] warning: migrate commit waiting for 2 slaves for 'test.test' { _id: MinKey } -> { _id: 0.0 } waiting for: 54b541f8:3
      2015-01-14T02:04:39.727+0000 [migrateThread] warning: migrate commit waiting for 2 slaves for 'test.test' { _id: MinKey } -> { _id: 0.0 } waiting for: 54b541f8:3
      2015-01-14T02:04:39.727+0000 [migrateThread] scoped connection to shard26_0/kev1-0.kevincm1.8331.mongodbdns.com:38000,kev1-0.kevincm1.8331.mongodbdns.com:38001 not being returned to the pool
      2015-01-14T02:04:39.727+0000 [migrateThread] about to log metadata event: { _id: "kev1-0-2015-01-14T02:04:39-54b5ceb79fcdf3ea989d3faf", server: "kev1-0", clientAddr: ":27017", time: new Date(1421201079727), what: "moveChunk.to", ns: "test.test", details: { min: { _id: MinKey }, max: { _id: 0.0 }, step 1 of 5: 1, step 2 of 5: 0, step 3 of 5: 0, step 4 of 5: 0, note: "aborted", errmsg: "timed out waiting for commit" } }
      2015-01-14T02:04:45.754+0000 [migrateThread] starting receiving-end of migration of chunk { _id: MinKey } -> { _id: 0.0 } for collection test.test from shard26_0/kev1-0.kevincm1.8331.mongodbdns.com:38000,kev1-0.kevincm1.8331.mongodbdns.com:38001 at epoch 54b4b3437d941a7cc774cb02
      2015-01-14T02:04:45.754+0000 [migrateThread] warning: moveChunkCmd deleted data already in chunk # objects: 1
      2015-01-14T02:04:45.755+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-14T02:04:46.755+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      2015-01-14T02:04:47.755+0000 [migrateThread] Waiting for replication to catch up before entering critical section
      ...

      mongos> printjson(ISODate()); db.getSiblingDB("config").locks.find()
      ISODate("2015-01-13T08:03:21.334Z")
      { "_id" : "configUpgrade", "state" : 0, "who" : "kev1-0:37017:1421128191:1804289383:mongosMain:846930886", "ts" : ObjectId("54b4b1ff7d941a7cc774cac3"), "process" : "kev1-0:37017:1421128191:1804289383", "when" : ISODate("2015-01-13T05:49:51.669Z"), "why" : "upgrading config database to new format v5" }
      { "_id" : "balancer", "state" : 2, "who" : "kev1-0:37017:1421128191:1804289383:Balancer:1681692777", "ts" : ObjectId("54b4b5337d941a7cc774cb56"), "process" : "kev1-0:37017:1421128191:1804289383", "when" : ISODate("2015-01-13T06:03:31.442Z"), "why" : "doing balance round" }
      { "_id" : "test.test", "state" : 2, "who" : "kev1-0:38000:1421129010:1611849556:conn23:745631594", "ts" : ObjectId("54b4b53363588f4e90ed5a12"), "process" : "kev1-0:38000:1421129010:1611849556", "when" : ISODate("2015-01-13T06:03:31.451Z"), "why" : "migrate-{ _id: MinKey }" }
      mongos> printjson(ISODate()); db.getSiblingDB("config").locks.find()
      ISODate("2015-01-14T01:41:32.572Z")
      { "_id" : "configUpgrade", "state" : 0, "who" : "kev1-0:37017:1421128191:1804289383:mongosMain:846930886", "ts" : ObjectId("54b4b1ff7d941a7cc774cac3"), "process" : "kev1-0:37017:1421128191:1804289383", "when" : ISODate("2015-01-13T05:49:51.669Z"), "why" : "upgrading config database to new format v5" }
      { "_id" : "balancer", "state" : 2, "who" : "kev1-0:37017:1421128191:1804289383:Balancer:1681692777", "ts" : ObjectId("54b541f87d941a7cc774cb57"), "process" : "kev1-0:37017:1421128191:1804289383", "when" : ISODate("2015-01-13T16:04:08.907Z"), "why" : "doing balance round" }
      { "_id" : "test.test", "state" : 2, "who" : "kev1-0:38000:1421129010:1611849556:conn23:745631594", "ts" : ObjectId("54b541f863588f4e90ed5a15"), "process" : "kev1-0:38000:1421129010:1611849556", "when" : ISODate("2015-01-13T16:04:08.916Z"), "why" : "migrate-{ _id: MinKey }" }
      mongos> printjson(ISODate()); db.getSiblingDB("config").locks.find()
      ISODate("2015-01-14T02:53:25.690Z")
      { "_id" : "configUpgrade", "state" : 0, "who" : "kev1-0:37017:1421128191:1804289383:mongosMain:846930886", "ts" : ObjectId("54b4b1ff7d941a7cc774cac3"), "process" : "kev1-0:37017:1421128191:1804289383", "when" : ISODate("2015-01-13T05:49:51.669Z"), "why" : "upgrading config database to new format v5" }
      { "_id" : "balancer", "state" : 2, "who" : "kev1-0:37017:1421128191:1804289383:Balancer:1681692777", "ts" : ObjectId("54b5cebd7d941a7cc774cb58"), "process" : "kev1-0:37017:1421128191:1804289383", "when" : ISODate("2015-01-14T02:04:45.736Z"), "why" : "doing balance round" }
      { "_id" : "test.test", "state" : 2, "who" : "kev1-0:38000:1421129010:1611849556:conn23:745631594", "ts" : ObjectId("54b5cebd63588f4e90ed5a18"), "process" : "kev1-0:38000:1421129010:1611849556", "when" : ISODate("2015-01-14T02:04:45.746Z"), "why" : "migrate-{ _id: MinKey }" }

      NOTE the when field for the balancer and collection locks changes every 10 hours when the migration fails and restarts.

        Attachments

        1. 2.6.6.tar.gz
          4.75 MB
        2. 2.8.0-rc4.tar.gz
          3.91 MB

          Issue Links

            Activity

              People

              Assignee:
              renctan Randolph Tan
              Reporter:
              kevin.pulo Kevin Pulo
              Participants:
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved: