Core Server / SERVER-11301

moveChunk command failing in some sharding tests on slow hosts

    • Fully Compatible
    • Windows

      This affects the following test scripts:

      • jstests/sharding/cleanup_orphaned_cmd_prereload.js
      • jstests/sharding/features1.js
      • jstests/sharding/key_many.js

      When a moveChunk command is issued while the destination mongod is still finishing a previous migration, the new moveChunk fails with a "migrate already in progress" error.
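
      For illustration, the failure can surface when back-to-back migrations target the same shard. In the sketch below, "st" is assumed to be a ShardingTest, and the namespace, shard key, and shard name are placeholders rather than values from the failing tests:

          // Illustrative only: "st", the namespace, and the shard name are placeholders.
          var admin = st.s.getDB("admin");

          // The first migration to shard0000 succeeds from the donor's point of view...
          assert.commandWorked(admin.runCommand(
              {moveChunk: "test.coll", find: {a: 1}, to: "shard0000"}));

          // ...but on a slow host the recipient mongod may still be finishing that
          // migration when the next request arrives, so this call can fail with
          // "migrate already in progress".
          var res = admin.runCommand({moveChunk: "test.coll", find: {a: 10}, to: "shard0000"});
          printjson(res);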

      Resolution:

      Especially when the moveChunk command is not the primary goal of the test, it is acceptable to wrap it in limited retry logic to accommodate slow hosts, along the lines sketched below.
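
      A minimal sketch of such retry logic, as it might appear in a jstest (the helper name, retry count, and sleep interval are illustrative, not the committed fix):

          // Hypothetical helper: retry moveChunk a bounded number of times when the
          // TO-shard reports "migrate already in progress" (e.g. on slow hosts).
          function moveChunkWithRetry(adminDb, ns, find, toShard, maxAttempts) {
              maxAttempts = maxAttempts || 5;
              var lastResult;
              for (var attempt = 1; attempt <= maxAttempts; attempt++) {
                  lastResult = adminDb.runCommand({moveChunk: ns, find: find, to: toShard});
                  if (lastResult.ok) {
                      return lastResult;
                  }
                  // Only retry the specific transient failure; fail fast on anything else.
                  if (!/migrate already in progress/.test(tojson(lastResult))) {
                      break;
                  }
                  print("moveChunk attempt " + attempt + " failed, retrying: " + tojson(lastResult));
                  sleep(1000);  // give the TO-shard time to finish the previous migration
              }
              assert(false, "moveChunk failed after " + maxAttempts + " attempts: " + tojson(lastResult));
          }

          // Usage in a test:
          // moveChunkWithRetry(st.s.getDB("admin"), "test.foo6", {a: 3}, "localhost:30000");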

      3b87e21f1151 Windows 32-bit sharding : features1.js (logfile)

       m30999| 2013-10-22T04:17:42.826+0000 [conn1] CMD: shardcollection: { shardcollection: "test.foo6", key: { a: 1.0 } }
       m30999| 2013-10-22T04:17:42.827+0000 [conn1] enable sharding on: test.foo6 with shard key: { a: 1.0 }
       m30999| 2013-10-22T04:17:42.827+0000 [conn1] about to log metadata event: { _id: "EC2AMAZ-ZVRDQB5-2013-10-22T04:17:42-5265fc6639672851e14c5983", server: "EC2AMAZ-ZVRDQB5", clientAddr: "N/A", time: new Date(1382415462827), what: "shardCollection.start", ns: "test.foo6", details: { shardKey: { a: 1.0 }, collection: "test.foo6", primary: "shard0001:localhost:30001", initShards: [], numChunks: 1 } }
       m30999| 2013-10-22T04:17:42.828+0000 [conn1] going to create 1 chunk(s) for: test.foo6 using new epoch 5265fc6639672851e14c5984
       m30999| 2013-10-22T04:17:42.829+0000 [conn1] ChunkManager: time to load chunks for test.foo6: 0ms sequenceNumber: 10 version: 1|0||5265fc6639672851e14c5984 based on: (empty)
       m30999| 2013-10-22T04:17:42.830+0000 [conn1]     setShardVersion  shard0001 localhost:30001  test.foo6  { setShardVersion: "test.foo6", configdb: "localhost:30000", version: Timestamp 1000|0, versionEpoch: ObjectId('5265fc6639672851e14c5984'), serverID: ObjectId('5265fc6039672851e14c5973'), shard: "shard0001", shardHost: "localhost:30001" } 00B2D5B0 10
       m30999| 2013-10-22T04:17:42.830+0000 [conn1]        setShardVersion failed!
       m30999| { oldVersion: Timestamp 0|0, oldVersionEpoch: ObjectId('000000000000000000000000'), ns: "test.foo6", need_authoritative: true, ok: 0.0, errmsg: "first time for collection 'test.foo6'" }
       m30999| 2013-10-22T04:17:42.831+0000 [conn1]     setShardVersion  shard0001 localhost:30001  test.foo6  { setShardVersion: "test.foo6", configdb: "localhost:30000", version: Timestamp 1000|0, versionEpoch: ObjectId('5265fc6639672851e14c5984'), serverID: ObjectId('5265fc6039672851e14c5973'), authoritative: true, shard: "shard0001", shardHost: "localhost:30001" } 00B2D5B0 10
       m30001| 2013-10-22T04:17:42.831+0000 [conn3] remotely refreshing metadata for test.foo6 with requested shard version 1|0||5265fc6639672851e14c5984, current shard version is 0|0||000000000000000000000000
       m30001| 2013-10-22T04:17:42.833+0000 [conn3] loaded new metadata for test.foo6 (loaded version : 1|0||5265fc6639672851e14c5984, stored version : 0|0||000000000000000000000000, took 1ms)
       m30999| 2013-10-22T04:17:42.833+0000 [conn1]       setShardVersion success: { oldVersion: Timestamp 0|0, oldVersionEpoch: ObjectId('000000000000000000000000'), ok: 1.0 }
       m30999| 2013-10-22T04:17:42.833+0000 [conn1] about to log metadata event: { _id: "EC2AMAZ-ZVRDQB5-2013-10-22T04:17:42-5265fc6639672851e14c5985", server: "EC2AMAZ-ZVRDQB5", clientAddr: "N/A", time: new Date(1382415462833), what: "shardCollection", ns: "test.foo6", details: { version: "1|0||5265fc6639672851e14c5984" } }
       m30999| 2013-10-22T04:17:42.835+0000 [conn1] splitting chunk [{ a: MinKey },{ a: MaxKey }) in collection test.foo6 on shard shard0001
       m30001| 2013-10-22T04:17:42.836+0000 [conn4] received splitChunk request: { splitChunk: "test.foo6", keyPattern: { a: 1.0 }, min: { a: MinKey }, max: { a: MaxKey }, from: "shard0001", splitKeys: [ { a: 2.0 } ], shardId: "test.foo6-a_MinKey", configdb: "localhost:30000" }
       m30001| 2013-10-22T04:17:42.838+0000 [conn4] distributed lock 'test.foo6/EC2AMAZ-ZVRDQB5:30001:1382415456:41' acquired, ts : 5265fc66ee4bb720b15bc871
       m30001| 2013-10-22T04:17:42.839+0000 [conn4] remotely refreshing metadata for test.foo6 based on current shard version 1|0||5265fc6639672851e14c5984
       m30001| 2013-10-22T04:17:42.841+0000 [conn4] newer metadata not found for test.foo6 (loaded version : 1|0||5265fc6639672851e14c5984, stored version : 1|0||5265fc6639672851e14c5984, took 1ms)
       m30001| 2013-10-22T04:17:42.841+0000 [conn4] splitChunk accepted at version 1|0||5265fc6639672851e14c5984
       m30001| 2013-10-22T04:17:42.842+0000 [conn4] about to log metadata event: { _id: "EC2AMAZ-ZVRDQB5-2013-10-22T04:17:42-5265fc66ee4bb720b15bc872", server: "EC2AMAZ-ZVRDQB5", clientAddr: "127.0.0.1:55273", time: new Date(1382415462842), what: "split", ns: "test.foo6", details: { before: { min: { a: MinKey }, max: { a: MaxKey }, lastmod: Timestamp 1000|0, lastmodEpoch: ObjectId('000000000000000000000000') }, left: { min: { a: MinKey }, max: { a: 2.0 }, lastmod: Timestamp 1000|1, lastmodEpoch: ObjectId('5265fc6639672851e14c5984') }, right: { min: { a: 2.0 }, max: { a: MaxKey }, lastmod: Timestamp 1000|2, lastmodEpoch: ObjectId('5265fc6639672851e14c5984') } } }
       m30001| 2013-10-22T04:17:42.845+0000 [conn4] distributed lock 'test.foo6/EC2AMAZ-ZVRDQB5:30001:1382415456:41' unlocked. 
       m30999| 2013-10-22T04:17:42.846+0000 [conn1] ChunkManager: time to load chunks for test.foo6: 0ms sequenceNumber: 11 version: 1|2||5265fc6639672851e14c5984 based on: 1|0||5265fc6639672851e14c5984
       m30999| 2013-10-22T04:17:42.849+0000 [conn1] CMD: movechunk: { movechunk: "test.foo6", find: { a: 3.0 }, to: "localhost:30000" }
       m30999| 2013-10-22T04:17:42.849+0000 [conn1] moving chunk ns: test.foo6 moving ( ns: test.foo6, shard: shard0001:localhost:30001, lastmod: 1|2||000000000000000000000000, min: { a: 2.0 }, max: { a: MaxKey }) shard0001:localhost:30001 -> shard0000:localhost:30000
       m30001| 2013-10-22T04:17:42.850+0000 [conn4] received moveChunk request: { moveChunk: "test.foo6", from: "localhost:30001", to: "localhost:30000", fromShard: "shard0001", toShard: "shard0000", min: { a: 2.0 }, max: { a: MaxKey }, maxChunkSizeBytes: 52428800, shardId: "test.foo6-a_2.0", configdb: "localhost:30000", secondaryThrottle: false, waitForDelete: false }
       m30001| 2013-10-22T04:17:42.853+0000 [conn4] distributed lock 'test.foo6/EC2AMAZ-ZVRDQB5:30001:1382415456:41' acquired, ts : 5265fc66ee4bb720b15bc873
       m30001| 2013-10-22T04:17:42.853+0000 [conn4] about to log metadata event: { _id: "EC2AMAZ-ZVRDQB5-2013-10-22T04:17:42-5265fc66ee4bb720b15bc874", server: "EC2AMAZ-ZVRDQB5", clientAddr: "127.0.0.1:55273", time: new Date(1382415462853), what: "moveChunk.start", ns: "test.foo6", details: { min: { a: 2.0 }, max: { a: MaxKey }, from: "shard0001", to: "shard0000" } }
       m30001| 2013-10-22T04:17:42.854+0000 [conn4] remotely refreshing metadata for test.foo6 based on current shard version 1|2||5265fc6639672851e14c5984
       m30001| 2013-10-22T04:17:42.855+0000 [conn4] newer metadata not found for test.foo6 (loaded version : 1|2||5265fc6639672851e14c5984, stored version : 1|2||5265fc6639672851e14c5984, took 1ms)
       m30001| 2013-10-22T04:17:42.856+0000 [conn4] moveChunk request accepted at version 1|2||5265fc6639672851e14c5984
       m30001| 2013-10-22T04:17:42.856+0000 [conn4] moveChunk number of documents: 2
       m30001| 2013-10-22T04:17:42.856+0000 [conn4] warning: moveChunk failed to engage TO-shard in the data transfer: migrate already in progress
       m30001| 2013-10-22T04:17:42.856+0000 [conn4] MigrateFromStatus::done About to acquire global write lock to exit critical section
       m30001| 2013-10-22T04:17:42.856+0000 [conn4] MigrateFromStatus::done Global lock acquired
       m30001| 2013-10-22T04:17:42.857+0000 [conn4] distributed lock 'test.foo6/EC2AMAZ-ZVRDQB5:30001:1382415456:41' unlocked. 
       m30001| 2013-10-22T04:17:42.857+0000 [conn4] about to log metadata event: { _id: "EC2AMAZ-ZVRDQB5-2013-10-22T04:17:42-5265fc66ee4bb720b15bc875", server: "EC2AMAZ-ZVRDQB5", clientAddr: "127.0.0.1:55273", time: new Date(1382415462857), what: "moveChunk.from", ns: "test.foo6", details: { min: { a: 2.0 }, max: { a: MaxKey }, step 1 of 6: 0, step 2 of 6: 5, note: "aborted" } }
       m30999| 2013-10-22T04:17:42.857+0000 [conn1] moveChunk result: { cause: { ok: 0.0, errmsg: "migrate already in progress" }, ok: 0.0, errmsg: "moveChunk failed to engage TO-shard in the data transfer: migrate already in progress" }
       m30001| 2013-10-22T04:17:42.862+0000 [RangeDeleter] Deleter starting delete for: test.foo4 from { num: 10.0 } -> { num: MaxKey }, with opId: 126
       m30001| 2013-10-22T04:17:42.862+0000 [RangeDeleter] rangeDeleter deleted 0 documents for test.foo4 from { num: 10.0 } -> { num: MaxKey }
      2013-10-22T04:17:42.872+0000 command {
      	"movechunk" : "test.foo6",
      	"find" : {
      		"a" : 3
      	},
      	"to" : "localhost:30000"
      } failed: {
      	"cause" : {
      		"cause" : {
      			"ok" : 0,
      			"errmsg" : "migrate already in progress"
      		},
      		"ok" : 0,
      		"errmsg" : "moveChunk failed to engage TO-shard in the data transfer: migrate already in progress"
      	},
      	"ok" : 0,
      	"errmsg" : "move failed"
      } at src/mongo/shell/shardingtest.js:558
      failed to load: D:\data\mci\git@github.commongodb\mongo.git\master\jstests\sharding\features1.js
      

            Assignee: Benety Goh
            Reporter: Matt Kangas
            Votes: 0
            Watchers: 1
