[SERVER-11301] moveChunk command failing in some sharding tests on slow hosts Created: 22/Oct/13  Updated: 11/Jul/16  Resolved: 01/Nov/13

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 2.5.4

Type: Bug Priority: Major - P3
Reporter: Matt Kangas Assignee: Benety Goh
Resolution: Done Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-10232 better migration stats to detect TO-s... Closed
is related to SERVER-11122 moveChunk fails in sharding/hash_basi... Closed
Backwards Compatibility: Fully Compatible
Operating System: Windows
Participants:

 Description   

This affects the following test scripts:

  • jstests/sharding/cleanup_orphaned_cmd_prereload.js
  • jstests/sharding/features1.js
  • jstests/sharding/key_many.js

When a moveChunk command targets a destination mongod that is still finishing a previous moveChunk request, the new moveChunk command fails with a "migrate already in progress" error.

Resolution:

Especially when the moveChunk command is not the primary goal of the test, it is acceptable to wrap the command in limited retry logic to accommodate slow hosts.
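A minimal sketch of such retry logic, in the style of a jstests helper. This is illustrative only, not the actual patch; the names moveChunkWithRetry, runMoveChunk, and sleepFn are assumptions, and the caller would supply a closure that issues the real moveChunk admin command:

```javascript
// Sketch of limited retry logic for moveChunk on slow hosts (illustrative,
// not the committed fix). runMoveChunk is a caller-supplied closure, e.g.
// () => db.adminCommand({moveChunk: "test.foo6", find: {a: 3}, to: "shard0000"}),
// and sleepFn is a sleep function (the shell's sleep, or a stub in tests).
function moveChunkWithRetry(runMoveChunk, maxAttempts, sleepFn) {
    let res;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        res = runMoveChunk();
        if (res.ok === 1) {
            return res; // migration succeeded
        }
        // Only the transient TO-shard race is worth retrying; surface any
        // other failure immediately so real bugs are not masked.
        if (!JSON.stringify(res).includes("migrate already in progress")) {
            return res;
        }
        if (attempt < maxAttempts) {
            // Give the destination shard time to finish the prior migration.
            sleepFn(1000);
        }
    }
    return res; // retries exhausted; caller decides whether to fail the test
}
```

Bounding the attempts keeps a genuinely stuck migration from hanging the test indefinitely, while absorbing the brief window where the TO-shard is still cleaning up.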

3b87e21f1151 Windows 32-bit sharding : features1.js (logfile)

 m30999| 2013-10-22T04:17:42.826+0000 [conn1] CMD: shardcollection: { shardcollection: "test.foo6", key: { a: 1.0 } }
 m30999| 2013-10-22T04:17:42.827+0000 [conn1] enable sharding on: test.foo6 with shard key: { a: 1.0 }
 m30999| 2013-10-22T04:17:42.827+0000 [conn1] about to log metadata event: { _id: "EC2AMAZ-ZVRDQB5-2013-10-22T04:17:42-5265fc6639672851e14c5983", server: "EC2AMAZ-ZVRDQB5", clientAddr: "N/A", time: new Date(1382415462827), what: "shardCollection.start", ns: "test.foo6", details: { shardKey: { a: 1.0 }, collection: "test.foo6", primary: "shard0001:localhost:30001", initShards: [], numChunks: 1 } }
 m30999| 2013-10-22T04:17:42.828+0000 [conn1] going to create 1 chunk(s) for: test.foo6 using new epoch 5265fc6639672851e14c5984
 m30999| 2013-10-22T04:17:42.829+0000 [conn1] ChunkManager: time to load chunks for test.foo6: 0ms sequenceNumber: 10 version: 1|0||5265fc6639672851e14c5984 based on: (empty)
 m30999| 2013-10-22T04:17:42.830+0000 [conn1]     setShardVersion  shard0001 localhost:30001  test.foo6  { setShardVersion: "test.foo6", configdb: "localhost:30000", version: Timestamp 1000|0, versionEpoch: ObjectId('5265fc6639672851e14c5984'), serverID: ObjectId('5265fc6039672851e14c5973'), shard: "shard0001", shardHost: "localhost:30001" } 00B2D5B0 10
 m30999| 2013-10-22T04:17:42.830+0000 [conn1]        setShardVersion failed!
 m30999| { oldVersion: Timestamp 0|0, oldVersionEpoch: ObjectId('000000000000000000000000'), ns: "test.foo6", need_authoritative: true, ok: 0.0, errmsg: "first time for collection 'test.foo6'" }
 m30999| 2013-10-22T04:17:42.831+0000 [conn1]     setShardVersion  shard0001 localhost:30001  test.foo6  { setShardVersion: "test.foo6", configdb: "localhost:30000", version: Timestamp 1000|0, versionEpoch: ObjectId('5265fc6639672851e14c5984'), serverID: ObjectId('5265fc6039672851e14c5973'), authoritative: true, shard: "shard0001", shardHost: "localhost:30001" } 00B2D5B0 10
 m30001| 2013-10-22T04:17:42.831+0000 [conn3] remotely refreshing metadata for test.foo6 with requested shard version 1|0||5265fc6639672851e14c5984, current shard version is 0|0||000000000000000000000000
 m30001| 2013-10-22T04:17:42.833+0000 [conn3] loaded new metadata for test.foo6 (loaded version : 1|0||5265fc6639672851e14c5984, stored version : 0|0||000000000000000000000000, took 1ms)
 m30999| 2013-10-22T04:17:42.833+0000 [conn1]       setShardVersion success: { oldVersion: Timestamp 0|0, oldVersionEpoch: ObjectId('000000000000000000000000'), ok: 1.0 }
 m30999| 2013-10-22T04:17:42.833+0000 [conn1] about to log metadata event: { _id: "EC2AMAZ-ZVRDQB5-2013-10-22T04:17:42-5265fc6639672851e14c5985", server: "EC2AMAZ-ZVRDQB5", clientAddr: "N/A", time: new Date(1382415462833), what: "shardCollection", ns: "test.foo6", details: { version: "1|0||5265fc6639672851e14c5984" } }
 m30999| 2013-10-22T04:17:42.835+0000 [conn1] splitting chunk [{ a: MinKey },{ a: MaxKey }) in collection test.foo6 on shard shard0001
 m30001| 2013-10-22T04:17:42.836+0000 [conn4] received splitChunk request: { splitChunk: "test.foo6", keyPattern: { a: 1.0 }, min: { a: MinKey }, max: { a: MaxKey }, from: "shard0001", splitKeys: [ { a: 2.0 } ], shardId: "test.foo6-a_MinKey", configdb: "localhost:30000" }
 m30001| 2013-10-22T04:17:42.838+0000 [conn4] distributed lock 'test.foo6/EC2AMAZ-ZVRDQB5:30001:1382415456:41' acquired, ts : 5265fc66ee4bb720b15bc871
 m30001| 2013-10-22T04:17:42.839+0000 [conn4] remotely refreshing metadata for test.foo6 based on current shard version 1|0||5265fc6639672851e14c5984
 m30001| 2013-10-22T04:17:42.841+0000 [conn4] newer metadata not found for test.foo6 (loaded version : 1|0||5265fc6639672851e14c5984, stored version : 1|0||5265fc6639672851e14c5984, took 1ms)
 m30001| 2013-10-22T04:17:42.841+0000 [conn4] splitChunk accepted at version 1|0||5265fc6639672851e14c5984
 m30001| 2013-10-22T04:17:42.842+0000 [conn4] about to log metadata event: { _id: "EC2AMAZ-ZVRDQB5-2013-10-22T04:17:42-5265fc66ee4bb720b15bc872", server: "EC2AMAZ-ZVRDQB5", clientAddr: "127.0.0.1:55273", time: new Date(1382415462842), what: "split", ns: "test.foo6", details: { before: { min: { a: MinKey }, max: { a: MaxKey }, lastmod: Timestamp 1000|0, lastmodEpoch: ObjectId('000000000000000000000000') }, left: { min: { a: MinKey }, max: { a: 2.0 }, lastmod: Timestamp 1000|1, lastmodEpoch: ObjectId('5265fc6639672851e14c5984') }, right: { min: { a: 2.0 }, max: { a: MaxKey }, lastmod: Timestamp 1000|2, lastmodEpoch: ObjectId('5265fc6639672851e14c5984') } } }
 m30001| 2013-10-22T04:17:42.845+0000 [conn4] distributed lock 'test.foo6/EC2AMAZ-ZVRDQB5:30001:1382415456:41' unlocked. 
 m30999| 2013-10-22T04:17:42.846+0000 [conn1] ChunkManager: time to load chunks for test.foo6: 0ms sequenceNumber: 11 version: 1|2||5265fc6639672851e14c5984 based on: 1|0||5265fc6639672851e14c5984
 m30999| 2013-10-22T04:17:42.849+0000 [conn1] CMD: movechunk: { movechunk: "test.foo6", find: { a: 3.0 }, to: "localhost:30000" }
 m30999| 2013-10-22T04:17:42.849+0000 [conn1] moving chunk ns: test.foo6 moving ( ns: test.foo6, shard: shard0001:localhost:30001, lastmod: 1|2||000000000000000000000000, min: { a: 2.0 }, max: { a: MaxKey }) shard0001:localhost:30001 -> shard0000:localhost:30000
 m30001| 2013-10-22T04:17:42.850+0000 [conn4] received moveChunk request: { moveChunk: "test.foo6", from: "localhost:30001", to: "localhost:30000", fromShard: "shard0001", toShard: "shard0000", min: { a: 2.0 }, max: { a: MaxKey }, maxChunkSizeBytes: 52428800, shardId: "test.foo6-a_2.0", configdb: "localhost:30000", secondaryThrottle: false, waitForDelete: false }
 m30001| 2013-10-22T04:17:42.853+0000 [conn4] distributed lock 'test.foo6/EC2AMAZ-ZVRDQB5:30001:1382415456:41' acquired, ts : 5265fc66ee4bb720b15bc873
 m30001| 2013-10-22T04:17:42.853+0000 [conn4] about to log metadata event: { _id: "EC2AMAZ-ZVRDQB5-2013-10-22T04:17:42-5265fc66ee4bb720b15bc874", server: "EC2AMAZ-ZVRDQB5", clientAddr: "127.0.0.1:55273", time: new Date(1382415462853), what: "moveChunk.start", ns: "test.foo6", details: { min: { a: 2.0 }, max: { a: MaxKey }, from: "shard0001", to: "shard0000" } }
 m30001| 2013-10-22T04:17:42.854+0000 [conn4] remotely refreshing metadata for test.foo6 based on current shard version 1|2||5265fc6639672851e14c5984
 m30001| 2013-10-22T04:17:42.855+0000 [conn4] newer metadata not found for test.foo6 (loaded version : 1|2||5265fc6639672851e14c5984, stored version : 1|2||5265fc6639672851e14c5984, took 1ms)
 m30001| 2013-10-22T04:17:42.856+0000 [conn4] moveChunk request accepted at version 1|2||5265fc6639672851e14c5984
 m30001| 2013-10-22T04:17:42.856+0000 [conn4] moveChunk number of documents: 2
 m30001| 2013-10-22T04:17:42.856+0000 [conn4] warning: moveChunk failed to engage TO-shard in the data transfer: migrate already in progress
 m30001| 2013-10-22T04:17:42.856+0000 [conn4] MigrateFromStatus::done About to acquire global write lock to exit critical section
 m30001| 2013-10-22T04:17:42.856+0000 [conn4] MigrateFromStatus::done Global lock acquired
 m30001| 2013-10-22T04:17:42.857+0000 [conn4] distributed lock 'test.foo6/EC2AMAZ-ZVRDQB5:30001:1382415456:41' unlocked. 
 m30001| 2013-10-22T04:17:42.857+0000 [conn4] about to log metadata event: { _id: "EC2AMAZ-ZVRDQB5-2013-10-22T04:17:42-5265fc66ee4bb720b15bc875", server: "EC2AMAZ-ZVRDQB5", clientAddr: "127.0.0.1:55273", time: new Date(1382415462857), what: "moveChunk.from", ns: "test.foo6", details: { min: { a: 2.0 }, max: { a: MaxKey }, step 1 of 6: 0, step 2 of 6: 5, note: "aborted" } }
 m30999| 2013-10-22T04:17:42.857+0000 [conn1] moveChunk result: { cause: { ok: 0.0, errmsg: "migrate already in progress" }, ok: 0.0, errmsg: "moveChunk failed to engage TO-shard in the data transfer: migrate already in progress" }
 m30001| 2013-10-22T04:17:42.862+0000 [RangeDeleter] Deleter starting delete for: test.foo4 from { num: 10.0 } -> { num: MaxKey }, with opId: 126
 m30001| 2013-10-22T04:17:42.862+0000 [RangeDeleter] rangeDeleter deleted 0 documents for test.foo4 from { num: 10.0 } -> { num: MaxKey }
2013-10-22T04:17:42.872+0000 command {
	"movechunk" : "test.foo6",
	"find" : {
		"a" : 3
	},
	"to" : "localhost:30000"
} failed: {
	"cause" : {
		"cause" : {
			"ok" : 0,
			"errmsg" : "migrate already in progress"
		},
		"ok" : 0,
		"errmsg" : "moveChunk failed to engage TO-shard in the data transfer: migrate already in progress"
	},
	"ok" : 0,
	"errmsg" : "move failed"
} at src/mongo/shell/shardingtest.js:558
failed to load: D:\data\mci\git@github.commongodb\mongo.git\master\jstests\sharding\features1.js



 Comments   
Comment by Githook User [ 02/Dec/13 ]

Author:

{u'username': u'renctan', u'name': u'Randolph Tan', u'email': u'randolph@10gen.com'}

Message: SERVER-10232 better migration stats to detect TO-side readiness

Undo temporary band-aids from SERVER-11122 and SERVER-11301
Branch: master
https://github.com/mongodb/mongo/commit/ec176dd577988fc1aa2366de8152c5e97bb95466

Comment by auto [ 28/Oct/13 ]

Author:

{u'username': u'benety', u'name': u'Benety Goh', u'email': u'benety@mongodb.com'}

Message: SERVER-11301 added retry logic to movechunk commands for slow hosts
Branch: master
https://github.com/mongodb/mongo/commit/1cb14a267bf1509e50d31932e25e930af6493561
