[SERVER-10232] better migration stats to detect TO-side readiness - shard3.js Created: 17/Jul/13  Updated: 11/Jul/16  Resolved: 02/Dec/13

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 2.5.5

Type: Bug Priority: Major - P3
Reporter: Matt Kangas Assignee: Randolph Tan
Resolution: Done Votes: 0
Labels: build-failure
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

buildbot Nightly Linux 64-bit Build #1499
buildbot-special Nightly Linux 64-bit Subscription RHEL 6.2 Build #508
win32


Issue Links:
Related
related to SERVER-11301 moveChunk command failing in some sha... Closed
related to SERVER-11784 better migration stats Closed
is related to SERVER-11122 moveChunk fails in sharding/hash_basi... Closed
Operating System: ALL
Participants:

 Description   

What happened here?

Nightly Linux 64-bit Build #1499 July 16 (rev dd226d98a)

http://buildbot.mongodb.org/builders/Nightly%20Linux%2064-bit/builds/1499/steps/test_13/logs/stdio
http://buildlogs.mongodb.org/Nightly%20Linux%2064-bit/builds/1499/test/sharding/shard3.js

MOVE ALL TO 1
ShardingTest sharding collection info: test.foo
  mongod connection to localhost:30000 {  "configServer" : "localhost:30000",  "global" : Timestamp(2, 0),  "inShardedMode" : false,  "mine" : Timestamp(0, 0),  "ok" : 1 }
  mongod connection to localhost:30001 {  "configServer" : "localhost:30000",  "global" : Timestamp(2, 0),  "inShardedMode" : false,  "mine" : Timestamp(0, 0),  "ok" : 1 }
  mongos connection to localhost:30999 {  "version" : Timestamp(2, 1),  "versionEpoch" : ObjectId("51e5dd0c3c2dea6230dc3dc0"),  "ok" : 1 }
  mongos connection to localhost:30998 {  "version" : Timestamp(2, 1),  "versionEpoch" : ObjectId("51e5dd0c3c2dea6230dc3dc0"),  "ok" : 1 }
  test.foo-num_MinKey	2|1	{ "num" : { "$minKey" : 1 } } -> { "num" : 2 } shard0001  test.foo
  test.foo-num_2.0	2|0	{ "num" : 2 } -> { "num" : { "$maxKey" : 1 } } shard0000  test.foo
GOING TO MOVE
ShardingTest sharding collection info: test.foo
  mongod connection to localhost:30000 {  "configServer" : "localhost:30000",  "global" : Timestamp(2, 0),  "inShardedMode" : false,  "mine" : Timestamp(0, 0),  "ok" : 1 }
  mongod connection to localhost:30001 {  "configServer" : "localhost:30000",  "global" : Timestamp(2, 0),  "inShardedMode" : false,  "mine" : Timestamp(0, 0),  "ok" : 1 }
  mongos connection to localhost:30999 {  "version" : Timestamp(2, 1),  "versionEpoch" : ObjectId("51e5dd0c3c2dea6230dc3dc0"),  "ok" : 1 }
  mongos connection to localhost:30998 {  "version" : Timestamp(2, 1),  "versionEpoch" : ObjectId("51e5dd0c3c2dea6230dc3dc0"),  "ok" : 1 }
  test.foo-num_MinKey	2|1	{ "num" : { "$minKey" : 1 } } -> { "num" : 2 } shard0001  test.foo
  test.foo-num_2.0	2|0	{ "num" : 2 } -> { "num" : { "$maxKey" : 1 } } shard0000  test.foo
counts before move: { "0" : 3, "1" : 1 }
 m30000| Tue Jul 16 19:53:48.339 [initandlisten] connection accepted from 127.0.0.1:52658 #16 (16 connections now open)
 m30000| Tue Jul 16 19:53:48.344 [conn10] no current collection metadata found for this shard, will initialize
 m30000| Tue Jul 16 19:53:48.344 [initandlisten] connection accepted from 127.0.0.1:52659 #17 (17 connections now open)
 m30000| Tue Jul 16 19:53:48.346 [initandlisten] connection accepted from 127.0.0.1:52660 #18 (18 connections now open)
 m30001| Tue Jul 16 19:53:48.340 [conn4] about to log metadata event: { _id: "ip-10-2-29-40-2013-07-16T23:53:48-51e5dd0ca9f9ec0a6206d59e", server: "ip-10-2-29-40", clientAddr: "127.0.0.1:50351", time: new Date(1374018828340), what: "moveChunk.commit", ns: "test.foo", details: { min: { num: 2.0 }, max: { num: MaxKey }, from: "shard0001", to: "shard0000" } }
 m30001| Tue Jul 16 19:53:48.340 [conn4] MigrateFromStatus::done About to acquire global write lock to exit critical section
 m30001| Tue Jul 16 19:53:48.340 [conn4] MigrateFromStatus::done Global lock acquired
 m30001| Tue Jul 16 19:53:48.340 [conn4] doing delete inline for cleanup of chunk data
 m30001| Tue Jul 16 19:53:48.340 [conn4] Deleter starting delete for: test.foo from { num: 2.0 } -> { num: MaxKey }, with opId: 45
 m30001| Tue Jul 16 19:53:48.340 [conn4] rangeDeleter deleted 2 documents for test.foo from { num: 2.0 } -> { num: MaxKey }
 m30001| Tue Jul 16 19:53:48.340 [conn4] MigrateFromStatus::done About to acquire global write lock to exit critical section
 m30001| Tue Jul 16 19:53:48.340 [conn4] MigrateFromStatus::done Global lock acquired
 m30001| Tue Jul 16 19:53:48.341 [conn4] distributed lock 'test.foo/ip-10-2-29-40:30001:1374018828:1575425134' unlocked. 
 m30001| Tue Jul 16 19:53:48.341 [conn4] about to log metadata event: { _id: "ip-10-2-29-40-2013-07-16T23:53:48-51e5dd0ca9f9ec0a6206d59f", server: "ip-10-2-29-40", clientAddr: "127.0.0.1:50351", time: new Date(1374018828341), what: "moveChunk.from", ns: "test.foo", details: { min: { num: 2.0 }, max: { num: MaxKey }, step1 of 6: 0, step2 of 6: 1, step3 of 6: 0, step4 of 6: 282, step5 of 6: 10, step6 of 6: 0 } }
 m30001| Tue Jul 16 19:53:48.341 [conn4] command admin.$cmd command: { moveChunk: "test.foo", from: "localhost:30001", to: "localhost:30000", fromShard: "shard0001", toShard: "shard0000", min: { num: 2.0 }, max: { num: MaxKey }, maxChunkSizeBytes: 52428800, shardId: "test.foo-num_2.0", configdb: "localhost:30000", secondaryThrottle: false, waitForDelete: true } ntoreturn:1 keyUpdates:0 locks(micros) W:22 r:86 w:155 reslen:37 295ms
 m30001| Tue Jul 16 19:53:48.368 [conn4] moveChunk waiting for full cleanup after move
 m30001| Tue Jul 16 19:53:48.368 [conn4] received moveChunk request: { moveChunk: "test.foo", from: "localhost:30001", to: "localhost:30000", fromShard: "shard0001", toShard: "shard0000", min: { num: MinKey }, max: { num: 2.0 }, maxChunkSizeBytes: 52428800, shardId: "test.foo-num_MinKey", configdb: "localhost:30000", secondaryThrottle: false, waitForDelete: true }
 m30001| Tue Jul 16 19:53:48.369 [conn4] distributed lock 'test.foo/ip-10-2-29-40:30001:1374018828:1575425134' acquired, ts : 51e5dd0ca9f9ec0a6206d5a0
 m30001| Tue Jul 16 19:53:48.369 [conn4] about to log metadata event: { _id: "ip-10-2-29-40-2013-07-16T23:53:48-51e5dd0ca9f9ec0a6206d5a1", server: "ip-10-2-29-40", clientAddr: "127.0.0.1:50351", time: new Date(1374018828369), what: "moveChunk.start", ns: "test.foo", details: { min: { num: MinKey }, max: { num: 2.0 }, from: "shard0001", to: "shard0000" } }
 m30001| Tue Jul 16 19:53:48.370 [conn4] moveChunk request accepted at version 2|1||51e5dd0c3c2dea6230dc3dc0
 m30001| Tue Jul 16 19:53:48.370 [conn4] moveChunk number of documents: 1
 m30001| Tue Jul 16 19:53:48.370 [conn4] warning: moveChunk failed to engage TO-shard in the data transfer: migrate already in progress
 m30001| Tue Jul 16 19:53:48.370 [conn4] MigrateFromStatus::done About to acquire global write lock to exit critical section
 m30001| Tue Jul 16 19:53:48.370 [conn4] MigrateFromStatus::done Global lock acquired
 m30001| Tue Jul 16 19:53:48.370 [conn4] distributed lock 'test.foo/ip-10-2-29-40:30001:1374018828:1575425134' unlocked. 
 m30001| Tue Jul 16 19:53:48.370 [conn4] about to log metadata event: { _id: "ip-10-2-29-40-2013-07-16T23:53:48-51e5dd0ca9f9ec0a6206d5a2", server: "ip-10-2-29-40", clientAddr: "127.0.0.1:50351", time: new Date(1374018828370), what: "moveChunk.from", ns: "test.foo", details: { min: { num: MinKey }, max: { num: 2.0 }, step1 of 6: 0, step2 of 6: 1, note: "aborted" } }
 m30999| Tue Jul 16 19:53:48.341 [conn1] moveChunk result: { ok: 1.0 }
 m30999| Tue Jul 16 19:53:48.342 [conn1] ChunkManager: time to load chunks for test.foo: 0ms sequenceNumber: 4 version: 2|1||51e5dd0c3c2dea6230dc3dc0 based on: 1|2||51e5dd0c3c2dea6230dc3dc0
 m30999| Tue Jul 16 19:53:48.343 [conn1]     setShardVersion  shard0000 localhost:30000  test.foo  { setShardVersion: "test.foo", configdb: "localhost:30000", version: Timestamp 2000|0, versionEpoch: ObjectId('51e5dd0c3c2dea6230dc3dc0'), serverID: ObjectId('51e5dd0b3c2dea6230dc3dbe'), shard: "shard0000", shardHost: "localhost:30000" } 0x2aaaabcae700 4
 m30999| Tue Jul 16 19:53:48.343 [conn1]        setShardVersion failed!
 m30999| { oldVersion: Timestamp 0|0, oldVersionEpoch: ObjectId('000000000000000000000000'), ns: "test.foo", need_authoritative: true, ok: 0.0, errmsg: "first time for collection 'test.foo'" }
 m30999| Tue Jul 16 19:53:48.344 [conn1]     setShardVersion  shard0000 localhost:30000  test.foo  { setShardVersion: "test.foo", configdb: "localhost:30000", version: Timestamp 2000|0, versionEpoch: ObjectId('51e5dd0c3c2dea6230dc3dc0'), serverID: ObjectId('51e5dd0b3c2dea6230dc3dbe'), authoritative: true, shard: "shard0000", shardHost: "localhost:30000" } 0x2aaaabcae700 4
 m30999| Tue Jul 16 19:53:48.345 [conn1]       setShardVersion success: { oldVersion: Timestamp 0|0, oldVersionEpoch: ObjectId('000000000000000000000000'), ok: 1.0 }
 m30999| Tue Jul 16 19:53:48.345 [conn1]     setShardVersion  shard0001 localhost:30001  test.foo  { setShardVersion: "test.foo", configdb: "localhost:30000", version: Timestamp 2000|1, versionEpoch: ObjectId('51e5dd0c3c2dea6230dc3dc0'), serverID: ObjectId('51e5dd0b3c2dea6230dc3dbe'), shard: "shard0001", shardHost: "localhost:30001" } 0x2aaaabcaea80 4
 m30999| Tue Jul 16 19:53:48.345 [conn1]       setShardVersion success: { oldVersion: Timestamp 1000|0, oldVersionEpoch: ObjectId('51e5dd0c3c2dea6230dc3dc0'), ok: 1.0 }
 m30999| Tue Jul 16 19:53:48.359 [conn1] ns: test.foo, shard: shard0001:localhost:30001, lastmod: 2|1||000000000000000000000000, min: { num: MinKey }, max: { num: 2.0 }
 m30999| Tue Jul 16 19:53:48.359 [conn1] ns: test.foo, shard: shard0000:localhost:30000, lastmod: 2|0||000000000000000000000000, min: { num: 2.0 }, max: { num: MaxKey }
 m30999| Tue Jul 16 19:53:48.364 [conn1] ns: test.foo, shard: shard0001:localhost:30001, lastmod: 2|1||000000000000000000000000, min: { num: MinKey }, max: { num: 2.0 }
 m30999| Tue Jul 16 19:53:48.364 [conn1] ns: test.foo, shard: shard0000:localhost:30000, lastmod: 2|0||000000000000000000000000, min: { num: 2.0 }, max: { num: MaxKey }
 m30999| Tue Jul 16 19:53:48.368 [conn1] CMD: movechunk: { movechunk: "test.foo", find: { num: 1.0 }, to: "localhost:30000", _waitForDelete: true }
 m30999| Tue Jul 16 19:53:48.368 [conn1] moving chunk ns: test.foo moving ( ns: test.foo, shard: shard0001:localhost:30001, lastmod: 2|1||000000000000000000000000, min: { num: MinKey }, max: { num: 2.0 }) shard0001:localhost:30001 -> shard0000:localhost:30000
 m30999| Tue Jul 16 19:53:48.370 [conn1] moveChunk result: { cause: { ok: 0.0, errmsg: "migrate already in progress" }, ok: 0.0, errmsg: "moveChunk failed to engage TO-shard in the data transfer: migrate already in progress" }
 m30998| Tue Jul 16 19:53:48.346 [conn1] creating new connection to:localhost:30000
 m30998| Tue Jul 16 19:53:48.346 BackgroundJob starting: ConnectBG
 m30998| Tue Jul 16 19:53:48.346 [conn1] connected connection!
 m30998| Tue Jul 16 19:53:48.347 [conn1] ChunkManager: time to load chunks for test.foo: 0ms sequenceNumber: 3 version: 2|1||51e5dd0c3c2dea6230dc3dc0 based on: 1|0||51e5dd0c3c2dea6230dc3dc0
 m30998| Tue Jul 16 19:53:48.347 [conn1]     setShardVersion  shard0000 localhost:30000  test.foo  { setShardVersion: "test.foo", configdb: "localhost:30000", version: Timestamp 2000|0, versionEpoch: ObjectId('51e5dd0c3c2dea6230dc3dc0'), serverID: ObjectId('51e5dd0b736ed6f25a8961dc'), shard: "shard0000", shardHost: "localhost:30000" } 0x2aaaabcae7e0 3
 m30998| Tue Jul 16 19:53:48.347 [conn1]       setShardVersion success: { oldVersion: Timestamp 0|0, oldVersionEpoch: ObjectId('000000000000000000000000'), ok: 1.0 }
 m30998| Tue Jul 16 19:53:48.347 [conn1]     setShardVersion  shard0001 localhost:30001  test.foo  { setShardVersion: "test.foo", configdb: "localhost:30000", version: Timestamp 2000|1, versionEpoch: ObjectId('51e5dd0c3c2dea6230dc3dc0'), serverID: ObjectId('51e5dd0b736ed6f25a8961dc'), shard: "shard0001", shardHost: "localhost:30001" } 0x2aaaabcae700 3
 m30998| Tue Jul 16 19:53:48.347 [conn1]       setShardVersion success: { oldVersion: Timestamp 1000|0, oldVersionEpoch: ObjectId('51e5dd0c3c2dea6230dc3dc0'), ok: 1.0 }
 m30998| Tue Jul 16 19:53:48.360 [conn1] DBConfig unserialize: admin { _id: "admin", partitioned: false, primary: "config" }
 m30998| Tue Jul 16 19:53:48.360 [conn1] ns: test.foo, shard: shard0001:localhost:30001, lastmod: 2|1||000000000000000000000000, min: { num: MinKey }, max: { num: 2.0 }
 m30998| Tue Jul 16 19:53:48.360 [conn1] ns: test.foo, shard: shard0000:localhost:30000, lastmod: 2|0||000000000000000000000000, min: { num: 2.0 }, max: { num: MaxKey }
 m30998| Tue Jul 16 19:53:48.364 [conn1] ns: test.foo, shard: shard0001:localhost:30001, lastmod: 2|1||000000000000000000000000, min: { num: MinKey }, max: { num: 2.0 }
 m30998| Tue Jul 16 19:53:48.364 [conn1] ns: test.foo, shard: shard0000:localhost:30000, lastmod: 2|0||000000000000000000000000, min: { num: 2.0 }, max: { num: MaxKey }
Tue Jul 16 19:53:48.374 command {
	"movechunk" : "test.foo",
	"find" : {
		"num" : 1
	},
	"to" : "localhost:30000",
	"_waitForDelete" : true
} failed: {
	"cause" : {
		"cause" : {
			"ok" : 0,
			"errmsg" : "migrate already in progress"
		},
		"ok" : 0,
		"errmsg" : "moveChunk failed to engage TO-shard in the data transfer: migrate already in progress"
	},
	"ok" : 0,
	"errmsg" : "move failed"
} at src/mongo/shell/shardingtest.js:558
failed to load: /mntfast/data/slave/Linux_64bit_Nightly/mongo/jstests/sharding/shard3.js
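
For context, the failure above is a second moveChunk issued immediately after the first one returns, while the TO-shard is still tearing down the previous migration, so it reports "migrate already in progress". Until the server exposes better migration stats, a test could tolerate the transient error by retrying. A minimal sketch in plain JavaScript (the helper name and the fake TO-shard are illustrative, not part of the test suite):

```javascript
// Retry an operation whose only failure is the transient
// "migrate already in progress" error seen in the log above.
// moveChunkOnce is any function returning a { ok, errmsg } result document.
function moveChunkWithRetry(moveChunkOnce, maxAttempts) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = moveChunkOnce();
    if (res.ok) return res; // success
    const msg = res.errmsg || "";
    if (!msg.includes("migrate already in progress")) {
      // Any other error is real; fail immediately.
      throw new Error("moveChunk failed permanently: " + msg);
    }
    // TO-shard not ready yet; fall through and retry.
  }
  throw new Error("moveChunk still failing after " + maxAttempts + " attempts");
}

// Illustration with a fake TO-shard that is busy for the first two calls.
let calls = 0;
const fakeMoveChunk = () => {
  calls++;
  return calls < 3
    ? { ok: 0, errmsg: "moveChunk failed to engage TO-shard in the data transfer: migrate already in progress" }
    : { ok: 1 };
};

const result = moveChunkWithRetry(fakeMoveChunk, 10);
```

This mirrors the band-aids mentioned in the comments below, which the real fix later made unnecessary.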



 Comments   
Comment by Githook User [ 02/Dec/13 ]

Author:

Randolph Tan (renctan) &lt;randolph@10gen.com&gt;

Message: SERVER-10232 better migration stats to detect TO-side readiness

Undo temporary band-aids from SERVER-11122 and SERVER-11301
Branch: master
https://github.com/mongodb/mongo/commit/ec176dd577988fc1aa2366de8152c5e97bb95466

Comment by Greg Studer [ 19/Nov/13 ]

The commit above fixes this issue via a different approach (the data transfer of a migration now always stops before the migration is finished); there's a separate ticket, SERVER-11784, to track migration stats now.

Issue not closed yet, since we can now roll back the test workarounds.
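
The ordering change described here can be pictured as a tiny state machine: the TO-side migrate status must return to "inactive" before the FROM-side commit completes, so a subsequent moveChunk never sees "migrate already in progress". A hypothetical sketch (state names and methods are illustrative only, not the server's actual C++ internals):

```javascript
// Illustrative model of the TO-shard's migrate status.
class MigrateStatus {
  constructor() { this.state = "inactive"; }

  // A new migration can only engage an inactive TO-shard;
  // otherwise it fails the way the log above shows.
  start() {
    if (this.state !== "inactive") {
      return { ok: 0, errmsg: "migrate already in progress" };
    }
    this.state = "active";
    return { ok: 1 };
  }

  // Fixed ordering: the data transfer is fully stopped and the status
  // reset to "inactive" before commit returns, instead of merely
  // flipping a DONE flag while the status could still read as active.
  commitWaitingForInactive() {
    this.state = "inactive";
  }
}

const toShard = new MigrateStatus();
const first = toShard.start();         // first migration engages
toShard.commitWaitingForInactive();    // commit waits for inactive
const second = toShard.start();        // immediate follow-up now succeeds
```

Under the old ordering, the second `start()` in this model could race the cleanup and fail, which is exactly the shard3.js failure mode.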

Comment by Githook User [ 19/Nov/13 ]

Author:

Greg Studer (gregstuder) &lt;greg@10gen.com&gt;

Message: SERVER-10232 wait for inactive on migration commit, not DONE
Branch: master
https://github.com/mongodb/mongo/commit/b303efd6b2ec39e2805fea8c814be02e0a1ca8bb

Comment by Daniel Pasette (Inactive) [ 18/Nov/13 ]

shard3.js has been failing on win32 in MCI very consistently the past few days. Might have to hack the test to get this out of the way.

http://mci.10gen.com/ui/task/mongodb_mongo_master_windows_32_c64ddd2d609e844ab4e9fcc2376f903e5f145d74_13_11_17_20_32_06_sharding_windows_32

Comment by Benety Goh [ 10/Oct/13 ]

shard3.js has been running fine under Windows 32-bit but hash_basic.js has been failing consistently.

Comment by Daniel Pasette (Inactive) [ 28/Aug/13 ]

Again on 2.4 linux-64:
http://buildlogs.mongodb.org/V2.4%20Linux%2064-bit/builds/192/test/sharding/shard3.js

Comment by Greg Studer [ 03/Aug/13 ]

This is benign, but hard to fix since it requires better stats about migration. Seeing if we can work this into 2.5.2.

Comment by Matt Kangas [ 17/Jul/13 ]

Same issue?

Nightly Linux 64-bit Subscription RHEL 6.2 Build #508 July 16

http://buildbot-special.10gen.cc:8081/builders/Nightly%20Linux%2064-bit%20Subscription%20RHEL%206.2/builds/508/steps/shell_3/logs/stdio
http://buildlogs.mongodb.org/Nightly%20Linux%2064-bit%20Subscription%20RHEL%206.2/builds/508/test/sharding/hash_basic.js

Tue Jul 16 17:24:14.999 m30001| Tue Jul 16 17:24:14.773 [conn4] run command admin.$cmd { moveChunk: "test.user", from: "localhost:30001", to: "localhost:30000", fromShard: "shard0001", toShard: "shard0000", min: { x: 4611686018427387902 }, max: { x: 9130110145486302901 }, maxChunkSizeBytes: 52428800, shardId: "test.user-x_4611686018427387902", configdb: "localhost:30000", secondaryThrottle: false, waitForDelete: true }
 m30001| Tue Jul 16 17:24:14.773 [conn4] moveChunk waiting for full cleanup after move
  m30001| Tue Jul 16 17:24:14.774 [conn4] received moveChunk request: { moveChunk: "test.user", from: "localhost:30001", to: "localhost:30000", fromShard: "shard0001", toShard: "shard0000", min: { x: 4611686018427387902 }, max: { x: 9130110145486302901 }, maxChunkSizeBytes: 52428800, shardId: "test.user-x_4611686018427387902", configdb: "localhost:30000", secondaryThrottle: false, waitForDelete: true }
assert failed : Cmd failed: {
	"cause" : {
		"cause" : {
			"ok" : 0,
			"errmsg" : "migrate already in progress"
		},
		"ok" : 0,
		"errmsg" : "moveChunk failed to engage TO-shard in the data transfer: migrate already in progress"
	},
	"ok" : 0,
	"errmsg" : "move failed"
} at src/mongo/shell/assert.js:7
