Background
cleanupOrphaned avoids the shard key ranges of pending chunks. This makes sense: documents in a pending range are expected to soon become owned documents, and removing them would interfere with the in-progress migration and be counter-productive.
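For reference, cleanupOrphaned is issued directly against a shard primary (not through mongos) and is normally driven in a loop; a minimal sketch, using the foo.bar namespace from the repro below:

// Run against the shard primary, not through mongos.
var nextKey = {};
var result;
while (nextKey != null) {
    result = db.adminCommand({ cleanupOrphaned: "foo.bar", startingFromKey: nextKey });
    if (result.ok != 1) {
        printjson(result);  // failed or timed out; inspect and retry later
        break;
    }
    nextKey = result.stoppedAtKey;  // absent once the whole key space has been scanned
}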
Problem
If the migration succeeds on the TO shard, but then fails on the FROM shard (or otherwise fails to be committed), cleanupOrphaned on the TO shard continues to avoid that chunk range forever, even though the shard does not own that range and there are no in-progress migrations.
This means that orphans in this range (notably the orphans left behind by the failed migration itself) can no longer be removed by cleanupOrphaned. If the migration is subsequently retried and the chunk ends up going to a different shard, the situation persists indefinitely on the original TO shard primary.
This manifests with both the failMigrationConfigWritePrepare and failMigrationCommit failpoints on the FROM shard side (jstests attached).
There are no failpoints on the TO shard side, so I could not properly test whether a failure there also causes the problem.
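The reproduction boils down to enabling one of those failpoints on the FROM shard primary before issuing the moveChunk; a sketch of the idea (assuming test commands are enabled on the shard; this is not the attached jstest verbatim):

// On the FROM shard primary (shard0000 below); requires --setParameter enableTestCommands=1.
db.adminCommand({ configureFailPoint: "failMigrationCommit", mode: "alwaysOn" });
// Run the migration through mongos, e.g. sh.moveChunk("foo.bar", { _id: 0 }, "shard0001"),
// which is expected to fail, then turn the failpoint off again:
db.adminCommand({ configureFailPoint: "failMigrationCommit", mode: "off" });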
Workaround
Stepping down the affected primary causes a new ChunkManager to be built directly from the config metadata, which then correctly determines the owned chunk ranges that cleanupOrphaned should avoid.
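A sketch of the workaround, assuming the shard is a replica set rather than the standalone shards used in the repro below:

// On the affected TO shard primary:
rs.stepDown(60);  // step down for 60 seconds and let another member take over
// The newly elected primary builds its ChunkManager from the config metadata,
// so cleanupOrphaned no longer skips the range left over from the failed migration.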
Results
Split at 0:
m30000| 2015-01-27T17:38:38.149+1100 [conn4] received splitChunk request: { splitChunk: "foo.bar", keyPattern: { _id: 1.0 }, min: { _id: MinKey }, max: { _id: MaxKey }, from: "shard0000", splitKeys: [ { _id: 0.0 } ], shardId: "foo.bar-_id_MinKey", configdb: "localhost:29000" }
m30000| 2015-01-27T17:38:38.153+1100 [conn4] splitChunk accepted at version 1|0||54c7326e866945453e19933e
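For reference, the split corresponds to a mongos command along the lines of the following sketch:

sh.splitAt("foo.bar", { _id: 0 });  // produces the splitChunk request logged above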
Status:
--- Sharding Status ---
  sharding version: {
    "_id" : 1,
    "version" : 4,
    "minCompatibleVersion" : 4,
    "currentVersion" : 5,
    "clusterId" : ObjectId("54c7326d866945453e199337")
  }
  shards:
    { "_id" : "shard0000", "host" : "localhost:30000" }
    { "_id" : "shard0001", "host" : "localhost:30001" }
  databases:
    { "_id" : "admin", "partitioned" : false, "primary" : "config" }
    { "_id" : "foo", "partitioned" : true, "primary" : "shard0000" }
      foo.bar
        shard key: { "_id" : 1 }
        chunks:
          shard0000  2
        { "_id" : { "$minKey" : 1 } } -->> { "_id" : 0 } on : shard0000 Timestamp(1, 1)
        { "_id" : 0 } -->> { "_id" : { "$maxKey" : 1 } } on : shard0000 Timestamp(1, 2)
Deliberately insert an orphan on the second shard, then clean it up. Note that the full MinKey -> MaxKey range is considered by the rangeDeleter (which is correct):
m30001| 2015-01-27T17:38:38.358+1100 [conn1] insert foo.bar query: { _id: 1.0 } ninserted:1 keyUpdates:0 numYields:0 locks(micros) w:143702 143ms
m30001| 2015-01-27T17:38:38.358+1100 [conn1] command foo.$cmd command: insert { insert: "bar", documents: [ { _id: 1.0 } ], ordered: true } keyUpdates:0 numYields:0 locks(micros) w:31 reslen:40 143ms
m30001| 2015-01-27T17:38:38.359+1100 [conn1] remotely refreshing metadata for foo.bar, current shard version is 0|0||000000000000000000000000, current metadata version is 0|0||000000000000000000000000
m29000| 2015-01-27T17:38:38.359+1100 [initandlisten] connection accepted from 127.0.0.1:50549 #11 (10 connections now open)
m30001| 2015-01-27T17:38:38.360+1100 [conn1] collection foo.bar was previously unsharded, new metadata loaded with shard version 0|0||54c7326e866945453e19933e
m30001| 2015-01-27T17:38:38.360+1100 [conn1] collection version was loaded at version 1|2||54c7326e866945453e19933e, took 0ms
m30001| 2015-01-27T17:38:38.360+1100 [conn1] Deleter starting delete for: foo.bar from { _id: MinKey } -> { _id: MaxKey }, with opId: 29
m30001| 2015-01-27T17:38:38.373+1100 [conn1] rangeDeleter deleted 1 documents for foo.bar from { _id: MinKey } -> { _id: MaxKey }
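The step above corresponds roughly to the following, run directly against the second shard (a sketch; the shard1 connection variable is illustrative):

// Connect straight to shard0001, bypassing mongos, and insert a document it does not own.
var shard1 = new Mongo("localhost:30001");
shard1.getDB("foo").bar.insert({ _id: 1 });
// Ask the shard to clean up orphans; at this point the full MinKey -> MaxKey range is scanned.
shard1.getDB("admin").runCommand({ cleanupOrphaned: "foo.bar", startingFromKey: { _id: MinKey } });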
Failed migration of the 0 -> MaxKey chunk (1 document) (for the failMigrationCommit case):
m30999| 2015-01-27T17:38:38.374+1100 [conn1] CMD: movechunk: { moveChunk: "foo.bar", find: { _id: 0.0 }, to: "shard0001" }
m30999| 2015-01-27T17:38:38.374+1100 [conn1] moving chunk ns: foo.bar moving ( ns: foo.bar, shard: shard0000:localhost:30000, lastmod: 1|2||000000000000000000000000, min: { _id: 0.0 }, max: { _id: MaxKey }) shard0000:localhost:30000 -> shard0001:localhost:30001
m30000| 2015-01-27T17:38:38.375+1100 [conn4] received moveChunk request: { moveChunk: "foo.bar", from: "localhost:30000", to: "localhost:30001", fromShard: "shard0000", toShard: "shard0001", min: { _id: 0.0 }, max: { _id: MaxKey }, maxChunkSizeBytes: 52428800, shardId: "foo.bar-_id_0.0", configdb: "localhost:29000", secondaryThrottle: false, waitForDelete: false, maxTimeMS: 0 }
m30000| 2015-01-27T17:38:38.376+1100 [conn4] distributed lock 'foo.bar/genique:30000:1422340718:1213894827' acquired, ts : 54c7326e88b748633811c1e5
m30000| 2015-01-27T17:38:38.376+1100 [conn4] about to log metadata event: { _id: "genique-2015-01-27T06:38:38-54c7326e88b748633811c1e6", server: "genique", clientAddr: "127.0.0.1:37591", time: new Date(1422340718376), what: "moveChunk.start", ns: "foo.bar", details: { min: { _id: 0.0 }, max: { _id: MaxKey }, from: "shard0000", to: "shard0001" } }
m29000| 2015-01-27T17:38:38.376+1100 [conn10] CMD fsync: sync:1 lock:0
m30000| 2015-01-27T17:38:38.397+1100 [conn4] remotely refreshing metadata for foo.bar based on current shard version 1|2||54c7326e866945453e19933e, current metadata version is 1|2||54c7326e866945453e19933e
m30000| 2015-01-27T17:38:38.398+1100 [conn4] metadata of collection foo.bar already up to date (shard version : 1|2||54c7326e866945453e19933e, took 0ms)
m30000| 2015-01-27T17:38:38.398+1100 [conn4] moveChunk request accepted at version 1|2||54c7326e866945453e19933e
m30000| 2015-01-27T17:38:38.398+1100 [conn4] moveChunk number of documents: 1
m30001| 2015-01-27T17:38:38.398+1100 [initandlisten] connection accepted from 127.0.0.1:59845 #5 (5 connections now open)
m30001| 2015-01-27T17:38:38.398+1100 [conn5] remotely refreshing metadata for foo.bar based on current shard version 0|0||54c7326e866945453e19933e, current metadata version is 1|2||54c7326e866945453e19933e
m30001| 2015-01-27T17:38:38.399+1100 [conn5] metadata of collection foo.bar already up to date (shard version : 0|0||54c7326e866945453e19933e, took 0ms)
m30001| 2015-01-27T17:38:38.399+1100 [migrateThread] starting receiving-end of migration of chunk { _id: 0.0 } -> { _id: MaxKey } for collection foo.bar from localhost:30000 at epoch 54c7326e866945453e19933e
m30000| 2015-01-27T17:38:38.399+1100 [initandlisten] connection accepted from 127.0.0.1:37598 #5 (5 connections now open)
m30000| 2015-01-27T17:38:38.400+1100 [conn4] moveChunk data transfer progress: { active: true, ns: "foo.bar", from: "localhost:30000", min: { _id: 0.0 }, max: { _id: MaxKey }, shardKeyPattern: { _id: 1.0 }, state: "clone", counts: { cloned: 0, clonedBytes: 0, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
m30001| 2015-01-27T17:38:38.400+1100 [migrateThread] Waiting for replication to catch up before entering critical section
m30001| 2015-01-27T17:38:38.400+1100 [migrateThread] migrate commit succeeded flushing to secondaries for 'foo.bar' { _id: 0.0 } -> { _id: MaxKey }
m30000| 2015-01-27T17:38:38.402+1100 [conn4] moveChunk data transfer progress: { active: true, ns: "foo.bar", from: "localhost:30000", min: { _id: 0.0 }, max: { _id: MaxKey }, shardKeyPattern: { _id: 1.0 }, state: "catchup", counts: { cloned: 1, clonedBytes: 18, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
m30000| 2015-01-27T17:38:38.407+1100 [conn4] moveChunk data transfer progress: { active: true, ns: "foo.bar", from: "localhost:30000", min: { _id: 0.0 }, max: { _id: MaxKey }, shardKeyPattern: { _id: 1.0 }, state: "catchup", counts: { cloned: 1, clonedBytes: 18, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
m30001| 2015-01-27T17:38:38.413+1100 [migrateThread] migrate commit flushed to journal for 'foo.bar' { _id: 0.0 } -> { _id: MaxKey }
m30000| 2015-01-27T17:38:38.415+1100 [conn4] moveChunk data transfer progress: { active: true, ns: "foo.bar", from: "localhost:30000", min: { _id: 0.0 }, max: { _id: MaxKey }, shardKeyPattern: { _id: 1.0 }, state: "steady", counts: { cloned: 1, clonedBytes: 18, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
m30000| 2015-01-27T17:38:38.415+1100 [conn4] About to check if it is safe to enter critical section
m30000| 2015-01-27T17:38:38.415+1100 [conn4] About to enter migrate critical section
m30000| 2015-01-27T17:38:38.415+1100 [conn4] moveChunk setting version to: 2|0||54c7326e866945453e19933e
m30001| 2015-01-27T17:38:38.416+1100 [initandlisten] connection accepted from 127.0.0.1:59847 #6 (6 connections now open)
m30001| 2015-01-27T17:38:38.423+1100 [migrateThread] migrate commit succeeded flushing to secondaries for 'foo.bar' { _id: 0.0 } -> { _id: MaxKey }
m30001| 2015-01-27T17:38:38.423+1100 [migrateThread] migrate commit flushed to journal for 'foo.bar' { _id: 0.0 } -> { _id: MaxKey }
m30001| 2015-01-27T17:38:38.423+1100 [migrateThread] about to log metadata event: { _id: "genique-2015-01-27T06:38:38-54c7326e5b94b105b43d330c", server: "genique", clientAddr: ":27017", time: new Date(1422340718423), what: "moveChunk.to", ns: "foo.bar", details: { min: { _id: 0.0 }, max: { _id: MaxKey }, step 1 of 5: 0, step 2 of 5: 0, step 3 of 5: 0, step 4 of 5: 0, step 5 of 5: 23, note: "success" } }
m29000| 2015-01-27T17:38:38.424+1100 [initandlisten] connection accepted from 127.0.0.1:50553 #12 (11 connections now open)
m29000| 2015-01-27T17:38:38.424+1100 [conn12] CMD fsync: sync:1 lock:0
m30000| 2015-01-27T17:38:38.459+1100 [conn4] moveChunk migrate commit not accepted by TO-shard: { active: false, ns: "foo.bar", from: "localhost:30000", min: { _id: 0.0 }, max: { _id: MaxKey }, shardKeyPattern: { _id: 1.0 }, state: "done", counts: { cloned: 1, clonedBytes: 18, catchup: 0, steady: 0 }, ok: 1.0 } resetting shard version to: 0|0||000000000000000000000000
m30000| 2015-01-27T17:38:38.459+1100 [conn4] moveChunk global lock acquired to reset shard version from failed migration
m30000| 2015-01-27T17:38:38.459+1100 [conn4] ShardingState::undoDonateChunk acquired _mutex
m30000| 2015-01-27T17:38:38.459+1100 [conn4] Shard version successfully reset to clean up failed migration
m30000| 2015-01-27T17:38:38.459+1100 [conn4] MigrateFromStatus::done About to acquire global write lock to exit critical section
m30000| 2015-01-27T17:38:38.459+1100 [conn4] MigrateFromStatus::done Global lock acquired
m30000| 2015-01-27T17:38:38.459+1100 [conn4] distributed lock 'foo.bar/genique:30000:1422340718:1213894827' unlocked.
m30000| 2015-01-27T17:38:38.459+1100 [conn4] about to log metadata event: { _id: "genique-2015-01-27T06:38:38-54c7326e88b748633811c1e7", server: "genique", clientAddr: "127.0.0.1:37591", time: new Date(1422340718459), what: "moveChunk.from", ns: "foo.bar", details: { min: { _id: 0.0 }, max: { _id: MaxKey }, step 1 of 6: 0, step 2 of 6: 23, step 3 of 6: 1, step 4 of 6: 16, note: "aborted", errmsg: "_recvChunkCommit failed!" } }
m29000| 2015-01-27T17:38:38.460+1100 [conn10] CMD fsync: sync:1 lock:0
m30000| 2015-01-27T17:38:38.494+1100 [conn4] command admin.$cmd command: moveChunk { moveChunk: "foo.bar", from: "localhost:30000", to: "localhost:30001", fromShard: "shard0000", toShard: "shard0001", min: { _id: 0.0 }, max: { _id: MaxKey }, maxChunkSizeBytes: 52428800, shardId: "foo.bar-_id_0.0", configdb: "localhost:29000", secondaryThrottle: false, waitForDelete: false, maxTimeMS: 0 } ntoreturn:1 keyUpdates:0 numYields:0 locks(micros) W:32 r:67 w:44 reslen:321 120ms
m30999| 2015-01-27T17:38:38.494+1100 [conn1] moveChunk result: { cause: { active: false, ns: "foo.bar", from: "localhost:30000", min: { _id: 0.0 }, max: { _id: MaxKey }, shardKeyPattern: { _id: 1.0 }, state: "done", counts: { cloned: 1, clonedBytes: 18, catchup: 0, steady: 0 }, ok: 1.0 }, ok: 0.0, errmsg: "_recvChunkCommit failed!" }
After the failed migration (which has left behind 1 orphan), the rangeDeleter is only checking MinKey -> 0. The range 0 -> MaxKey is not scanned, and so the orphan is not cleaned:
m30001| 2015-01-27T17:38:38.496+1100 [conn1] remotely refreshing metadata for foo.bar based on current shard version 0|0||54c7326e866945453e19933e, current metadata version is 1|2||54c7326e866945453e19933e
m30001| 2015-01-27T17:38:38.496+1100 [conn1] metadata of collection foo.bar already up to date (shard version : 0|0||54c7326e866945453e19933e, took 0ms)
m30001| 2015-01-27T17:38:38.496+1100 [conn1] Deleter starting delete for: foo.bar from { _id: MinKey } -> { _id: 0.0 }, with opId: 41
m30001| 2015-01-27T17:38:38.496+1100 [conn1] rangeDeleter deleted 0 documents for foo.bar from { _id: MinKey } -> { _id: 0.0 }
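The leftover orphan can be confirmed by querying the shard directly (a sketch, reusing the illustrative shard1 connection from above):

// The document cloned by the failed migration is still sitting on shard0001:
shard1.getDB("foo").bar.find().itcount();  // expected: 1 orphan in the 0 -> MaxKey range
// Repeated cleanupOrphaned calls will not remove it while the range is still treated as pending.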