[SERVER-17066] cleanupOrphaned misses orphans after failed chunk migration Created: 27/Jan/15  Updated: 06/Dec/22  Resolved: 15/Nov/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.6.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kevin Pulo Assignee: [DO NOT USE] Backlog - Sharding EMEA
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File find-server-17066.js     File migration_failure_commit_cleanupOrphaned.js     File migration_failure_configWritePrepare_cleanupOrphaned.js    
Issue Links:
Related
is related to SERVER-18096 Shard primary incorrectly reuses clos... Closed
Assigned Teams:
Sharding EMEA
Operating System: ALL
Participants:

 Description   
Background

cleanupOrphaned avoids the shard key ranges of pending chunks. This makes sense: documents in those ranges are expected to become owned documents once the incoming migration commits, so removing them would interfere with the migration process and be counter-productive.
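For reference, cleanupOrphaned is issued against a shard primary (not mongos) and is walked across the key space by feeding each returned stoppedAtKey back in as startingFromKey. A minimal sketch of that loop, using the namespace from this report:

 // Run on the shard primary; iterate until stoppedAtKey is no longer returned,
 // so that every non-owned, non-pending range has been handed to the rangeDeleter.
 var nextKey = {};
 var result;
 while (nextKey != null) {
     result = db.adminCommand({ cleanupOrphaned: "foo.bar", startingFromKey: nextKey });
     if (result.ok != 1) {
         print("cleanupOrphaned failed: " + tojson(result));
         break;
     }
     nextKey = result.stoppedAtKey;  // absent once the end of the key space is reached
 }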

Problem

If the migration succeeds on the TO shard, but then fails on the FROM shard (or otherwise fails to be committed), cleanupOrphaned on the TO shard continues to avoid that chunk range forever, even though the shard does not own that range and there are no in-progress migrations.

This means that orphans in this range (notably the orphans from the failed migration itself) can no longer be removed by cleanupOrphaned. If the chunk migration is subsequently retried and ends up going to a different shard, this situation persists forever on the original TO shard primary.

This manifests with both the failMigrationConfigWritePrepare and failMigrationCommit failpoints on the FROM shard side (jstests attached).
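For illustration, a reproduction in the spirit of the attached jstests enables one of these failpoints on the FROM shard and then attempts the migration from mongos. The addresses below are illustrative, not taken from the attachments:

 // Sketch only: the failpoint makes the migration clone to the TO shard but fail at commit,
 // leaving the cloned documents behind as orphans on the TO shard.
 var fromShard = new Mongo("localhost:30000");   // FROM shard primary
 assert.commandWorked(fromShard.adminCommand({
     configureFailPoint: "failMigrationCommit",
     mode: "alwaysOn"
 }));

 var mongos = new Mongo("localhost:30999");
 assert.commandFailed(mongos.adminCommand({
     moveChunk: "foo.bar",
     find: { _id: 0 },
     to: "shard0001"
 }));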

There are no failpoints on the TO shard side, so I couldn't properly test whether a failure there also causes the problem.

Workaround

Stepping down the affected primary allows the new ChunkManager to correctly determine the owned chunk ranges (directly from the config metadata) that cleanupOrphaned should avoid.
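For example, assuming replica-set shards (the connection string is illustrative):

 // Sketch only: step down the affected TO shard primary so that the newly elected primary
 // reloads chunk metadata from the config servers, dropping the stale pending range.
 var shardPrimary = new Mongo("localhost:30001");
 try {
     shardPrimary.adminCommand({ replSetStepDown: 60 });
 } catch (e) {
     // replSetStepDown closes the connection on success, which the shell surfaces as an error.
     print("stepdown issued: " + e);
 }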

Results

Split at 0:

 m30000| 2015-01-27T17:38:38.149+1100 [conn4] received splitChunk request: { splitChunk: "foo.bar", keyPattern: { _id: 1.0 }, min: { _id: MinKey }, max: { _id: MaxKey }, from: "shard0000", splitKeys: [ { _id: 0.0 } ], shardId: "foo.bar-_id_MinKey", configdb: "localhost:29000" }
 m30000| 2015-01-27T17:38:38.153+1100 [conn4] splitChunk accepted at version 1|0||54c7326e866945453e19933e
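For reference, the split is issued from mongos with something like the following (equivalent to sh.splitAt; the connection address is illustrative), which produces the splitChunk request logged by shard0000 above:

 // Sketch only: ask mongos to split foo.bar at { _id: 0 }.
 var mongos = new Mongo("localhost:30999");
 assert.commandWorked(mongos.adminCommand({ split: "foo.bar", middle: { _id: 0 } }));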

Status:

--- Sharding Status ---
  sharding version: {
        "_id" : 1,
        "version" : 4,
        "minCompatibleVersion" : 4,
        "currentVersion" : 5,
        "clusterId" : ObjectId("54c7326d866945453e199337")
}
  shards:
        {  "_id" : "shard0000",  "host" : "localhost:30000" }
        {  "_id" : "shard0001",  "host" : "localhost:30001" }
  databases:
        {  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
        {  "_id" : "foo",  "partitioned" : true,  "primary" : "shard0000" }
                foo.bar
                        shard key: { "_id" : 1 }
                        chunks:
                                shard0000       2
                        { "_id" : { "$minKey" : 1 } } -->> { "_id" : 0 } on : shard0000 Timestamp(1, 1)
                        { "_id" : 0 } -->> { "_id" : { "$maxKey" : 1 } } on : shard0000 Timestamp(1, 2)

Deliberately insert an orphan on 2nd shard, and clean it up — note the full MinKey -> MaxKey range is considered by the rangeDeleter (which is correct):
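The shell steps behind the log below are roughly the following; writing directly to the shard, bypassing mongos, is what makes the document an orphan:

 // Sketch only: shard0001 owns no chunks yet, so the inserted document is an orphan and the
 // full MinKey -> MaxKey range is eligible for the rangeDeleter.
 var shard1 = new Mongo("localhost:30001");
 shard1.getDB("foo").bar.insert({ _id: 1 });
 printjson(shard1.adminCommand({ cleanupOrphaned: "foo.bar" }));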

 m30001| 2015-01-27T17:38:38.358+1100 [conn1] insert foo.bar query: { _id: 1.0 } ninserted:1 keyUpdates:0 numYields:0 locks(micros) w:143702 143ms
 m30001| 2015-01-27T17:38:38.358+1100 [conn1] command foo.$cmd command: insert { insert: "bar", documents: [ { _id: 1.0 } ], ordered: true } keyUpdates:0 numYields:0 locks(micros) w:31 reslen:40 143ms
 m30001| 2015-01-27T17:38:38.359+1100 [conn1] remotely refreshing metadata for foo.bar, current shard version is 0|0||000000000000000000000000, current metadata version is 0|0||000000000000000000000000
 m29000| 2015-01-27T17:38:38.359+1100 [initandlisten] connection accepted from 127.0.0.1:50549 #11 (10 connections now open)
 m30001| 2015-01-27T17:38:38.360+1100 [conn1] collection foo.bar was previously unsharded, new metadata loaded with shard version 0|0||54c7326e866945453e19933e
 m30001| 2015-01-27T17:38:38.360+1100 [conn1] collection version was loaded at version 1|2||54c7326e866945453e19933e, took 0ms
 m30001| 2015-01-27T17:38:38.360+1100 [conn1] Deleter starting delete for: foo.bar from { _id: MinKey } -> { _id: MaxKey }, with opId: 29
 m30001| 2015-01-27T17:38:38.373+1100 [conn1] rangeDeleter deleted 1 documents for foo.bar from { _id: MinKey } -> { _id: MaxKey }

Failed migration of the 0 -> MaxKey chunk (1 document) (for the failMigrationCommit case):

 m30999| 2015-01-27T17:38:38.374+1100 [conn1] CMD: movechunk: { moveChunk: "foo.bar", find: { _id: 0.0 }, to: "shard0001" }
 m30999| 2015-01-27T17:38:38.374+1100 [conn1] moving chunk ns: foo.bar moving ( ns: foo.bar, shard: shard0000:localhost:30000, lastmod: 1|2||000000000000000000000000, min: { _id: 0.0 }, max: { _id: MaxKey }) shard0000:localhost:30000 -> shard0001:localhost:30001
 m30000| 2015-01-27T17:38:38.375+1100 [conn4] received moveChunk request: { moveChunk: "foo.bar", from: "localhost:30000", to: "localhost:30001", fromShard: "shard0000", toShard: "shard0001", min: { _id: 0.0 }, max: { _id: MaxKey }, maxChunkSizeBytes: 52428800, shardId: "foo.bar-_id_0.0", configdb: "localhost:29000", secondaryThrottle: false, waitForDelete: false, maxTimeMS: 0 }
 m30000| 2015-01-27T17:38:38.376+1100 [conn4] distributed lock 'foo.bar/genique:30000:1422340718:1213894827' acquired, ts : 54c7326e88b748633811c1e5
 m30000| 2015-01-27T17:38:38.376+1100 [conn4] about to log metadata event: { _id: "genique-2015-01-27T06:38:38-54c7326e88b748633811c1e6", server: "genique", clientAddr: "127.0.0.1:37591", time: new Date(1422340718376), what: "moveChunk.start", ns: "foo.bar", details: { min: { _id: 0.0 }, max: { _id: MaxKey }, from: "shard0000", to: "shard0001" } }
 m29000| 2015-01-27T17:38:38.376+1100 [conn10] CMD fsync: sync:1 lock:0
 m30000| 2015-01-27T17:38:38.397+1100 [conn4] remotely refreshing metadata for foo.bar based on current shard version 1|2||54c7326e866945453e19933e, current metadata version is 1|2||54c7326e866945453e19933e
 m30000| 2015-01-27T17:38:38.398+1100 [conn4] metadata of collection foo.bar already up to date (shard version : 1|2||54c7326e866945453e19933e, took 0ms)
 m30000| 2015-01-27T17:38:38.398+1100 [conn4] moveChunk request accepted at version 1|2||54c7326e866945453e19933e
 m30000| 2015-01-27T17:38:38.398+1100 [conn4] moveChunk number of documents: 1
 m30001| 2015-01-27T17:38:38.398+1100 [initandlisten] connection accepted from 127.0.0.1:59845 #5 (5 connections now open)
 m30001| 2015-01-27T17:38:38.398+1100 [conn5] remotely refreshing metadata for foo.bar based on current shard version 0|0||54c7326e866945453e19933e, current metadata version is 1|2||54c7326e866945453e19933e
 m30001| 2015-01-27T17:38:38.399+1100 [conn5] metadata of collection foo.bar already up to date (shard version : 0|0||54c7326e866945453e19933e, took 0ms)
 m30001| 2015-01-27T17:38:38.399+1100 [migrateThread] starting receiving-end of migration of chunk { _id: 0.0 } -> { _id: MaxKey } for collection foo.bar from localhost:30000 at epoch 54c7326e866945453e19933e
 m30000| 2015-01-27T17:38:38.399+1100 [initandlisten] connection accepted from 127.0.0.1:37598 #5 (5 connections now open)
 m30000| 2015-01-27T17:38:38.400+1100 [conn4] moveChunk data transfer progress: { active: true, ns: "foo.bar", from: "localhost:30000", min: { _id: 0.0 }, max: { _id: MaxKey }, shardKeyPattern: { _id: 1.0 }, state: "clone", counts: { cloned: 0, clonedBytes: 0, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
 m30001| 2015-01-27T17:38:38.400+1100 [migrateThread] Waiting for replication to catch up before entering critical section
 m30001| 2015-01-27T17:38:38.400+1100 [migrateThread] migrate commit succeeded flushing to secondaries for 'foo.bar' { _id: 0.0 } -> { _id: MaxKey }
 m30000| 2015-01-27T17:38:38.402+1100 [conn4] moveChunk data transfer progress: { active: true, ns: "foo.bar", from: "localhost:30000", min: { _id: 0.0 }, max: { _id: MaxKey }, shardKeyPattern: { _id: 1.0 }, state: "catchup", counts: { cloned: 1, clonedBytes: 18, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
 m30000| 2015-01-27T17:38:38.407+1100 [conn4] moveChunk data transfer progress: { active: true, ns: "foo.bar", from: "localhost:30000", min: { _id: 0.0 }, max: { _id: MaxKey }, shardKeyPattern: { _id: 1.0 }, state: "catchup", counts: { cloned: 1, clonedBytes: 18, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
 m30001| 2015-01-27T17:38:38.413+1100 [migrateThread] migrate commit flushed to journal for 'foo.bar' { _id: 0.0 } -> { _id: MaxKey }
 m30000| 2015-01-27T17:38:38.415+1100 [conn4] moveChunk data transfer progress: { active: true, ns: "foo.bar", from: "localhost:30000", min: { _id: 0.0 }, max: { _id: MaxKey }, shardKeyPattern: { _id: 1.0 }, state: "steady", counts: { cloned: 1, clonedBytes: 18, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
 m30000| 2015-01-27T17:38:38.415+1100 [conn4] About to check if it is safe to enter critical section
 m30000| 2015-01-27T17:38:38.415+1100 [conn4] About to enter migrate critical section
 m30000| 2015-01-27T17:38:38.415+1100 [conn4] moveChunk setting version to: 2|0||54c7326e866945453e19933e
 m30001| 2015-01-27T17:38:38.416+1100 [initandlisten] connection accepted from 127.0.0.1:59847 #6 (6 connections now open)
 m30001| 2015-01-27T17:38:38.423+1100 [migrateThread] migrate commit succeeded flushing to secondaries for 'foo.bar' { _id: 0.0 } -> { _id: MaxKey }
 m30001| 2015-01-27T17:38:38.423+1100 [migrateThread] migrate commit flushed to journal for 'foo.bar' { _id: 0.0 } -> { _id: MaxKey }
 m30001| 2015-01-27T17:38:38.423+1100 [migrateThread] about to log metadata event: { _id: "genique-2015-01-27T06:38:38-54c7326e5b94b105b43d330c", server: "genique", clientAddr: ":27017", time: new Date(1422340718423), what: "moveChunk.to", ns: "foo.bar", details: { min: { _id: 0.0 }, max: { _id: MaxKey }, step 1 of 5: 0, step 2 of 5: 0, step 3 of 5: 0, step 4 of 5: 0, step 5 of 5: 23, note: "success" } }
 m29000| 2015-01-27T17:38:38.424+1100 [initandlisten] connection accepted from 127.0.0.1:50553 #12 (11 connections now open)
 m29000| 2015-01-27T17:38:38.424+1100 [conn12] CMD fsync: sync:1 lock:0
 m30000| 2015-01-27T17:38:38.459+1100 [conn4] moveChunk migrate commit not accepted by TO-shard: { active: false, ns: "foo.bar", from: "localhost:30000", min: { _id: 0.0 }, max: { _id: MaxKey }, shardKeyPattern: { _id: 1.0 }, state: "done", counts: { cloned: 1, clonedBytes: 18, catchup: 0, steady: 0 }, ok: 1.0 } resetting shard version to: 0|0||000000000000000000000000
 m30000| 2015-01-27T17:38:38.459+1100 [conn4] moveChunk global lock acquired to reset shard version from failed migration
 m30000| 2015-01-27T17:38:38.459+1100 [conn4] ShardingState::undoDonateChunk acquired _mutex
 m30000| 2015-01-27T17:38:38.459+1100 [conn4] Shard version successfully reset to clean up failed migration
 m30000| 2015-01-27T17:38:38.459+1100 [conn4] MigrateFromStatus::done About to acquire global write lock to exit critical section
 m30000| 2015-01-27T17:38:38.459+1100 [conn4] MigrateFromStatus::done Global lock acquired
 m30000| 2015-01-27T17:38:38.459+1100 [conn4] distributed lock 'foo.bar/genique:30000:1422340718:1213894827' unlocked.
 m30000| 2015-01-27T17:38:38.459+1100 [conn4] about to log metadata event: { _id: "genique-2015-01-27T06:38:38-54c7326e88b748633811c1e7", server: "genique", clientAddr: "127.0.0.1:37591", time: new Date(1422340718459), what: "moveChunk.from", ns: "foo.bar", details: { min: { _id: 0.0 }, max: { _id: MaxKey }, step 1 of 6: 0, step 2 of 6: 23, step 3 of 6: 1, step 4 of 6: 16, note: "aborted", errmsg: "_recvChunkCommit failed!" } }
 m29000| 2015-01-27T17:38:38.460+1100 [conn10] CMD fsync: sync:1 lock:0
 m30000| 2015-01-27T17:38:38.494+1100 [conn4] command admin.$cmd command: moveChunk { moveChunk: "foo.bar", from: "localhost:30000", to: "localhost:30001", fromShard: "shard0000", toShard: "shard0001", min: { _id: 0.0 }, max: { _id: MaxKey }, maxChunkSizeBytes: 52428800, shardId: "foo.bar-_id_0.0", configdb: "localhost:29000", secondaryThrottle: false, waitForDelete: false, maxTimeMS: 0 } ntoreturn:1 keyUpdates:0 numYields:0 locks(micros) W:32 r:67 w:44 reslen:321 120ms
 m30999| 2015-01-27T17:38:38.494+1100 [conn1] moveChunk result: { cause: { active: false, ns: "foo.bar", from: "localhost:30000", min: { _id: 0.0 }, max: { _id: MaxKey }, shardKeyPattern: { _id: 1.0 }, state: "done", counts: { cloned: 1, clonedBytes: 18, catchup: 0, steady: 0 }, ok: 1.0 }, ok: 0.0, errmsg: "_recvChunkCommit failed!" }

After the failed migration (which has left behind 1 orphan), the rangeDeleter is only checking MinKey -> 0. The range 0 -> MaxKey is not scanned, and so the orphan is not cleaned:

 m30001| 2015-01-27T17:38:38.496+1100 [conn1] remotely refreshing metadata for foo.bar based on current shard version 0|0||54c7326e866945453e19933e, current metadata version is 1|2||54c7326e866945453e19933e
 m30001| 2015-01-27T17:38:38.496+1100 [conn1] metadata of collection foo.bar already up to date (shard version : 0|0||54c7326e866945453e19933e, took 0ms)
 m30001| 2015-01-27T17:38:38.496+1100 [conn1] Deleter starting delete for: foo.bar from { _id: MinKey } -> { _id: 0.0 }, with opId: 41
 m30001| 2015-01-27T17:38:38.496+1100 [conn1] rangeDeleter deleted 0 documents for foo.bar from { _id: MinKey } -> { _id: 0.0 }
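The leftover orphan can be confirmed by querying the TO shard directly, for instance:

 // Sketch only: the document cloned by the failed migration is still on shard0001,
 // sitting in the 0 -> MaxKey range that cleanupOrphaned now skips.
 var shard1 = new Mongo("localhost:30001");
 print(shard1.getDB("foo").bar.count({ _id: { $gte: 0 } }));  // expected: 1 (the orphan)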



 Comments   
Comment by Kaloian Manassiev [ 15/Nov/21 ]

Gone away as of 4.4 as a result of the resumable range deleter project.

Comment by Kevin Pulo [ 03/Feb/15 ]

Attached a script to identify occurrences of this issue from config.changelog. This helps determine which shard primaries need to be stepped down to work around the issue.
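The attached script is not reproduced here, but the relevant entries can be found with a query along these lines (field names taken from the moveChunk.from event in the log above):

 // Sketch only, not the attached script: aborted donor-side migrations are recorded in
 // config.changelog as moveChunk.from events with note "aborted".
 db.getSiblingDB("config").changelog.find({
     what: "moveChunk.from",
     "details.note": "aborted"
 }).sort({ time: 1 }).forEach(printjson);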

Comment by Randolph Tan [ 30/Jan/15 ]

Once the recipient side has completed its side of the migration, the pending status for a range is cleared when it refreshes its metadata and sees that it now owns the chunk. The problem is that when the commit is not successful, the range remains in the pending state indefinitely. The reason it doesn't clear even if the chunk is later moved to another shard is that the current refresh logic filters out chunks the shard doesn't own. One potential fix is to add the "pending chunks" to our differ logic in some way.

Comment by Kevin Pulo [ 27/Jan/15 ]

Updated jstests to work on latest development version (3.0.0-rc6).
