[SERVER-74739] Return more clear error when migration fails because an overlapping range hasn't been deleted yet Created: 10/Mar/23  Updated: 29/Oct/23  Resolved: 20/Jun/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0

Type: Improvement Priority: Major - P3
Reporter: Joanna Cheng Assignee: Silvia Surroca
Resolution: Fixed Votes: 0
Labels: sharding-wfbf-day, skunkelodeon-odcs
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Sharding EMEA
Backwards Compatibility: Fully Compatible
Sprint: Sharding EMEA 2023-06-26
Participants:

 Description   

Updated Description

After moving a chunk away, the range deleter won't run for 15 minutes by default. If a user manually tries to move a chunk with an overlapping range back to the donor shard within those 15 minutes, the moveChunk fails with a hard-to-understand error:

"state":"fail","errmsg":"ExceededTimeLimit: Failed to delete orphaned test.shards range [{ a: MinKey }, { a: 500 }) :: caused by :: operation exceeded time limit" 

We can consider improving the error reporting so it's clearer to the user that nothing is actually wrong.
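
A sketch of how a user could see this state for themselves (assuming the 4.4+ range deleter; the shard connection and the config.rangeDeletions field names are assumptions, not something shown in this ticket): connect directly to the shard that is about to receive the chunk back and look for a pending range deletion task covering the overlapping range.

// Sketch only. Run against the primary of the shard the chunk is being moved back to, not against mongos.
// config.rangeDeletions is the shard-local queue of scheduled range deletions; field names may vary by version.
use config
db.rangeDeletions.find({ nss: "test.shards" })
// A document whose range overlaps [{ a: MinKey }, { a: 500 }) means the deletion from the earlier
// migration is still queued, which is what blocks the moveChunk back.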

Original Description

I'm auth'ed as the root user

[direct: mongos] test> db.runCommand({connectionStatus:1})
{
  authInfo: {
    authenticatedUsers: [ { user: 'admin', db: 'admin' } ],
    authenticatedUserRoles: [ { role: 'root', db: 'admin' } ]
  },
  ok: 1,
  '$clusterTime': {
    clusterTime: Timestamp({ t: 1678429451, i: 2 }),
    signature: {
      hash: Binary(Buffer.from("0777f8b4db7cf21742fef145c96fe9125a51e183", "hex"), 0),
      keyId: Long("7208009756202500117")
    }
  },
  operationTime: Timestamp({ t: 1678429451, i: 2 })
}

Initial state

[direct: mongos] test> sh.status()
shardingVersion
{ _id: 1, clusterId: ObjectId("6407feaf46705024b23e5f69") }
---
shards
[
  {
    _id: 'config',
    host: 'csshard/localhost:27019',
    state: 1,
    topologyTime: Timestamp({ t: 1678339266, i: 2 })
  },
  {
    _id: 'shard2',
    host: 'shard2/localhost:27021',
    state: 1,
    topologyTime: Timestamp({ t: 1678338347, i: 1 })
  }
]
---
active mongoses
[ { '7.0.0-alpha-538-g7cec1b7': 1 } ]
---
autosplit
{ 'Currently enabled': 'yes' }
---
balancer
{ 'Currently enabled': 'yes', 'Currently running': 'no' }
---
databases
[
  {
    database: { _id: 'config', primary: 'config', partitioned: true },
    collections: {
      'config.system.sessions': {
        shardKey: { _id: 1 },
        unique: false,
        balancing: true,
        chunkMetadata: [ { shard: 'shard2', nChunks: 1024 } ],
        chunks: [
          'too many chunks to print, use verbose if you want to force print'
        ],
        tags: []
      }
    }
  },
  {
    database: {
      _id: 'test',
      primary: 'config',
      partitioned: false,
      version: {
        uuid: UUID("5ed830ec-10da-42f9-b92a-b7e3f78df969"),
        timestamp: Timestamp({ t: 1678245720, i: 1 }),
        lastMod: 2
      }
    },
    collections: {
      'test.bar': {
        shardKey: { a: 1 },
        unique: true,
        balancing: true,
        chunkMetadata: [ { shard: 'shard2', nChunks: 1 } ],
        chunks: [
          { min: { a: MinKey() }, max: { a: MaxKey() }, 'on shard': 'shard2', 'last modified': Timestamp({ t: 1, i: 0 }) }
        ],
        tags: []
      },
      'test.baz': {
        shardKey: { a: 1 },
        unique: true,
        balancing: true,
        chunkMetadata: [ { shard: 'shard2', nChunks: 1 } ],
        chunks: [
          { min: { a: MinKey() }, max: { a: MaxKey() }, 'on shard': 'shard2', 'last modified': Timestamp({ t: 1, i: 0 }) }
        ],
        tags: []
      },
      'test.shards': {
        shardKey: { a: 1 },
        unique: false,
        balancing: true,
        chunkMetadata: [ { shard: 'config', nChunks: 2 } ],
        chunks: [
          { min: { a: MinKey() }, max: { a: 500 }, 'on shard': 'config', 'last modified': Timestamp({ t: 3, i: 1 }) },
          { min: { a: 500 }, max: { a: MaxKey() }, 'on shard': 'config', 'last modified': Timestamp({ t: 3, i: 2 }) }
        ],
        tags: []
      }
    }
  }
]

[direct: mongos] test> sh.moveChunk("test.shards", {a:200}, "shard2")
MongoServerError: Data transfer error: ExceededTimeLimit: Failed to delete orphaned test.shards range [{ a: MinKey }, { a: 500 }) :: caused by :: operation exceeded time limit

That's weird. Try adding a longer maxTimeMS?

[direct: mongos] test> sh.moveChunk("test.shards", {a:200}, "shard2").maxTimeMS(100000)
MongoServerError: Data transfer error: ExceededTimeLimit: Failed to delete orphaned test.shards range [{ a: MinKey }, { a: 500 }) :: caused by :: operation exceeded time limit

Maybe this is something in mongosh that's adding a maxTimeMS. Run the server command directly:

[direct: mongos] test> db.adminCommand({moveChunk: "test.shards", find: {a:200}, to: "shard2"})
MongoServerError: Data transfer error: ExceededTimeLimit: Failed to delete orphaned test.shards range [{ a: MinKey }, { a: 500 }) :: caused by :: operation exceeded time limit

I ran one with a timestamp printed first so I could match it in the logs:

[direct: mongos] test> printjson(new Date()); db.adminCommand({moveChunk: "test.shards", find: {a:200}, to: "shard2"})
ISODate("2023-03-10T06:20:12.878Z")
MongoServerError: Data transfer error: ExceededTimeLimit: Failed to delete orphaned test.shards range [{ a: MinKey }, { a: 500 }) :: caused by :: operation exceeded time limit

Here are the logs from the csshard (the source shard):

{"t":{"$date":"2023-03-10T17:20:12.896+11:00"},"s":"I",  "c":"MIGRATE",  "id":22016,   "ctx":"MoveChunk","msg":"Starting chunk migration donation","attr":{"requestParameters":{"_shardsvrMoveRange":"test.shards","toShard":"shard2","min":{"a":{"$minKey":1}},"max":{"a":500},"waitForDelete":false,"epoch":{"$oid":"64080bb146705024b23ed319"},"fromShard":"config","maxChunkSizeBytes":134217728,"forceJumbo":0,"secondaryThrottle":false},"collectionEpoch":{"$oid":"64080bb146705024b23ed319"}}}
{"t":{"$date":"2023-03-10T17:20:12.898+11:00"},"s":"I",  "c":"SH_REFR",  "id":4619901, "ctx":"CatalogCache-20","msg":"Refreshed cached collection","attr":{"namespace":"test.shards","lookupSinceVersion":"3|2||64080bb146705024b23ed319||Timestamp(1678248881, 3)","newVersion":"{ chunkVersion: { e: ObjectId('64080bb146705024b23ed319'), t: Timestamp(1678248881, 3), v: Timestamp(3, 2) }, forcedRefreshSequenceNum: 18949, epochDisambiguatingSequenceNum: 18938 }","timeInStore":"{ chunkVersion: \"None\", forcedRefreshSequenceNum: 18948, epochDisambiguatingSequenceNum: 18937 }","durationMillis":1}}
{"t":{"$date":"2023-03-10T17:20:12.899+11:00"},"s":"I",  "c":"SHARDING", "id":22080,   "ctx":"MoveChunk","msg":"About to log metadata event","attr":{"namespace":"changelog","event":{"_id":"M-K4F3614CKP:27019-2023-03-10T17:20:12.898+11:00-640acc1c34a26fa9736c391b","server":"M-K4F3614CKP:27019","shard":"config","clientAddr":"","time":{"$date":"2023-03-10T06:20:12.898Z"},"what":"moveChunk.start","ns":"test.shards","details":{"min":{"a":{"$minKey":1}},"max":{"a":500},"from":"config","to":"shard2"}}}}
{"t":{"$date":"2023-03-10T17:20:13.931+11:00"},"s":"I",  "c":"SHARDING", "id":21993,   "ctx":"MoveChunk","msg":"moveChunk data transfer progress","attr":{"response":{"waited":true,"active":true,"sessionId":"config_shard2_640acc1c34a26fa9736c391d","ns":"test.shards","from":"csshard/localhost:27019","fromShardId":"config","min":{"a":{"$minKey":1}},"max":{"a":500},"shardKeyPattern":{"a":1},"supportsCriticalSectionDuringCatchUp":true,"state":"ready","counts":{"cloned":0,"clonedBytes":0,"catchup":0,"steady":0},"ok":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1678429212,"i":5}},"signature":{"hash":{"$binary":{"base64":
"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$configTime":{"$timestamp":{"t":1678429212,"i":5}},"$topologyTime":{"$timestamp":{"t":1678339266,"i":2}},"operationTime":{"$timestamp":{"t":167842921
0,"i":1}}},"memoryUsedBytes":0,"docsRemainingToClone":500,"untransferredModsSizeBytes":0}}
....
{"t":{"$date":"2023-03-10T17:20:21.979+11:00"},"s":"I",  "c":"SHARDING", "id":21993,   "ctx":"MoveChunk","msg":"moveChunk data transfer progress","attr":{"response":{"waited":true,"active":true,"sessionId":"config_shard2_640acc1c34a26fa9736c391d","ns":"test.shards","from":"csshard/localhost:27019","fromShardId":"config","min":{"a":{"$minKey":1}},"max":{"a":500},"shardKeyPattern":{"a":1},"supportsCriticalSectionDuringCatchUp":true,"state":"ready","counts":{"cloned":0,"clonedBytes":0,"catchup":0,"steady":0},"ok":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1678429220,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$configTime":{"$timestamp":{"t":1678429220,"i":1}},"$topologyTime":{"$timestamp":{"t":1678339266,"i":2}},"operationTime":{"$timestamp":{"t":1678429220,"i":1}}},"memoryUsedBytes":0,"docsRemainingToClone":500,"untransferredModsSizeBytes":0}}
{"t":{"$date":"2023-03-10T17:20:22.934+11:00"},"s":"I",  "c":"SHARDING", "id":21993,   "ctx":"MoveChunk","msg":"moveChunk data transfer progress","attr":{"response":{"waited":true,"active":true,"sessionId":"config_shard2_640acc1c34a26fa9736c391d","ns":"test.shards","from":"csshard/localhost:27019","fromShardId":"config","min":{"a":{"$minKey":1}},"max":{"a":500},"shardKeyPattern":{"a":1},"supportsCriticalSectionDuringCatchUp":true,"state":"fail","errmsg":"ExceededTimeLimit: Failed to delete orphaned test.shards range [{ a: MinKey }, { a: 500 }) :: caused by :: operation exceeded time limit","counts":{"cloned":0,"clonedBytes":0,"catchup":0,"steady":0},"ok":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1678429221,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$configTime":{"$timestamp":{"t":1678429221,"i":1}},"$topologyTime":{"$timestamp":{"t":1678339266,"i":2}},"operationTime":{"$timestamp":{"t":1678429220,"i":1}}},"memoryUsedBytes":0,"docsRemainingToClone":500,"untransferredModsSizeBytes":0}}
{"t":{"$date":"2023-03-10T17:20:22.935+11:00"},"s":"I",  "c":"SHARDING", "id":22080,   "ctx":"MoveChunk","msg":"About to log metadata event","attr":{"namespace":"changelog","event":{"_id":"M-K4F3614CKP:27019-2023-03-10T17:20:22.935+11:00-640acc2634a26fa9736c3988","server":"M-K4F3614CKP:27019","shard":"config","clientAddr":"","time":{"$date":"2023-03-10T06:20:22.935Z"},"what":"moveChunk.error","ns":"test.shards","details":{"min":{"a":{"$minKey":1}},"max":{"a":500},"from":"config","to":"shard2"}}}}
{"t":{"$date":"2023-03-10T17:20:22.944+11:00"},"s":"I",  "c":"MIGRATE",  "id":23893,   "ctx":"MoveChunk","msg":"MigrationCoordinator delivering decision to self and to recipient","attr":{"decision":"aborted","migrationId":{"uuid":{"$uuid":"825014bd-b9e8-469f-ab54-6eb397efda20"}}}}
{"t":{"$date":"2023-03-10T17:20:22.965+11:00"},"s":"I",  "c":"SHARDING", "id":22080,   "ctx":"MoveChunk","msg":"About to log metadata event","attr":{"namespace":"changelog","event":{"_id":"M-K4F3614CKP:27019-2023-03-10T17:20:22.965+11:00-640acc2634a26fa9736c398e","server":"M-K4F3614CKP:27019","shard":"config","clientAddr":"","time":{"$date":"2023-03-10T06:20:22.965Z"},"what":"moveChunk.from","ns":"test.shards","details":{"step 1 of 6":0,"step 2 of 6":1,"step 3 of 6":27,"min":{"a":{"$minKey":1}},"max":{"a":500},"to":"shard2","from":"config","note":"aborted"}}}}
{"t":{"$date":"2023-03-10T17:20:22.971+11:00"},"s":"W",  "c":"SHARDING", "id":23777,   "ctx":"MoveChunk","msg":"Error while doing moveChunk","attr":{"error":"OperationFailed: Data transfer error: ExceededTimeLimit: Failed to delete orphaned test.shards range [{ a: MinKey }, { a: 500 }) :: caused by :: operation exceeded time limit"}}
{"t":{"$date":"2023-03-10T17:20:22.975+11:00"},"s":"I",  "c":"MIGRATE",  "id":6386803, "ctx":"MoveChunk-4","msg":"Unregistering donate chunk","attr":{"namespace":"test.shards","min":{"a":{"$minKey":1}},"max":{"a":500},"toShardId":"shard2"}}
{"t":{"$date":"2023-03-10T17:20:22.975+11:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn72","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","command":{"_shardsvrMoveRange":"test.shards","toShard":"shard2","min":{"a":{"$minKey":1}},"max":{"a":500},"waitForDelete":false,"epoch":{"$oid":"64080bb146705024b23ed319"},"fromShard":"config","maxChunkSizeBytes":134217728,"forceJumbo":0,"secondaryThrot
tle":false,"writeConcern":{"w":1,"wtimeout":0},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1678429212,"i":1}},"signature":{"hash":{"$binary":{"base64":"ZRX3cyfrztWcyQFiASw9oSW+xOg=","subType":"0"}},"keyId
":7208009756202500117}},"$configTime":{"$timestamp":{"t":1678429212,"i":1}},"$topologyTime":{"$timestamp":{"t":1678339266,"i":2}},"mayBypassWriteBlocking":true,"$db":"admin"},"numYields":0,"ok":0,"errMsg":"Data transfer error: ExceededTimeLimit: Failed to delete orphaned test.shards range [{ a: MinKey }, { a: 500 }) :: caused by :: operation exceeded time limit","errName":"OperationFailed","errCode":96,"reslen":417
,"locks":{},"writeConcern":{"w":1,"wtimeout":0,"provenance":"clientSupplied"},"remote":"127.0.0.1:52326","protocol":"op_msg","durationMillis":10083}}

Note: no documents actually moved ("docsRemainingToClone": 500), yet the timeout is about failing to delete an orphaned test.shards range. We haven't copied anything, so why is the range orphaned?
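
(As the comments below establish, the range is left over from an earlier migration of the same chunk off the recipient, not from this one.) A hedged sketch of one way to unblock the retry, assuming the 4.4+ behaviour of cleanupOrphaned, is to ask the recipient shard to finish its pending deletions before moving the chunk back:

// Sketch only. Run on the admin database of the primary of the shard being moved back to (shard2 here), not on mongos.
// In 4.4+, cleanupOrphaned waits for scheduled range deletions for the namespace to complete before returning.
db.adminCommand({ cleanupOrphaned: "test.shards" })
// Once this returns ok: 1, retrying sh.moveChunk("test.shards", { a: 200 }, "shard2") should no longer hit the overlap.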

These documents are nothing exciting:

[direct: mongos] test> db.shards.findOne()
{ _id: ObjectId("640ac95879d1df6155df1017"), a: 0 }

I tried this on a cluster running the same mongod binary version, but without a config shard (i.e. with a separate CSRS). It worked fine, as expected.

[direct: mongos] test> sh.status()
shardingVersion
{ _id: 1, clusterId: ObjectId("6406d64c340177022f4fb4b1") }
---
shards
[
  {
    _id: 'shard01',
    host: 'shard01/localhost:30001',
    state: 1,
    topologyTime: Timestamp({ t: 1678169678, i: 3 })
  },
  {
    _id: 'shard02',
    host: 'shard02/localhost:30002',
    state: 1,
    topologyTime: Timestamp({ t: 1678169678, i: 8 })
  }
]
---
active mongoses
[ { '7.0.0-alpha-538-g7cec1b7': 1 } ]
---
autosplit
{ 'Currently enabled': 'yes' }
---
balancer
{ 'Currently enabled': 'yes', 'Currently running': 'no' }
---
databases
[
  {
    database: { _id: 'config', primary: 'config', partitioned: true },
    collections: {
      'config.system.sessions': {
        shardKey: { _id: 1 },
        unique: false,
        balancing: true,
        chunkMetadata: [ { shard: 'shard01', nChunks: 1024 } ],
        chunks: [
          'too many chunks to print, use verbose if you want to force print'
        ],
        tags: []
      }
    }
  },
  {
    database: {
      _id: 'test',
      primary: 'shard02',
      partitioned: false,
      version: {
        uuid: UUID("9caa0eb3-6206-4469-925e-7b73bbfafdcf"),
        timestamp: Timestamp({ t: 1678247389, i: 1 }),
        lastMod: 1
      }
    },
    collections: {
      'test.baz': {
        shardKey: { a: 1 },
        unique: true,
        balancing: true,
        chunkMetadata: [ { shard: 'shard02', nChunks: 1 } ],
        chunks: [
          { min: { a: MinKey() }, max: { a: MaxKey() }, 'on shard': 'shard02', 'last modified': Timestamp({ t: 1, i: 0 }) }
        ],
        tags: []
      },
      'test.foo': {
        shardKey: { a: 'hashed' },
        unique: false,
        balancing: true,
        chunkMetadata: [
          { shard: 'shard01', nChunks: 2 },
          { shard: 'shard02', nChunks: 2 }
        ],
        chunks: [
          { min: { a: MinKey() }, max: { a: Long("-4611686018427387902") }, 'on shard': 'shard01', 'last modified': Timestamp({ t: 1, i: 0 }) },
          { min: { a: Long("-4611686018427387902") }, max: { a: Long("0") }, 'on shard': 'shard01', 'last modified': Timestamp({ t: 1, i: 1 }) },
          { min: { a: Long("0") }, max: { a: Long("4611686018427387902") }, 'on shard': 'shard02', 'last modified': Timestamp({ t: 1, i: 2 }) },
          { min: { a: Long("4611686018427387902") }, max: { a: MaxKey() }, 'on shard': 'shard02', 'last modified': Timestamp({ t: 1, i: 5 }) }
        ],
        tags: []
      }
    }
  }
]
[direct: mongos] test> sh.moveChunk("test.foo", {a:1}, "shard01")
{
  millis: 158,
  ok: 1,
  '$clusterTime': {
    clusterTime: Timestamp({ t: 1678429378, i: 36 }),
    signature: {
      hash: Binary(Buffer.from("0000000000000000000000000000000000000000", "hex"), 0),
      keyId: Long("0")
    }
  },
  operationTime: Timestamp({ t: 1678429378, i: 36 })
}


Note for posterity: this second cluster doesn't have auth enabled, so this might be an auth vs. no-auth difference.
But note that I ran the failing test as the root user; I can't escalate my privileges any further than that.



 Comments   
Comment by Githook User [ 19/Jun/23 ]

Author:

{'name': 'Silvia Surroca', 'email': 'silvia.surroca@mongodb.com', 'username': 'silviasuhu'}

Message: SERVER-74739 Return more clear error when migration fails because an overlapping range hasn't been deleted yet
Branch: master
https://github.com/mongodb/mongo/commit/f42da586464aac5d24ea215edd668e14e31375ba

Comment by Jack Mulrow [ 14/Apr/23 ]

Returning a clearer error definitely makes sense to me. I don't think users run moveChunk manually very often (if at all), so it probably isn't high priority, but we can at least evaluate how much work this would be. This isn't related to PM-2290, though, so I'm moving it to the general Sharding NYC backlog for triage and have updated the description/title.

Comment by Joanna Cheng [ 14/Mar/23 ]

Yes, it did! I moved it just before this. I had no idea there was a 15-minute timeout.

[direct: mongos] admin> sh.moveChunk("test.shards", {a:1}, "config")
{
  millis: 209,
  ok: 1,
  '$clusterTime': {
    clusterTime: Timestamp({ t: 1678428416, i: 7 }),
    signature: {
      hash: Binary(Buffer.from("9f6eefe121b8cd7f9040137ff959d750f6e1143d", "hex"), 0),
      keyId: Long("7208009756202500117")
    }
  },
  operationTime: Timestamp({ t: 1678428416, i: 7 })
}
[direct: mongos] admin> sh.status()

For my test on the "normal cluster" I did not attempt the two chunk moves within 15 minutes of each other. I re-tested, and I get the same error if I do the two chunk moves too quickly, so I suspect the same 15-minute delay applies:

[direct: mongos] test> sh.moveChunk("test.foo", {a:700}, "shard01")
{
  millis: 151,
  ok: 1,
  '$clusterTime': {
    clusterTime: Timestamp({ t: 1678764087, i: 33 }),
    signature: {
      hash: Binary(Buffer.from("0000000000000000000000000000000000000000", "hex"), 0),
      keyId: Long("0")
    }
  },
  operationTime: Timestamp({ t: 1678764087, i: 33 })
}
[direct: mongos] test> sh.moveChunk("test.foo", {a:700}, "shard02")
MongoServerError: Data transfer error: ExceededTimeLimit: Failed to delete orphaned test.foo range [{ a: 4611686018427387902 }, { a: MaxKey }) :: caused by :: operation exceeded time limit
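
For what it's worth, a sketch of how the second move could be made to succeed immediately, using the documented _waitForDelete option of the raw moveChunk command (namespaces and shard names mirror the test above; whether this is advisable outside a test environment is a separate question):

// Sketch only. _waitForDelete makes the donor delete the migrated range before the command returns,
// instead of deferring the deletion for orphanCleanupDelaySecs.
db.adminCommand({ moveChunk: "test.foo", find: { a: 700 }, to: "shard01", _waitForDelete: true })
// With no deferred deletion left behind, moving the chunk straight back is not blocked:
db.adminCommand({ moveChunk: "test.foo", find: { a: 700 }, to: "shard02" })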


Back to the issue on the config-shard cluster: the error logged on the target shard is much clearer:

{"t":{"$date":"2023-03-10T17:10:06.811+11:00"},"s":"I",  "c":"MIGRATE",  "id":22000,   "ctx":"migrateThread","msg":"Starting receiving end of chunk migration","attr":{"chunkMin":{"a":{"$minKey":1}},"chunkMax":{"a":500},"namespace":"test.shards","fromShard":"config","epoch":{"$oid":"64080bb146705024b23ed319"},"sessionId":"config_shard2_640ac9be34a26fa9736c2129","migrationI
d":{"uuid":{"$uuid":"712d8505-6e90-42a5-938e-23eb3d07f60c"}}}}
{"t":{"$date":"2023-03-10T17:10:06.812+11:00"},"s":"I",  "c":"MIGRATE",  "id":22001,   "ctx":"migrateThread","msg":"Migration paused because the requested range overlaps with a range already scheduled for deletion","attr":{"namespace":"test.shards","range":"[{ a: MinKey }, { a: 500 })","migrationId":{"uuid":{"$uuid":"712d8505-6e90-42a5-938e-23eb3d07f60c"}}}}
{"t":{"$date":"2023-03-10T17:10:06.812+11:00"},"s":"I",  "c":"MIGRATE",  "id":21919,   "ctx":"migrateThread","msg":"Waiting for deletion of orphans","attr":{"namespace":"test.shards","orphanRange":{"min":{"a":{"$minKey":1}},"max":{"a":500}}}}
...
{"t":{"$date":"2023-03-10T17:10:16.822+11:00"},"s":"I",  "c":"SHARDING", "id":5087102, "ctx":"migrateThread","msg":"Recipient failed to copy oplog entries for retryable writes and transactions from donor","attr":{"namespace":"test.shards","migrationSessionId":"config_shard2_640ac9be34a26fa9736c2129","fromShard":"config","error":"ExceededTimeLimit: Failed to delete orphaned test.shards range [{ a: MinKey }, { a: 500 }) :: caused by :: operation exceeded time limit"}}
{"t":{"$date":"2023-03-10T17:10:16.822+11:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn10","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","command":{"_recvChunkStatus":"test.shards","waitForSteadyOrDone":true,"sessionId":"config_shard2_640ac9be34a26fa9736c2129","$clusterTime":{"clusterTime":{"$timestamp":{"t":1678428615,"i":1}},"signature":{"hash":{"$binary":{"base64":"gWmQUzLfsjbaTENbkRaVUomirJ8=","subType":"0"}},"keyId":7208009756202500117}},"$configTime":{"$timestamp":{"t":1678428615,"i":1}},"$topologyTime":{"$timestamp":{"t":1678339266,"i":2}},"$db":"admin"},"numYields":0,"reslen":707,"locks":{"FeatureCompatibilityVersion":{"acquireCount":{"r":2}},"Global":{"acquireCount":{"r":2}},"Mutex":{"acquireCount":{"r":2}}},"storage":{},"remote":"127.0.0.1:51457","protocol":"op_msg","durationMillis":944}}
{"t":{"$date":"2023-03-10T17:10:16.823+11:00"},"s":"I",  "c":"SHARDING", "id":22080,   "ctx":"migrateThread","msg":"About to log metadata event","attr":{"namespace":"changelog","event":{"_id":"M-K4F3614CKP:27021-2023-03-10T17:10:16.823+11:00-640ac9c8c21ec3c2dfcd262f","server":"M-K4F3614CKP:27021","shard":"shard2","clientAddr":"","time":{"$date":"2023-03-10T06:10:16.823Z"},"what":"moveChunk.to","ns":"test.shards","details":{"min":{"a":{"$minKey":1}},"max":{"a":500},"to":"shard2","from":"config","note":"aborted","errmsg":"ExceededTimeLimit: Failed to delete orphaned test.shards range [{ a: MinKey }, { a: 500 }) :: caused by :: operation exceeded time limit"}}}}
{"t":{"$date":"2023-03-10T17:10:16.829+11:00"},"s":"I",  "c":"MIGRATE",  "id":5004703, "ctx":"migrateThread","msg":"clearReceiveChunk","attr":{"currentKeys":"[{ a: MinKey }, { a: 500 })"}}

I wonder if surfacing "Migration paused because the requested range overlaps with a range already scheduled for deletion" would help connect the dots about what is stuck. What do you think?

Comment by Jack Mulrow [ 10/Mar/23 ]

Did the chunk that couldn't be moved previously belong to the shard you were moving it to? One explanation could be the orphanCleanupDelaySecs parameter, which prevents deleting orphaned ranges until 15 minutes after their chunk was moved away, and so blocks moving that chunk back until the deletion has finished. It is odd that this doesn't happen for normal shards, but I could imagine logic somewhere that lowers the parameter for shards and assumes it doesn't need to for a config server node.
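
For reference, a sketch of inspecting and (in a test environment only) lowering that parameter on a shard primary; getParameter/setParameter are the standard mechanism, and the value 0 here is just an illustration:

// Sketch only. Run on the admin database of the shard's primary.
db.adminCommand({ getParameter: 1, orphanCleanupDelaySecs: 1 })  // read the current delay (default 900 seconds)
db.adminCommand({ setParameter: 1, orphanCleanupDelaySecs: 0 })  // delete orphaned ranges immediately; testing only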

Comment by Joanna Cheng [ 10/Mar/23 ]

It occurred to me afterwards that in the second cluster the chunk was actually empty, so I inserted a document and moved it back:

[direct: mongos] test> db.foo.insert({a:1})
DeprecationWarning: Collection.insert() is deprecated. Use insertOne, insertMany, or bulkWrite.
{
  acknowledged: true,
  insertedIds: { '0': ObjectId("640ad3637df14c616e25ccb2") }
}
[direct: mongos] test> sh.moveChunk("test.foo", {a:1}, "shard02")
{
  millis: 164,
  ok: 1,
  '$clusterTime': {
    clusterTime: Timestamp({ t: 1678431083, i: 36 }),
    signature: {
      hash: Binary(Buffer.from("0000000000000000000000000000000000000000", "hex"), 0),
      keyId: Long("0")
    }
  },
  operationTime: Timestamp({ t: 1678431083, i: 36 })
}

I can't imagine this being a hashed vs. non-hashed shard key problem, but just in case I did the same for test.baz (sharded on { a: 1 }; initial state is one chunk on shard02):

// split the chunk to get 2 chunks on shard02
[direct: mongos] test> sh.splitAt("test.baz", {a:500})
{
  ok: 1,
  '$clusterTime': {
    clusterTime: Timestamp({ t: 1678431156, i: 6 }),
    signature: {
      hash: Binary(Buffer.from("0000000000000000000000000000000000000000", "hex"), 0),
      keyId: Long("0")
    }
  },
  operationTime: Timestamp({ t: 1678431156, i: 6 })
}
[direct: mongos] test> db.baz.insert({a:700})
{
  acknowledged: true,
  insertedIds: { '0': ObjectId("640ad3d47df14c616e25ccb3") }
}
// move the chunk on shard02 with the document, to shard01
[direct: mongos] test> sh.moveChunk("test.baz", {a:700}, "shard01")
{
  millis: 170,
  ok: 1,
  '$clusterTime': {
    clusterTime: Timestamp({ t: 1678431203, i: 41 }),
    signature: {
      hash: Binary(Buffer.from("0000000000000000000000000000000000000000", "hex"), 0),
      keyId: Long("0")
    }
  },
  operationTime: Timestamp({ t: 1678431203, i: 41 })
}
