  Core Server / SERVER-74739

Return a clearer error when a migration fails because an overlapping range hasn't been deleted yet

    • Sharding EMEA
    • Fully Compatible
    • Sharding EMEA 2023-06-26

      Updated Description

      After moving a chunk away, the range deleter won't run for 15 minutes by default (the orphanCleanupDelaySecs server parameter defaults to 900 seconds). If a user manually tries to move a chunk with an overlapping range back to the donor shard within those 15 minutes, the moveChunk fails with a hard-to-understand error:

      "state":"fail","errmsg":"ExceededTimeLimit: Failed to delete orphaned test.shards range [{ a: MinKey }, { a: 500 }) :: caused by :: operation exceeded time limit" 

      We can consider improving the error reporting so it's clearer to the user that nothing is wrong.
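
      A hedged sketch of possible mitigations, assuming the failure is only the still-queued range deletion (parameter semantics per the documentation, not re-verified against this build):

      // Option 1: shorten the delay before the range deleter runs
      // (orphanCleanupDelaySecs defaults to 900 seconds, i.e. 15 minutes).
      // Run against the primary of the shard still holding the orphans.
      db.adminCommand({ setParameter: 1, orphanCleanupDelaySecs: 60 })

      // Option 2: when issuing the *original* migration, have it wait for
      // its delete phase, so no orphans are left to block a later move-back.
      db.adminCommand({
        moveChunk: "test.shards",
        find: { a: 200 },
        to: "shard2",
        _waitForDelete: true
      })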

      Original Description

      I'm auth'ed as the root user:

      [direct: mongos] test> db.runCommand({connectionStatus:1})
      {
        authInfo: {
          authenticatedUsers: [ { user: 'admin', db: 'admin' } ],
          authenticatedUserRoles: [ { role: 'root', db: 'admin' } ]
        },
        ok: 1,
        '$clusterTime': {
          clusterTime: Timestamp({ t: 1678429451, i: 2 }),
          signature: {
            hash: Binary(Buffer.from("0777f8b4db7cf21742fef145c96fe9125a51e183", "hex"), 0),
            keyId: Long("7208009756202500117")
          }
        },
        operationTime: Timestamp({ t: 1678429451, i: 2 })
      }
      

      Initial state

      [direct: mongos] test> sh.status()
      shardingVersion
      { _id: 1, clusterId: ObjectId("6407feaf46705024b23e5f69") }
      ---
      shards
      [
        {
          _id: 'config',
          host: 'csshard/localhost:27019',
          state: 1,
          topologyTime: Timestamp({ t: 1678339266, i: 2 })
        },
        {
          _id: 'shard2',
          host: 'shard2/localhost:27021',
          state: 1,
          topologyTime: Timestamp({ t: 1678338347, i: 1 })
        }
      ]
      ---
      active mongoses
      [ { '7.0.0-alpha-538-g7cec1b7': 1 } ]
      ---
      autosplit
      { 'Currently enabled': 'yes' }
      ---
      balancer
      { 'Currently enabled': 'yes', 'Currently running': 'no' }
      ---
      databases
      [
        {
          database: { _id: 'config', primary: 'config', partitioned: true },
          collections: {
            'config.system.sessions': {
              shardKey: { _id: 1 },
              unique: false,
              balancing: true,
              chunkMetadata: [ { shard: 'shard2', nChunks: 1024 } ],
              chunks: [
                'too many chunks to print, use verbose if you want to force print'
              ],
              tags: []
            }
          }
        },
        {
          database: {
            _id: 'test',
            primary: 'config',
            partitioned: false,
            version: {
              uuid: UUID("5ed830ec-10da-42f9-b92a-b7e3f78df969"),
              timestamp: Timestamp({ t: 1678245720, i: 1 }),
              lastMod: 2
            }
          },
          collections: {
            'test.bar': {
              shardKey: { a: 1 },
              unique: true,
              balancing: true,
              chunkMetadata: [ { shard: 'shard2', nChunks: 1 } ],
              chunks: [
                { min: { a: MinKey() }, max: { a: MaxKey() }, 'on shard': 'shard2', 'last modified': Timestamp({ t: 1, i: 0 }) }
              ],
              tags: []
            },
            'test.baz': {
              shardKey: { a: 1 },
              unique: true,
              balancing: true,
              chunkMetadata: [ { shard: 'shard2', nChunks: 1 } ],
              chunks: [
                { min: { a: MinKey() }, max: { a: MaxKey() }, 'on shard': 'shard2', 'last modified': Timestamp({ t: 1, i: 0 }) }
              ],
              tags: []
            },
            'test.shards': {
              shardKey: { a: 1 },
              unique: false,
              balancing: true,
              chunkMetadata: [ { shard: 'config', nChunks: 2 } ],
              chunks: [
                { min: { a: MinKey() }, max: { a: 500 }, 'on shard': 'config', 'last modified': Timestamp({ t: 3, i: 1 }) },
                { min: { a: 500 }, max: { a: MaxKey() }, 'on shard': 'config', 'last modified': Timestamp({ t: 3, i: 2 }) }
              ],
              tags: []
            }
          }
        }
      ]
      
      [direct: mongos] test> sh.moveChunk("test.shards", {a:200}, "shard2")
      MongoServerError: Data transfer error: ExceededTimeLimit: Failed to delete orphaned test.shards range [{ a: MinKey }, { a: 500 }) :: caused by :: operation exceeded time limit
      

      That's weird. Try and add a longer maxTimeMS?

      [direct: mongos] test> sh.moveChunk("test.shards", {a:200}, "shard2").maxTimeMS(100000)
      MongoServerError: Data transfer error: ExceededTimeLimit: Failed to delete orphaned test.shards range [{ a: MinKey }, { a: 500 }) :: caused by :: operation exceeded time limit
      

      Maybe it's something in mongosh adding a maxTimeMS. Run the server command directly:

      [direct: mongos] test> db.adminCommand({moveChunk: "test.shards", find: {a:200}, to: "shard2"})
      MongoServerError: Data transfer error: ExceededTimeLimit: Failed to delete orphaned test.shards range [{ a: MinKey }, { a: 500 }) :: caused by :: operation exceeded time limit
      
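      Side note: mongosh's chained .maxTimeMS() is a cursor helper, so it very likely never reached the moveChunk above. If anything, maxTimeMS is a generic command field and would go inside the command document, as sketched below; even then it probably wouldn't help, because the ExceededTimeLimit appears to be a server-internal wait on the recipient for overlapping ranges to be deleted (educated guess: the receiveChunkWaitForRangeDeleterTimeoutMS parameter, whose ~10 s default matches the durationMillis of 10083 in the donor logs further down), not the client's time limit.

      // Hedged sketch: maxTimeMS inside the command document itself.
      db.adminCommand({
        moveChunk: "test.shards",
        find: { a: 200 },
        to: "shard2",
        maxTimeMS: 100000  // generic command field; unlikely to affect the internal wait
      })
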

      I ran one with a timestamp so I could match it in the logs:

      [direct: mongos] test> printjson(new Date()); db.adminCommand({moveChunk: "test.shards", find: {a:200}, to: "shard2"})
      ISODate("2023-03-10T06:20:12.878Z")
      MongoServerError: Data transfer error: ExceededTimeLimit: Failed to delete orphaned test.shards range [{ a: MinKey }, { a: 500 }) :: caused by :: operation exceeded time limit
      

      Here are the logs from the csshard (the donor):

      {"t":{"$date":"2023-03-10T17:20:12.896+11:00"},"s":"I",  "c":"MIGRATE",  "id":22016,   "ctx":"MoveChunk","msg":"Starting chunk migration donation","attr":{"requestParameters":{"_shardsvrMoveRange":"test.shards","toShard":"shard2","min":{"a":{"$minKey":1}},"max":{"a":500},"waitForDelete":false,"epoch":{"$oid":"64080bb146705024b23ed319"},"fromShard":"config","maxChunkSizeBytes":134217728,"forceJumbo":0,"secondaryThrottle":false},"collectionEpoch":{"$oid":"64080bb146705024b23ed319"}}}
      {"t":{"$date":"2023-03-10T17:20:12.898+11:00"},"s":"I",  "c":"SH_REFR",  "id":4619901, "ctx":"CatalogCache-20","msg":"Refreshed cached collection","attr":{"namespace":"test.shards","lookupSinceVersion":"3|2||64080bb146705024b23ed319||Timestamp(1678248881, 3)","newVersion":"{ chunkVersion: { e: ObjectId('64080bb146705024b23ed319'), t: Timestamp(1678248881, 3), v: Timestamp(3, 2) }, forcedRefreshSequenceNum: 18949, epochDisambiguatingSequenceNum: 18938 }","timeInStore":"{ chunkVersion: \"None\", forcedRefreshSequenceNum: 18948, epochDisambiguatingSequenceNum: 18937 }","durationMillis":1}}
      {"t":{"$date":"2023-03-10T17:20:12.899+11:00"},"s":"I",  "c":"SHARDING", "id":22080,   "ctx":"MoveChunk","msg":"About to log metadata event","attr":{"namespace":"changelog","event":{"_id":"M-K4F3614CKP:27019-2023-03-10T17:20:12.898+11:00-640acc1c34a26fa9736c391b","server":"M-K4F3614CKP:27019","shard":"config","clientAddr":"","time":{"$date":"2023-03-10T06:20:12.898Z"},"what":"moveChunk.start","ns":"test.shards","details":{"min":{"a":{"$minKey":1}},"max":{"a":500},"from":"config","to":"shard2"}}}}
      {"t":{"$date":"2023-03-10T17:20:13.931+11:00"},"s":"I",  "c":"SHARDING", "id":21993,   "ctx":"MoveChunk","msg":"moveChunk data transfer progress","attr":{"response":{"waited":true,"active":true,"sessionId":"config_shard2_640acc1c34a26fa9736c391d","ns":"test.shards","from":"csshard/localhost:27019","fromShardId":"config","min":{"a":{"$minKey":1}},"max":{"a":500},"shardKeyPattern":{"a":1},"supportsCriticalSectionDuringCatchUp":true,"state":"ready","counts":{"cloned":0,"clonedBytes":0,"catchup":0,"steady":0},"ok":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1678429212,"i":5}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$configTime":{"$timestamp":{"t":1678429212,"i":5}},"$topologyTime":{"$timestamp":{"t":1678339266,"i":2}},"operationTime":{"$timestamp":{"t":1678429210,"i":1}}},"memoryUsedBytes":0,"docsRemainingToClone":500,"untransferredModsSizeBytes":0}}
      ....
      {"t":{"$date":"2023-03-10T17:20:21.979+11:00"},"s":"I",  "c":"SHARDING", "id":21993,   "ctx":"MoveChunk","msg":"moveChunk data transfer progress","attr":{"response":{"waited":true,"active":true,"sessionId":"config_shard2_640acc1c34a26fa9736c391d","ns":"test.shards","from":"csshard/localhost:27019","fromShardId":"config","min":{"a":{"$minKey":1}},"max":{"a":500},"shardKeyPattern":{"a":1},"supportsCriticalSectionDuringCatchUp":true,"state":"ready","counts":{"cloned":0,"clonedBytes":0,"catchup":0,"steady":0},"ok":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1678429220,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$configTime":{"$timestamp":{"t":1678429220,"i":1}},"$topologyTime":{"$timestamp":{"t":1678339266,"i":2}},"operationTime":{"$timestamp":{"t":1678429220,"i":1}}},"memoryUsedBytes":0,"docsRemainingToClone":500,"untransferredModsSizeBytes":0}}
      {"t":{"$date":"2023-03-10T17:20:22.934+11:00"},"s":"I",  "c":"SHARDING", "id":21993,   "ctx":"MoveChunk","msg":"moveChunk data transfer progress","attr":{"response":{"waited":true,"active":true,"sessionId":"config_shard2_640acc1c34a26fa9736c391d","ns":"test.shards","from":"csshard/localhost:27019","fromShardId":"config","min":{"a":{"$minKey":1}},"max":{"a":500},"shardKeyPattern":{"a":1},"supportsCriticalSectionDuringCatchUp":true,"state":"fail","errmsg":"ExceededTimeLimit: Failed to delete orphaned test.shards range [{ a: MinKey }, { a: 500 }) :: caused by :: operation exceeded time limit","counts":{"cloned":0,"clonedBytes":0,"catchup":0,"steady":0},"ok":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1678429221,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$configTime":{"$timestamp":{"t":1678429221,"i":1}},"$topologyTime":{"$timestamp":{"t":1678339266,"i":2}},"operationTime":{"$timestamp":{"t":1678429220,"i":1}}},"memoryUsedBytes":0,"docsRemainingToClone":500,"untransferredModsSizeBytes":0}}
      {"t":{"$date":"2023-03-10T17:20:22.935+11:00"},"s":"I",  "c":"SHARDING", "id":22080,   "ctx":"MoveChunk","msg":"About to log metadata event","attr":{"namespace":"changelog","event":{"_id":"M-K4F3614CKP:27019-2023-03-10T17:20:22.935+11:00-640acc2634a26fa9736c3988","server":"M-K4F3614CKP:27019","shard":"config","clientAddr":"","time":{"$date":"2023-03-10T06:20:22.935Z"},"what":"moveChunk.error","ns":"test.shards","details":{"min":{"a":{"$minKey":1}},"max":{"a":500},"from":"config","to":"shard2"}}}}
      {"t":{"$date":"2023-03-10T17:20:22.944+11:00"},"s":"I",  "c":"MIGRATE",  "id":23893,   "ctx":"MoveChunk","msg":"MigrationCoordinator delivering decision to self and to recipient","attr":{"decision":"aborted","migrationId":{"uuid":{"$uuid":"825014bd-b9e8-469f-ab54-6eb397efda20"}}}}
      {"t":{"$date":"2023-03-10T17:20:22.965+11:00"},"s":"I",  "c":"SHARDING", "id":22080,   "ctx":"MoveChunk","msg":"About to log metadata event","attr":{"namespace":"changelog","event":{"_id":"M-K4F3614CKP:27019-2023-03-10T17:20:22.965+11:00-640acc2634a26fa9736c398e","server":"M-K4F3614CKP:27019","shard":"config","clientAddr":"","time":{"$date":"2023-03-10T06:20:22.965Z"},"what":"moveChunk.from","ns":"test.shards","details":{"step 1 of 6":0,"step 2 of 6":1,"step 3 of 6":27,"min":{"a":{"$minKey":1}},"max":{"a":500},"to":"shard2","from":"config","note":"aborted"}}}}
      {"t":{"$date":"2023-03-10T17:20:22.971+11:00"},"s":"W",  "c":"SHARDING", "id":23777,   "ctx":"MoveChunk","msg":"Error while doing moveChunk","attr":{"error":"OperationFailed: Data transfer error: ExceededTimeLimit: Failed to delete orphaned test.shards range [{ a: MinKey }, { a: 500 }) :: caused by :: operation exceeded time limit"}}
      {"t":{"$date":"2023-03-10T17:20:22.975+11:00"},"s":"I",  "c":"MIGRATE",  "id":6386803, "ctx":"MoveChunk-4","msg":"Unregistering donate chunk","attr":{"namespace":"test.shards","min":{"a":{"$minKey":1}},"max":{"a":500},"toShardId":"shard2"}}
      {"t":{"$date":"2023-03-10T17:20:22.975+11:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn72","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","command":{"_shardsvrMoveRange":"test.shards","toShard":"shard2","min":{"a":{"$minKey":1}},"max":{"a":500},"waitForDelete":false,"epoch":{"$oid":"64080bb146705024b23ed319"},"fromShard":"config","maxChunkSizeBytes":134217728,"forceJumbo":0,"secondaryThrottle":false,"writeConcern":{"w":1,"wtimeout":0},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1678429212,"i":1}},"signature":{"hash":{"$binary":{"base64":"ZRX3cyfrztWcyQFiASw9oSW+xOg=","subType":"0"}},"keyId":7208009756202500117}},"$configTime":{"$timestamp":{"t":1678429212,"i":1}},"$topologyTime":{"$timestamp":{"t":1678339266,"i":2}},"mayBypassWriteBlocking":true,"$db":"admin"},"numYields":0,"ok":0,"errMsg":"Data transfer error: ExceededTimeLimit: Failed to delete orphaned test.shards range [{ a: MinKey }, { a: 500 }) :: caused by :: operation exceeded time limit","errName":"OperationFailed","errCode":96,"reslen":417,"locks":{},"writeConcern":{"w":1,"wtimeout":0,"provenance":"clientSupplied"},"remote":"127.0.0.1:52326","protocol":"op_msg","durationMillis":10083}}
      

      Note: no documents actually moved ("docsRemainingToClone":500), yet the timeout says "Failed to delete orphaned test.shards range". We haven't copied anything, so why is the range orphaned?
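
      The orphans are presumably left over from an earlier migration of this range off the destination shard, with their deletion still queued behind the 15-minute delay described in the updated description. A sketch of how one might confirm that, assuming the resumable range deleter persists its queue in the shard-local config.rangeDeletions collection (field names from memory, and may differ by version):

      // On the primary of the shard still holding the orphans, not via mongos:
      use config
      db.rangeDeletions.find({ nss: "test.shards" }).pretty()
      // A task covering [{ a: MinKey }, { a: 500 }) that is still pending
      // or marked for delayed cleanup would confirm the range is queued
      // for deletion rather than lost.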

      These documents are nothing exciting:

      [direct: mongos] test> db.shards.findOne()
      { _id: ObjectId("640ac95879d1df6155df1017"), a: 0 }
      

      I tried this on a cluster running the same mongod binary version, but without a config shard (i.e. with a separate CSRS). It worked fine, as expected:

      [direct: mongos] test> sh.status()
      shardingVersion
      { _id: 1, clusterId: ObjectId("6406d64c340177022f4fb4b1") }
      ---
      shards
      [
        {
          _id: 'shard01',
          host: 'shard01/localhost:30001',
          state: 1,
          topologyTime: Timestamp({ t: 1678169678, i: 3 })
        },
        {
          _id: 'shard02',
          host: 'shard02/localhost:30002',
          state: 1,
          topologyTime: Timestamp({ t: 1678169678, i: 8 })
        }
      ]
      ---
      active mongoses
      [ { '7.0.0-alpha-538-g7cec1b7': 1 } ]
      ---
      autosplit
      { 'Currently enabled': 'yes' }
      ---
      balancer
      { 'Currently enabled': 'yes', 'Currently running': 'no' }
      ---
      databases
      [
        {
          database: { _id: 'config', primary: 'config', partitioned: true },
          collections: {
            'config.system.sessions': {
              shardKey: { _id: 1 },
              unique: false,
              balancing: true,
              chunkMetadata: [ { shard: 'shard01', nChunks: 1024 } ],
              chunks: [
                'too many chunks to print, use verbose if you want to force print'
              ],
              tags: []
            }
          }
        },
        {
          database: {
            _id: 'test',
            primary: 'shard02',
            partitioned: false,
            version: {
              uuid: UUID("9caa0eb3-6206-4469-925e-7b73bbfafdcf"),
              timestamp: Timestamp({ t: 1678247389, i: 1 }),
              lastMod: 1
            }
          },
          collections: {
            'test.baz': {
              shardKey: { a: 1 },
              unique: true,
              balancing: true,
              chunkMetadata: [ { shard: 'shard02', nChunks: 1 } ],
              chunks: [
                { min: { a: MinKey() }, max: { a: MaxKey() }, 'on shard': 'shard02', 'last modified': Timestamp({ t: 1, i: 0 }) }
              ],
              tags: []
            },
            'test.foo': {
              shardKey: { a: 'hashed' },
              unique: false,
              balancing: true,
              chunkMetadata: [
                { shard: 'shard01', nChunks: 2 },
                { shard: 'shard02', nChunks: 2 }
              ],
              chunks: [
                { min: { a: MinKey() }, max: { a: Long("-4611686018427387902") }, 'on shard': 'shard01', 'last modified': Timestamp({ t: 1, i: 0 }) },
                { min: { a: Long("-4611686018427387902") }, max: { a: Long("0") }, 'on shard': 'shard01', 'last modified': Timestamp({ t: 1, i: 1 }) },
                { min: { a: Long("0") }, max: { a: Long("4611686018427387902") }, 'on shard': 'shard02', 'last modified': Timestamp({ t: 1, i: 2 }) },
                { min: { a: Long("4611686018427387902") }, max: { a: MaxKey() }, 'on shard': 'shard02', 'last modified': Timestamp({ t: 1, i: 5 }) }
              ],
              tags: []
            }
          }
        }
      ]
      [direct: mongos] test> sh.moveChunk("test.foo", {a:1}, "shard01")
      {
        millis: 158,
        ok: 1,
        '$clusterTime': {
          clusterTime: Timestamp({ t: 1678429378, i: 36 }),
          signature: {
            hash: Binary(Buffer.from("0000000000000000000000000000000000000000", "hex"), 0),
            keyId: Long("0")
          }
        },
        operationTime: Timestamp({ t: 1678429378, i: 36 })
      }
      

      Note for posterity: this second cluster doesn't have auth enabled, so this might be an auth vs. no-auth problem.
      But note that I ran the failing test as the root user; I can't escalate my privileges any further than that.
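
      Given the updated description at the top (nothing is actually wrong; the orphaned range simply hasn't been cleaned up yet), a hedged sketch of unblocking the retry instead of waiting out the full 15 minutes, assuming cleanupOrphaned behaves as documented for 6.0+ (it blocks until queued range deletions for the namespace complete; not re-verified on this 7.0 alpha build):

      // Run against the primary of the shard holding the orphans, not mongos.
      db.adminCommand({ cleanupOrphaned: "test.shards" })
      // Once this returns, retrying the moveChunk should no longer hit
      // the range deletion timeout.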

            Assignee: Silvia Surroca (silvia.surroca@mongodb.com)
            Reporter: Joanna Cheng (joanna.cheng@mongodb.com)
            Votes: 0
            Watchers: 6
