Core Server / SERVER-40061

Chunk move fails due to DuplicateKey error on the `config.chunks` collection at migration commit

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 3.6.10
    • Component/s: Sharding
    • Labels: None
    • Operating System: ALL
    • Sprint: Sharding 2019-05-20, Sharding 2019-06-03, Sharding 2019-06-17, Sharding 2019-07-01, Sharding 2019-07-15, Sharding 2019-07-29, Sharding 2019-08-12

      The ChunkType::genID method uses BSONElement::toString, which was changed to provide better formatting for UUID BinData values. Unfortunately, ChunkType::genID is used throughout sharding-related code to produce the value of the "_id" field in the "config.chunks" collection. When the chunk's minimum key contains a UUID value, the "_id" generated by v3.6 therefore differs from the one generated by v3.4 (and earlier versions).
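
      To illustrate, below is a minimal sketch (mongo shell JavaScript, not the actual C++ server code) of how the chunk "_id" is assembled from the namespace and the stringified min-key fields. genChunkId is a hypothetical helper mirroring the concatenation scheme, and the two stringified UUID values are the ones visible in the command and document further down:

      // Illustrative only: the _id is the namespace plus each min-key field
      // rendered as "<field>_<stringified value>".
      function genChunkId(ns, minKeyAsStrings) {
          var id = ns + "-";
          Object.keys(minKeyAsStrings).forEach(function (field) {
              id += field + "_" + minKeyAsStrings[field];
          });
          return id;
      }

      // v3.4-era rendering of the UUID shard-key value (BinData subtype 4):
      genChunkId("a.fs.chunks", { files_id: "BinData(4, 056600000000E00096C2DC81CA6FA911)", n: "0" });
      // --> "a.fs.chunks-files_id_BinData(4, 056600000000E00096C2DC81CA6FA911)n_0"

      // v3.6 rendering of the same bytes after the BSONElement::toString change:
      genChunkId("a.fs.chunks", { files_id: 'UUID("05660000-0000-e000-96c2-dc81ca6fa911")', n: "0" });
      // --> 'a.fs.chunks-files_id_UUID("05660000-0000-e000-96c2-dc81ca6fa911")n_0'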

      We hit this when trying to move chunks manually in a cluster we recently upgraded from v3.4 to v3.6:

      2019-03-10T16:54:39.264+0300 I COMMAND  [conn469729] command admin.$cmd appName: "MongoDB Shell" command: _configsvrMoveChunk { _configsvrMoveChunk: 1, _id: "a.fs.chunks-files_id_UUID("05660000-0000-e000-96c2-dc81ca6fa911")n_0", ns: "a.fs.chunks", min: { files_id: UUID("05660000-0000-e000-96c2-dc81ca6fa911"), n: 0 }, max: { files_id: UUID("05666100-0000-e000-9252-7b82dea0b186"), n: 3 }, shard: "driveFS-2", lastmod: Timestamp(571033, 1), lastmodEpoch: ObjectId('51793868331d54dfcf8e0032'), toShard: "driveFS-17", maxChunkSizeBytes: 536870912, secondaryThrottle: {}, waitForDelete: false, writeConcern: { w: "majority", wtimeout: 15000 }, lsid: { id: UUID("605f316b-6296-4010-9f26-835b60f923ff"), uid: BinData(0, EE3A53D0CA965E6112DBEBF842D31DC81E8CE7E7548256DE28D08422B2C59D3B) }, $replData: 1, $clusterTime: { clusterTime: Timestamp(0, 0), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId:0 } }, $client: { application: { name: "MongoDB Shell" }, driver: { name: "MongoDB Internal Client", version: "3.6.9" }, os: { type: "Windows", name: "Microsoft Windows 10", architecture: "x86_64", version: "10.0 (build 17134)" }, mongos: { host: "dorado2:27017",client: "10.254.3.70:1334", version: "3.6.10" } }, $configServerState: { opTime: { ts: Timestamp(1552225736, 38), t: 95 } }, $db: "admin" } exception: Chunk move was not successful due to E11000 duplicate key error collection: config.chunks index: ns_1_min_1 dup key: { : "a.fs.chunks", : { files_id: UUID("05660000-0000-e000-96c2-dc81ca6fa911"), n: 0 } } code:DuplicateKey numYields:0 reslen:562 locks:{ Global: { acquireCount: { r: 10, w: 6 } }, Database: { acquireCount: { r: 2, w: 6 } }, Collection: { acquireCount: { r: 2,w: 3 } }, oplog: { acquireCount: { w: 3 } } } protocol:op_msg 340766ms
      

      Of course, the "config.chunks" collection already contains a document for this chunk, keyed by the old-format "_id":

      > db.chunks.find({ns:"a.fs.chunks",min: { files_id: UUID("05660000-0000-e000-96c2-dc81ca6fa911"), n: 0 } })
      { "_id" : "a.fs.chunks-files_id_BinData(4, 056600000000E00096C2DC81CA6FA911)n_0", "lastmod" : Timestamp(539637, 1290), "lastmodEpoch" : ObjectId("51793868331d54dfcf8e0032"), "ns" : "a.fs.chunks", "min" : { "files_id" : UUID("05660000-0000-e000-96c2-dc81ca6fa911"), "n" : 0 }, "max" : { "files_id" : UUID("05666100-0000-e000-9252-7b82dea0b186"), "n" : 3 }, "shard" : "driveFS-2" }
      

      Since I do not know which other operations rely on the "_id" field, I cannot estimate the full impact of this problem, but a cursory inspection of the codebase shows at least some places where the update is performed without checking the number of matched/modified documents, so the chunk metadata could be silently lost or damaged (see the sketch below).
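
      For example (illustrative values only, not a recommended operation), an update keyed on the v3.6-format "_id" matches nothing, and a caller that does not inspect the write result would continue as if the metadata had been changed:

      // Illustrative only: the filter uses the v3.6-format _id, but the stored
      // document is keyed by the old BinData-format _id, so nothing matches.
      var res = db.getSiblingDB("config").chunks.update(
          { _id: 'a.fs.chunks-files_id_UUID("05660000-0000-e000-96c2-dc81ca6fa911")n_0' },
          { $set: { shard: "driveFS-17" } }
      );
      // res.nMatched is 0; code that ignores nMatched/nModified silently loses the change.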

            Assignee: Kaloian Manassiev
            Reporter: Aristarkh Zagorodnikov
            Votes: 1
            Watchers: 19
