Core Server / SERVER-509

Add option to continue with bulk insert on duplicate key/object id


    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.9.1
    • Component/s: Indexing
    • Labels:
      None

      Description

      Hi guys,

      Would it be possible to add an option to continue processing a bulk insert after a duplicate key/object id has occurred?

      My use case:
      I'm building a GridFS clone with data deduplication based on a hash of the chunk data and file revisions. My collection with the chunks looks like this:
      {_id: <objectid>, data: <bin-data>}

      The object id of each chunk is a 12-byte hash of the chunk contents; for that I'm using the MD4 algorithm. (Reusing _id this way is faster than maintaining a separate unique index on the chunk hash.)
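The hash-as-_id scheme described above can be sketched in a few lines of Python. MD4 availability in `hashlib` depends on the local OpenSSL build, so this sketch falls back to MD5 truncated to 12 bytes; the helper name `chunk_id` is hypothetical, not from the issue:

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # 64KB chunks, as in the description

def chunk_id(chunk: bytes) -> bytes:
    """Return a 12-byte content hash usable as an ObjectId-sized _id."""
    try:
        h = hashlib.new("md4", chunk)  # MD4, as the reporter uses
    except ValueError:
        # MD4 not compiled into this OpenSSL build; MD5 as a stand-in
        h = hashlib.md5(chunk)
    # Both digests are 16 bytes; truncate to ObjectId's 12 bytes.
    return h.digest()[:12]

# Identical chunks map to identical _ids, so duplicates collide on insert.
assert chunk_id(b"a" * CHUNK_SIZE) == chunk_id(b"a" * CHUNK_SIZE)
assert len(chunk_id(b"x")) == 12
```

Because identical chunks hash to the same `_id`, the unique index on `_id` rejects the duplicate insert, which is exactly the deduplication effect the reporter relies on.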

      If I'm inserting a 100MB file (1600 chunks of 64KB), duplicate chunks won't be saved. This is my poor man's method of deduplication.
      I figured it would be much faster to collect 1000 chunks and do a bulk insert, but if the DB won't process the remaining chunks after a duplicate object id is found, my method doesn't work.
      I would be very happy to have an option to continue on duplicate chunks.

      For example, in PyMongo:
      db.collection.insert([my_doc_list], skip_duplicates=True)

      Thanks in advance,
      Marc
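The semantics requested above (`skip_duplicates=True` is the reporter's proposed, hypothetical flag) can be simulated in plain Python without a server. Here a dict stands in for the chunks collection and its `_id` index; the helper name is illustrative only:

```python
# Simulation of a bulk insert that continues past duplicate _ids
# instead of aborting the whole batch at the first duplicate.

def bulk_insert_skip_duplicates(collection: dict, docs: list) -> int:
    """Insert docs keyed by _id; skip duplicates, return count inserted."""
    inserted = 0
    for doc in docs:
        key = doc["_id"]
        if key in collection:
            continue  # duplicate key: skip this doc, keep processing
        collection[key] = doc
        inserted += 1
    return inserted

chunks = {}
batch = [{"_id": b"h1", "data": b"..."},
         {"_id": b"h2", "data": b"..."},
         {"_id": b"h1", "data": b"..."}]  # duplicate chunk
assert bulk_insert_skip_duplicates(chunks, batch) == 2
assert len(chunks) == 2
```

The feature this issue tracks was fixed in 1.9.1; in later drivers the equivalent knob is an unordered/continue-on-error insert (e.g. PyMongo's `insert_many(docs, ordered=False)`, which attempts every document and reports duplicate-key errors at the end rather than stopping at the first one).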

      Attachments

      Issue Links

      Activity

      People

      • Votes: 14
      • Watchers: 18

      Dates

      • Created:
      • Updated:
      • Resolved: