Core Server / SERVER-509

Add option to continue with bulk insert on duplicate key/object id


Details

    • Type: New Feature
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 1.9.1
    • Affects Version/s: None
    • Component/s: Index Maintenance
    • Labels: None

    Description

      Hi guys,

      Would it be possible to add an option that allows a bulk insert to continue processing once a duplicate key/object id error has occurred?

      My use case:
      I'm building a GridFS clone with data deduplication based on a hash of the chunk data and file revisions. My collection with the chunks looks like this:
      {_id: <objectid>, data: <bin-data>}

      The object id of each chunk is a 12-byte hash of the chunk contents; for that I'm using the MD4 algorithm. (This is faster than maintaining an additional unique index just for the chunk hash.)
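      As an illustration (not from the ticket), content-addressed chunk documents along these lines could be built with PyMongo's bson module; this sketch truncates an MD5 digest to 12 bytes because MD4 availability in Python's hashlib depends on the OpenSSL build:

      import hashlib
      from bson import ObjectId

      CHUNK_SIZE = 64 * 1024  # 64 KB chunks, as in the description

      def chunk_id(data: bytes) -> ObjectId:
          # 12-byte content hash used as the _id (the ticket uses MD4;
          # MD5 truncated to 12 bytes is substituted here).
          return ObjectId(hashlib.md5(data).digest()[:12])

      def make_chunk_docs(blob: bytes) -> list:
          # Split a file into fixed-size chunks keyed by their content hash.
          # Identical chunks get the same _id, so re-inserting them raises a
          # duplicate key error instead of storing the data twice.
          return [
              {"_id": chunk_id(blob[i:i + CHUNK_SIZE]),
               "data": blob[i:i + CHUNK_SIZE]}
              for i in range(0, len(blob), CHUNK_SIZE)
          ]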

      If I'm inserting a 100 MB file (1600 chunks of 64 KB), duplicate chunks won't be saved again; this is my poor man's method of deduplication.
      I thought it would be much faster to collect 1000 chunks and do a bulk insert, but if the DB won't process the following chunks after a duplicate object id is found, my method doesn't work.
      I would be very happy to have an option to continue on duplicate chunks.

      For example, in PyMongo:
      db.collection.insert([my_doc_list], skip_duplicates=True)
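      For reference (not part of the original request): in current PyMongo the closest equivalent to the proposed flag is an unordered bulk write, which continues past duplicate-key errors and reports them once at the end. A minimal sketch, with placeholder client, database, and collection names:

      from pymongo import MongoClient
      from pymongo.errors import BulkWriteError

      client = MongoClient()          # assumes a local mongod
      chunks = client.testdb.chunks   # placeholder database/collection names

      def insert_skip_duplicates(docs):
          # Unordered inserts keep going after individual failures and report
          # them all together in a single BulkWriteError.
          try:
              chunks.insert_many(docs, ordered=False)
          except BulkWriteError as exc:
              # 11000 is the duplicate-key error code; re-raise anything else.
              others = [e for e in exc.details["writeErrors"] if e["code"] != 11000]
              if others:
                  raise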

      Thanks in advance,
      Marc

          People

            Assignee: Kyle Banker
            Reporter: Marc Boeker
            Votes: 14
            Watchers: 18
