  Core Server / SERVER-509

Add option to continue with bulk insert on duplicate key/object id

    • Type: New Feature
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 1.9.1
    • Affects Version/s: None
    • Component/s: Index Maintenance
    • Labels: None

      Hi guys,

      Would it be possible to add an option that allows a bulk insert to continue processing once a duplicate key/ObjectId has occurred?

      My use case:
      I'm building a GridFS clone with data deduplication based on a hash of the chunk data and file revisions. My collection with the chunks looks like this:
      {_id: <objectid>, data: <bin-data>}

      The ObjectId of each chunk is a 12-byte hash of the chunk contents, which is why I'm using the MD4 algorithm. (This makes it faster than maintaining another unique index just for the chunk hash.)
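
      A minimal sketch (not part of the original report) of how such a content-derived _id could be computed, assuming PyMongo's bson package is available and that the 16-byte MD4 digest is simply truncated to the 12 bytes an ObjectId holds; note that MD4 support in hashlib depends on the local OpenSSL build:

      import hashlib
      from bson import ObjectId

      def chunk_oid(chunk: bytes) -> ObjectId:
          # MD4 produces a 16-byte digest; keep the first 12 bytes to fit an ObjectId.
          digest = hashlib.new("md4", chunk).digest()
          return ObjectId(digest[:12])

      # Identical chunks hash to the same _id, so re-inserting one raises a duplicate key error.
      doc = {"_id": chunk_oid(b"64KB of chunk data"), "data": b"64KB of chunk data"}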

      If I'm inserting a 100 MB file (1,600 chunks of 64 KB), duplicate chunks won't be saved. This is my poor man's method of deduplication.
      I thought it would be much faster to collect 1,000 chunks and do a bulk insert, but if the DB won't process the remaining chunks after a duplicate ObjectId is found, my method doesn't work.
      I would be very happy to have an option to continue on duplicate chunks.

      For example, in PyMongo:
      db.collection.insert([my_doc_list], skip_duplicates=True)
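
      For comparison, here is a minimal sketch of how the same pattern can be expressed with the API that later PyMongo versions provide (assuming PyMongo 3.x or newer and a running mongod; the database, collection, and document values are illustrative): an unordered bulk write keeps inserting after a duplicate key error, and the duplicate-key failures can then be filtered out of the resulting BulkWriteError.

      from pymongo import MongoClient
      from pymongo.errors import BulkWriteError

      client = MongoClient()              # assumes a local mongod
      coll = client.dedup_test.chunks     # illustrative database/collection names

      docs = [
          {"_id": 1, "data": b"chunk one"},
          {"_id": 1, "data": b"chunk one"},   # deliberate duplicate _id
          {"_id": 2, "data": b"chunk two"},
      ]

      try:
          # ordered=False tells the server to continue with the remaining
          # documents after a duplicate key error, so _id 2 is still inserted.
          coll.insert_many(docs, ordered=False)
      except BulkWriteError as err:
          # Ignore duplicate-key errors (code 11000); re-raise anything else.
          if any(e["code"] != 11000 for e in err.details["writeErrors"]):
              raise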

      Thanks in advance,
      Marc

            Assignee:
            kbanker Kyle Banker
            Reporter:
            marc Marc Boeker
            Votes:
            14
            Watchers:
            18

              Created:
              Updated:
              Resolved: