  Core Server / SERVER-509

Add option to continue with bulk insert on duplicate key/object id

    • Type: New Feature
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 1.9.1
    • Affects Version/s: None
    • Component/s: Index Maintenance
    • Labels: None

      Hi guys,

      Would it be possible to add an option that allows a bulk insert to continue processing once a duplicate key/ObjectId has occurred?

      My use case:
      I'm building a GridFS clone with data deduplication based on a hash of the chunk data and file revisions. My collection with the chunks looks like this:
      {_id: <objectid>, data: <bin-data>}

      The ObjectId of each chunk is a 12-byte hash of the chunk contents, which is why I'm using the MD4 algorithm. (This makes it faster than maintaining another unique index just for the chunk hash.)
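
      A minimal sketch (not part of the original report) of how such a content-derived _id could be computed, assuming PyMongo's bson package is available and that the 16-byte MD4 digest is simply truncated to the 12 bytes an ObjectId holds; note that MD4 support in hashlib depends on the local OpenSSL build:

      import hashlib
      from bson import ObjectId

      def chunk_oid(chunk: bytes) -> ObjectId:
          # MD4 produces a 16-byte digest; keep the first 12 bytes to fit an ObjectId.
          digest = hashlib.new("md4", chunk).digest()
          return ObjectId(digest[:12])

      # Identical chunks hash to the same _id, so re-inserting one raises a duplicate key error.
      doc = {"_id": chunk_oid(b"64KB of chunk data"), "data": b"64KB of chunk data"}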

      If I'm inserting a 100 MB file (1,600 chunks of 64 KB), duplicate chunks won't be saved. This is my poor man's method of deduplication.
      I thought it would be much faster to collect 1,000 chunks and do a bulk insert, but if the DB won't process the remaining chunks after a duplicate ObjectId is found, my method doesn't work.
      I would be very happy to have an option to continue on duplicate chunks.

      For example, in PyMongo:
      db.collection.insert([my_doc_list], skip_duplicates=True)
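
      For comparison, here is a minimal sketch of how the same pattern can be expressed with the API that later PyMongo versions provide (assuming PyMongo 3.x or newer and a running mongod; the database, collection, and document values are illustrative): an unordered bulk write keeps inserting after a duplicate key error, and the duplicate-key failures can then be filtered out of the resulting BulkWriteError.

      from pymongo import MongoClient
      from pymongo.errors import BulkWriteError

      client = MongoClient()              # assumes a local mongod
      coll = client.dedup_test.chunks     # illustrative database/collection names

      docs = [
          {"_id": 1, "data": b"chunk one"},
          {"_id": 1, "data": b"chunk one"},   # deliberate duplicate _id
          {"_id": 2, "data": b"chunk two"},
      ]

      try:
          # ordered=False tells the server to continue with the remaining
          # documents after a duplicate key error, so _id 2 is still inserted.
          coll.insert_many(docs, ordered=False)
      except BulkWriteError as err:
          # Ignore duplicate-key errors (code 11000); re-raise anything else.
          if any(e["code"] != 11000 for e in err.details["writeErrors"]):
              raise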

      Thanks in advance,
      Marc

            Assignee:
            kbanker Kyle Banker
            Reporter:
            marc Marc Boeker
            Votes:
            14
            Watchers:
            18

              Created:
              Updated:
              Resolved: