- Type: New Feature
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Index Maintenance
hi guys,
would it be possible to add an option to continue processing a bulk insert after a duplicate key/object id has occurred?
my use case:
i'm building a gridfs clone with data deduplication based on a hash of the chunk data and file revisions. my collection with the chunks looks like this:
{_id: <objectid>, data: <bin-data>}
the object id of each chunk is a 12-byte hash of the chunk contents, which is why i'm using the md4 algorithm. (this is faster than having an extra unique index just for the chunk hash.)
when i insert a 100MB file (1600 chunks of 64KB), duplicate chunks won't be saved. this is my poor man's method of deduplication.
i thought it would be much faster to collect 1000 chunks and do a bulk insert, but if the DB won't process the remaining chunks after a duplicate object id is found, my method doesn't work.
i would be very happy to have an option to continue on duplicate chunks.
e.g. in pymongo:
db.collection.insert(my_doc_list, skip_duplicates=True)
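for illustration only, here is a rough sketch of the deduplication pattern described above, written against pymongo's later insert_many API with ordered=False (which continues past duplicate-key errors and reports them at the end) rather than the skip_duplicates flag requested here. the collection name, file name, and chunking helper are made up, and md5 truncated to 12 bytes stands in for the md4 hash from the post, since md4 isn't available in every hashlib build:

import hashlib

from bson import ObjectId
from pymongo import MongoClient
from pymongo.errors import BulkWriteError

CHUNK_SIZE = 64 * 1024  # 64KB chunks, as in the example above


def chunk_docs(data):
    """Split a byte string into chunk documents keyed by a content hash."""
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        # 12-byte digest so it can serve directly as the ObjectId
        # (md5 truncated here; the original post uses md4)
        digest = hashlib.md5(chunk).digest()[:12]
        yield {"_id": ObjectId(digest), "data": chunk}


def insert_chunks(collection, data):
    try:
        # ordered=False: duplicate _ids raise errors, but the remaining
        # documents in the batch are still inserted
        collection.insert_many(list(chunk_docs(data)), ordered=False)
    except BulkWriteError as exc:
        # ignore duplicate-key errors (code 11000); re-raise anything else
        if any(err["code"] != 11000 for err in exc.details["writeErrors"]):
            raise


if __name__ == "__main__":
    client = MongoClient()
    chunks = client.test.chunks
    with open("bigfile.bin", "rb") as f:
        insert_chunks(chunks, f.read())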
thanks in advance,
marc
- is depended on by TOOLS-72 Mongoimport should do bulk insertion (Closed)