- Type: New Feature
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Index Maintenance
hi guys,
would it be possible to add an option to continue processing a bulk insert after a duplicate key/object id has occurred?
my use case:
i'm building a gridfs clone with data deduplication based on a hash of the chunk data and file revisions. my collection with the chunks looks like this:
{_id: <objectid>, data: <bin-data>}
the object id of each chunk is a 12-byte hash of the chunk contents, which is why i'm using the md4 algorithm. (this is faster than having an extra unique index just for the chunk hash.)
when i insert a 100MB file (1600 chunks of 64KB), duplicate chunks won't be saved. this is my poor man's method of deduplication.
i thought it would be much faster to collect 1000 chunks and do a bulk insert, but if the DB won't process the remaining chunks after a duplicate object id is found, my method doesn't work.
i would be very happy to have an option to continue on duplicate chunks.
e.g. in pymongo:
db.collection.insert(my_doc_list, skip_duplicates=True)
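for illustration only, here is a rough sketch of the deduplication pattern described above, written against pymongo's later insert_many API with ordered=False (which continues past duplicate-key errors and reports them at the end) rather than the skip_duplicates flag requested here. the collection name, file name, and chunking helper are made up, and md5 truncated to 12 bytes stands in for the md4 hash from the post, since md4 isn't available in every hashlib build:

import hashlib

from bson import ObjectId
from pymongo import MongoClient
from pymongo.errors import BulkWriteError

CHUNK_SIZE = 64 * 1024  # 64KB chunks, as in the example above


def chunk_docs(data):
    """Split a byte string into chunk documents keyed by a content hash."""
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        # 12-byte digest so it can serve directly as the ObjectId
        # (md5 truncated here; the original post uses md4)
        digest = hashlib.md5(chunk).digest()[:12]
        yield {"_id": ObjectId(digest), "data": chunk}


def insert_chunks(collection, data):
    try:
        # ordered=False: duplicate _ids raise errors, but the remaining
        # documents in the batch are still inserted
        collection.insert_many(list(chunk_docs(data)), ordered=False)
    except BulkWriteError as exc:
        # ignore duplicate-key errors (code 11000); re-raise anything else
        if any(err["code"] != 11000 for err in exc.details["writeErrors"]):
            raise


if __name__ == "__main__":
    client = MongoClient()
    chunks = client.test.chunks
    with open("bigfile.bin", "rb") as f:
        insert_chunks(chunks, f.read())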
thanks in advance,
marc
- is depended on by TOOLS-72 Mongoimport should do bulk insertion (Closed)