[SERVER-509] Add option to continue with bulk insert on duplicate key/object id Created: 30/Dec/09  Updated: 12/Jul/16  Resolved: 23/May/11

Status: Closed
Project: Core Server
Component/s: Index Maintenance
Affects Version/s: None
Fix Version/s: 1.9.1

Type: New Feature Priority: Major - P3
Reporter: Marc Boeker Assignee: Kyle Banker
Resolution: Done Votes: 14
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by TOOLS-72 Mongoimport should do bulk insertion Closed
Related
Participants:

 Description   

Hi guys,

Would it be possible to add an option that lets a bulk insert continue processing once a duplicate key/object id has occurred?

My use case:
I'm building a GridFS clone with data deduplication based on a hash of the chunk data and file revisions. My collection with the chunks looks like this:
{_id: <objectid>, data: <bin-data>}

The object id of each chunk is a 12-byte hash of the chunk contents, for which I'm using the MD4 algorithm. (This makes it faster than having a separate unique index just for the chunk hash.)

If I insert a 100 MB file (1600 chunks of 64 KB), duplicate chunks won't be saved. This is my poor man's method of deduplication.
I thought it would be much faster to collect 1000 chunks and do a bulk insert, but if the DB won't process the remaining chunks after a duplicate object id is found, my method doesn't work.
I would be very happy to have an option to continue on duplicate chunks.

E.g. in pymongo:
db.collection.insert(my_doc_list, skip_duplicates=True)

Thanks in advance,
Marc
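
For illustration, a minimal sketch of this scheme, assuming modern PyMongo (3.x or later), where the continue-on-error behaviour requested here is expressed as insert_many(..., ordered=False); MD5 truncated to 12 bytes stands in for MD4, and the collection name, sample data and chunking are illustrative only:

# Sketch only: content-addressed chunks plus a continue-on-error bulk insert.
import hashlib

from bson import ObjectId
from pymongo import MongoClient
from pymongo.errors import BulkWriteError

CHUNK_SIZE = 64 * 1024  # 64 KB chunks, as in the description


def chunk_id(data: bytes) -> ObjectId:
    # Derive a 12-byte _id from the chunk contents (content addressing).
    return ObjectId(hashlib.md5(data).digest()[:12])


def insert_chunks(collection, chunks):
    docs = [{"_id": chunk_id(c), "data": c} for c in chunks]
    try:
        # ordered=False keeps inserting the remaining documents even if some
        # fail with a duplicate-key error (the already-stored chunks).
        collection.insert_many(docs, ordered=False)
    except BulkWriteError:
        # Duplicates are expected here; a real implementation would inspect
        # the exception's details and re-raise non-duplicate errors.
        pass


if __name__ == "__main__":
    coll = MongoClient().dedup_test.chunks
    data = b"x" * (3 * CHUNK_SIZE)
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    insert_chunks(coll, chunks)  # identical chunks collapse to one document
    insert_chunks(coll, chunks)  # re-inserting skips the duplicates

Calling insert_chunks again with the same data only raises duplicate-key errors, which the unordered insert skips past, so any new chunks in a later batch are still stored.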



 Comments   
Comment by Antoine Girbal [ 25/May/11 ]

quick notes:

  • How can you use an MD4 hash to deduplicate chunks?
    It sounds like if the MD4 matches you don't insert the chunk, but what happens on a hash collision, when the data is not exactly the same?
  • Considering that your chunks are rather large, you probably won't see much of an improvement for bulk inserts vs. fire-and-forget single inserts.
    Often the bottleneck will be disk, and single inserts can use multiple connections / server threads for processing.
    Let us know if you see much of a difference.
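
One way a client could guard against that collision case, as a hypothetical helper (not part of any server change), reusing the chunk_id function from the sketch under the description:

def verify_duplicate(collection, chunk: bytes) -> bool:
    # True when the stored chunk with the same _id really contains the same
    # bytes; False signals a genuine 12-byte hash collision.
    existing = collection.find_one({"_id": chunk_id(chunk)})
    return existing is not None and bytes(existing["data"]) == chunk
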
Comment by auto [ 23/May/11 ]

Author: Mathias Stearn (RedBeard0531) <mathias@10gen.com>

Message: Add InsertOption_KeepGoing to keep going after error on bulk insert. SERVER-509
Branch: master
https://github.com/mongodb/mongo/commit/b690e237fd7055ad1da8950882c62b4fab82baee

Comment by auto [ 23/May/11 ]

Author: Mathias Stearn (RedBeard0531) <mathias@10gen.com>

Message: minor refactor to prep for SERVER-509
Branch: master
https://github.com/mongodb/mongo/commit/0e28f89602e84f6ea6009cf5d5d91da675c1d199

Comment by Benjamin Darfler [ 18/Apr/11 ]

If possible, it would be nice to return the items that were not inserted, or otherwise give feedback as to which ones failed.
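
Later driver versions do expose this; for example, in PyMongo an unordered insert_many raises BulkWriteError, whose details list the index, error code and message of every document that was not inserted. A minimal sketch:

from pymongo.errors import BulkWriteError


def insert_and_report(collection, docs):
    # Returns (index, code, message) for every document that failed, e.g.
    # code 11000 for duplicate keys; all other documents were still inserted.
    try:
        collection.insert_many(docs, ordered=False)
        return []
    except BulkWriteError as exc:
        return [(err["index"], err["code"], err["errmsg"])
                for err in exc.details["writeErrors"]]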

Comment by Knut Forkalsrud [ 23/Mar/11 ]

In my use case I can take advantage of being the only client inserting into the collection. That way I can insert at will, and after each batch I check the size of the collection. When the size is smaller than expected, I can easily calculate which element in my batch caused the failure and discard/retry accordingly.

This use case may be common enough that it might make sense to support it in some library form, maybe even in the driver.
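
A minimal sketch of that approach, assuming a single writer and the original stop-at-first-error behaviour, where the collection count locates the offending document so it can be discarded and the rest of the batch retried:

from pymongo.errors import BulkWriteError


def insert_batch_single_writer(collection, docs):
    remaining = list(docs)
    while remaining:
        before = collection.count_documents({})
        try:
            collection.insert_many(remaining)  # ordered: stops at the first error
            return
        except BulkWriteError:
            inserted = collection.count_documents({}) - before
            # With no other writers, the document at position `inserted` is the
            # one that failed; discard it and retry the tail of the batch.
            remaining = remaining[inserted + 1:]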

Comment by ofer fort [ 22/Mar/11 ]

This is something we'd also love to have, as it would reduce our calls to insert dramatically.

Comment by Eliot Horowitz (Inactive) [ 21/Jan/11 ]

To do this, all driver APIs will need to change.
We should do it at the beginning of the next cycle so the drivers have time to change.

Comment by Dwight Merriman [ 14/Mar/10 ]

Yes, this makes sense.

Given that the chunks are pretty big, though, I think you will find singleton inserts to be very fast if you do not call getLastError after each insert.
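
A sketch of that fire-and-forget pattern in modern PyMongo terms, where an unacknowledged write concern (w=0) is the equivalent of not calling getLastError after each insert; the MD5-based _id again stands in for MD4, and the collection name is illustrative:

import hashlib

from bson import ObjectId
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern


def insert_chunks_unacknowledged(collection, chunks):
    # w=0 means each insert returns without waiting for acknowledgement; a
    # duplicate _id simply fails silently on the server, which matches the
    # deduplication behaviour the reporter wants.
    fast = collection.with_options(write_concern=WriteConcern(w=0))
    for chunk in chunks:
        _id = ObjectId(hashlib.md5(chunk).digest()[:12])  # MD5 stands in for MD4
        fast.insert_one({"_id": _id, "data": chunk})


if __name__ == "__main__":
    insert_chunks_unacknowledged(MongoClient().dedup_test.chunks,
                                 [b"a" * 65536, b"b" * 65536])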
