[CSHARP-741] Using InsertBatch with a very large dataset (~300MB CSV) Created: 14/May/13 Updated: 05/Apr/19 Resolved: 11/Jun/13 |
|
| Status: | Closed |
| Project: | C# Driver |
| Component/s: | None |
| Affects Version/s: | 1.8.1 |
| Fix Version/s: | None |
| Type: | Task | Priority: | Blocker - P1 |
| Reporter: | Nic Pillinger | Assignee: | Robert Stam |
| Resolution: | Done | Votes: | 0 |
| Labels: | question | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
I'm trying to use InsertBatch to process a ~300MB file. I use File.ReadLines to enumerate the lines in the file and turn them into BsonDocuments, then pass the LINQ enumerable to InsertBatch - it just hangs. It works fine with a small file. It does the same thing if I manually batch by using Skip & Take on the LINQ query that pulls lines out of the file: if I take the first 5000 lines and pass them to InsertBatch, the first call works but the second batch of 5000 just hangs. I thought it might be something to do with one of these, but I'm pretty stuck to be honest. Help! code snippet: var postcodesToInsert = File }); postcodesCollection.InsertBatch(postcodesToInsert); thanks |
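The snippet above is cut off; based on the description (File.ReadLines over the CSV, extracting the postcode from the first field), it presumably looked roughly like the sketch below. The connection string, database, collection and file names are assumptions, not the original code.

```csharp
using System.IO;
using System.Linq;
using MongoDB.Bson;
using MongoDB.Driver;

class Program
{
    static void Main()
    {
        // Connection string, database, collection and file names are assumptions.
        var client = new MongoClient("mongodb://localhost");
        var database = client.GetServer().GetDatabase("test");   // legacy 1.x API
        var postcodesCollection = database.GetCollection("postcodes");

        // Lazily enumerate the ~300MB CSV, keeping only the first field (the postcode);
        // the driver adds the "_id" field automatically on insert.
        var postcodesToInsert = File
            .ReadLines("postcodes.csv")
            .Select(line => new BsonDocument { { "postcode", line.Split(',')[0] } });

        postcodesCollection.InsertBatch(postcodesToInsert);
    }
}
```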
| Comments |
| Comment by Robert Stam [ 16/May/13 ] | ||||||||||
|
The WriteConcernResults do not include the number of documents inserted because the server doesn't return that information (the WriteConcernResult class is just wrapping the server's response to getLastError). Normally an InsertBatch stops at the first error encountered. So if you had a unique index on postcode the first duplicate postcode would stop processing of the rest of the batch (documents before the one with the duplicate postcode will already have been inserted and remain). You could use the InsertFlags.ContinueOnError flag to indicate that you want the server to continue with the rest of the batch even after an error is encountered. In that case, documents that result in an error are simply skipped. One thing to keep in mind though is that ContinueOnError will apply to all errors, not just the duplicate key error on postcode. | ||||||||||
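Not part of the original comment: a minimal sketch of passing the flag through the legacy 1.x API, reusing the collection and enumerable names from the description's sketch.

```csharp
// Sketch: ask the server to keep processing the batch after an error, so
// documents that hit the unique index on "postcode" are skipped instead of
// aborting the rest of the batch (names reused from the description's sketch).
var options = new MongoInsertOptions { Flags = InsertFlags.ContinueOnError };
postcodesCollection.InsertBatch(postcodesToInsert, options);
```

As noted above, the flag applies to every error in the batch, not just duplicate key violations on postcode.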
| Comment by Nic Pillinger [ 15/May/13 ] | ||||||||||
|
Brilliant, thanks Robert. I can see this is working now - sorry for the false alarm. One last question: is it possible to find out how many records the driver has inserted? Does the WriteConcernResult collection that is returned tell me this? And also, if I have a unique index on the postcode field, is it possible to use InsertBatch without getting errors (I just want to insert what I don't have and bin duplicates)? thanks again | ||||||||||
| Comment by Robert Stam [ 15/May/13 ] | ||||||||||
|
Right. If you ask InsertBatch to insert 5 million documents over a slow network it's going to take a while... InsertBatch enumerates the documents one at a time and adds them to a message to be sent to the server. The server has a max message size of 48MB, so once the message is full it is sent to the server and then a new message is started with the following documents. We call these sub-batches. The driver will continue sending sub-batches until the enumeration runs out of documents. You could definitely consider splitting the set of documents into smaller batches yourself. That way your call to InsertBatch wouldn't block for as long and you could do other things between the smaller batches, like print out progress reports. | ||||||||||
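Not from the thread itself: a rough sketch of the client-side batching suggested here, assuming the same setup as the description's sketch (plus `using System;` and `using System.Collections.Generic;`); the batch size of 5000 is arbitrary.

```csharp
// Split the CSV into smaller client-side batches so each InsertBatch call
// returns quickly and progress can be reported in between (batch size arbitrary).
const int batchSize = 5000;
var buffer = new List<BsonDocument>(batchSize);
long inserted = 0;

foreach (var line in File.ReadLines("postcodes.csv"))
{
    buffer.Add(new BsonDocument { { "postcode", line.Split(',')[0] } });
    if (buffer.Count == batchSize)
    {
        postcodesCollection.InsertBatch(buffer);
        inserted += buffer.Count;
        Console.WriteLine("Inserted {0} documents so far", inserted);
        buffer.Clear();
    }
}

if (buffer.Count > 0)   // flush the final partial batch
{
    postcodesCollection.InsertBatch(buffer);
    inserted += buffer.Count;
}
Console.WriteLine("Done: inserted {0} documents", inserted);
```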
| Comment by Nic Pillinger [ 15/May/13 ] | ||||||||||
|
Locally that is indeed the case - with the original problem I was pointing at an instance on MongoHQ, so I guess this is a network latency issue. Does InsertBatch insert/enumerate the entire enumerable in one go? Should I be splitting it into multiple calls by using Skip/Take to pull lines from the file in batches? | ||||||||||
| Comment by Nic Pillinger [ 14/May/13 ] | ||||||||||
|
thanks Robert, just after posting I had the same thought. I'm running in a VM on my iMac, so it's definitely not the fastest of boxes. I'm going to give it another shot and see if it's actually just inserting records. | ||||||||||
| Comment by Robert Stam [ 14/May/13 ] | ||||||||||
|
I attempted to reproduce using a file with 5 million lines of about 60 characters each (just under 300MB). I used a 9 character postcode and used a filler for the rest of the line. Using the code you posted it worked fine for me. However: it did take 77 seconds for the InsertBatch to return. Keep in mind that it inserted 5 million documents in 77 seconds, which works out to just over 64 thousand per second. Also, I have a fast machine with fast SSD. If your database is on a spinning hard drive it could take much longer to insert that much data. Are you sure that it was really hung? Could it have been making progress all along and you just didn't wait long enough? It's worth checking for that possibility. I used the mongo shell to monitor the progress along the way:
| ||||||||||
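The shell commands themselves aren't shown above; something along these lines would do the job (database and collection names are assumptions):

```
// run in the mongo shell while InsertBatch is running;
// re-run count() to watch the document count climb
use test
db.postcodes.count()
```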
| Comment by Robert Stam [ 14/May/13 ] | ||||||||||
|
I will try to reproduce. Can you tell me how many lines are in your CSV file and the average length of a post code? I would like to create test data that resembles yours. Looks like you are extracting the post code from the first field in the CSV file and ignoring the rest of the fields, right? Also, you are letting the driver add the "_id" field for you automatically, correct? It would also be helpful to look and see if there are any unusual entries in the server log around the same time as the InsertBatch hangs. |