[SERVER-14666] mongoimport skips records when _id field is custom-populated Created: 23/Jul/14  Updated: 10/Dec/14  Resolved: 24/Jul/14

Status: Closed
Project: Core Server
Component/s: Tools
Affects Version/s: 2.4.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Clark Freifeld Assignee: Ramon Fernandez Marina
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

Unfortunately, I don't have a sample dataset on me right now that I can share publicly. If this bug is not completely shot down, I will generate one and attach it to this bug report.

To reproduce, try importing a large file of JSON records, each with a custom-populated _id, into an empty collection, e.g.

/usr/bin/mongoimport -d mydb -c mycollection --stopOnError --file largefile.json

Then go into the db, count the documents in the collection, and check whether the count matches the number of records in the file.
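Since no shareable dataset is attached, here is a minimal sketch of how one could be generated for the reproduction attempt. It writes JSON-lines records with user-defined unique _id values; the field names and record shape are illustrative, not from the ticket:

```python
import json
import random

def write_sample_dataset(path, n):
    """Write n JSON-lines records, each with a unique user-defined _id."""
    with open(path, "w") as f:
        for i in range(n):
            # Custom string _id instead of a mongo-generated ObjectId.
            record = {"_id": "id_%d" % i, "c": 1, "v": random.randrange(100)}
            f.write(json.dumps(record) + "\n")
```

The resulting file can then be fed to mongoimport as shown above, and `wc -l` on the file compared against the collection's document count afterwards.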

Participants:

 Description   

The mongoimport tool, when applied to a large file of JSON records (say 200k+), where the _id field is populated with a user-defined (unique) value, appears to skip records arbitrarily, without reporting any error.

I'm not the only one to have experienced this issue:

http://wi2ki.blogspot.com/2012/11/mongodb-user-mongoimport-did-not-load.html
http://wi2ki.blogspot.com/2012/11/mongodb-user-re-mongoimport-did-not.html

Apparently the import works correctly if the _id field is not populated and mongo is left to generate the _id by itself.

I suspect the custom _id is the trigger because I now have a script that goes through the originally imported JSON file and checks each _id against the db. If a record is missing, it dumps it into a new file and then imports that file. That works fine, presumably due to the smaller number of records in the new file. The original import appears to skip on the order of 1% of the records, i.e. a few thousand out of a few hundred thousand.
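The repair script described above is not attached to the ticket; a stdlib-only sketch of its filtering step follows. It assumes the set of _id values already present in the db has been fetched separately (e.g. via a driver query), and copies every record absent from that set into a new file for re-import:

```python
import json

def find_missing_records(json_path, existing_ids, out_path):
    """Copy each record whose _id is absent from existing_ids into out_path.

    existing_ids is assumed to be the set of _id values already in the db.
    Returns the number of missing records written.
    """
    missing = 0
    with open(json_path) as src, open(out_path, "w") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if record["_id"] not in existing_ids:
                dst.write(line + "\n")
                missing += 1
    return missing
```

The output file can then be passed to mongoimport a second time, as the ticket describes.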

I'm marking this as Major: although there is a workaround, this issue caused us no end of data-integrity headaches, and data integrity is critical to most projects.



 Comments   
Comment by Ramon Fernandez Marina [ 24/Jul/14 ]

clarkfreifeld, I'm not able to reproduce this problem. Since you didn't provide a dataset I first created one and exported it to a file:

> db.foo.drop()
> for (i=0; i<1000000; i++) {db.foo.insert({_id:"id_"+i, c:1, v:Math.floor(100*Math.random())})}
> db.foo.find().itcount()
1000000
> exit
bye
$ mongoexport -d test -c foo > data.js
$ wc -l data.js 
1000000 data.js

Then I imported this dataset with mongoimport as you describe:

$ mongoimport -d new -c bar --stopOnError --file data.js 
connected to: 127.0.0.1
Thu Jul 24 10:31:26.153                 Progress: 8750290/42788907      20%
Thu Jul 24 10:31:26.153                         204500  68166/second
Thu Jul 24 10:31:29.001                 Progress: 16375280/42788907     38%
Thu Jul 24 10:31:29.001                         382700  63783/second
Thu Jul 24 10:31:32.083                 Progress: 24103063/42788907     56%
Thu Jul 24 10:31:32.083                         563300  62588/second
Thu Jul 24 10:31:35.097                 Progress: 32224369/42788907     75%
Thu Jul 24 10:31:35.097                         753100  62758/second
Thu Jul 24 10:31:38.185                 Progress: 40349847/42788907     94%
Thu Jul 24 10:31:38.185                         943000  62866/second
Thu Jul 24 10:31:39.038 check 9 1000000
Thu Jul 24 10:31:39.773 imported 1000000 objects
$ mongo
MongoDB shell version: 2.4.10
connecting to: test
> use new
switched to db new
> db.bar.find().itcount()
1000000

Have you checked that your _id fields are indeed unique? Because if they're not, chances are you're running into SERVER-12045.
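The uniqueness check suggested above can be done offline against the input file before importing. A minimal sketch, assuming one JSON document per line with a scalar _id (hashable; not an embedded document):

```python
import json
from collections import Counter

def duplicate_ids(json_path):
    """Return the sorted _id values that occur more than once in a JSON-lines file."""
    counts = Counter()
    with open(json_path) as f:
        for line in f:
            line = line.strip()
            if line:
                counts[json.loads(line)["_id"]] += 1
    return sorted(_id for _id, n in counts.items() if n > 1)
```

If this returns a non-empty list, the later duplicates would be silently dropped on insert, which matches the symptom reported here and points at SERVER-12045 rather than a mongoimport bug.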

I'm going to resolve this as "Can't reproduce", but if someone can upload a dataset that reproduces the problem I'll be happy to re-open the ticket and investigate further.

Regards,
Ramón.

Generated at Thu Feb 08 03:35:35 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.