mongoimport skips records when _id field is custom-populated


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major - P3
    • Affects Version/s: 2.4.10
    • Component/s: Tools
    • Operating System: ALL

      Unfortunately, I don't have a sample dataset on me right now that I can share publicly. If this bug is not completely shot down, I will generate one and attach it to this bug report.

      To reproduce, try importing a large file of JSON records with the _id field populated into an empty collection, e.g.

      /usr/bin/mongoimport -d mydb -c mycollection --stopOnError --file largefile.json

      Then count the records in the collection and check whether the count matches the number of records in the file.
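      The count comparison above can be sketched as follows. This is a minimal illustration, not part of the report: it assumes the import file is in mongoimport's default format (one JSON document per line, so the line count equals the expected document count), and the names `mydb`/`mycollection` are the report's own placeholders. The actual collection count would come from a running mongod.

      ```python
      import json

      def expected_count(path):
          """Count the documents mongoimport should insert from a
          newline-delimited JSON file (its default input format)."""
          with open(path) as f:
              # Parse each non-empty line to make sure it is one valid document.
              return sum(1 for line in f if line.strip() and json.loads(line))

      # With pymongo and a running server, the comparison would then be
      # (assumed usage, mirroring the report's manual check):
      #   actual = MongoClient().mydb.mycollection.count_documents({})
      #   assert expected_count("largefile.json") == actual
      ```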


      The mongoimport tool, when applied to a large file of JSON records (say 200k+ records) where the _id field is populated with a user-defined (unique) value, appears to skip records arbitrarily, without reporting any error.

      I'm not the only one to have experienced this issue:

      http://wi2ki.blogspot.com/2012/11/mongodb-user-mongoimport-did-not-load.html
      http://wi2ki.blogspot.com/2012/11/mongodb-user-re-mongoimport-did-not.html

      Apparently the import works correctly if the _id field is not populated and MongoDB is left to generate the _id by itself.

      I suspect the bug is in mongoimport, because I now have a workaround script that goes through the original imported JSON file and checks each _id against the db. If a record is missing, the script dumps it into a new file, which is then imported. That works fine, presumably due to the smaller number of records in the new file. The original import appears to skip on the order of 1% of the records, so a few thousand out of a few hundred thousand.
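      A hedged sketch of the recovery logic the workaround describes (the function name and the `db_ids` parameter are hypothetical; in practice the set of _ids already in the db would come from a query such as pymongo's `collection.distinct("_id")`):

      ```python
      import json

      def find_missing_records(source_lines, db_ids):
          """Return records from the original import file whose _id is
          absent from the database.

          source_lines: iterable of newline-delimited JSON documents
                        (mongoimport's default input format).
          db_ids:       set of _id values already present in the collection
                        (hypothetical; would be fetched from the db).
          """
          missing = []
          for line in source_lines:
              if not line.strip():
                  continue
              record = json.loads(line)
              if record["_id"] not in db_ids:
                  missing.append(record)
          return missing

      # The missing records would then be written to a new file and
      # re-imported with mongoimport, as the workaround describes:
      #   with open("missing.json", "w") as f:
      #       for rec in missing:
      #           f.write(json.dumps(rec) + "\n")
      ```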

      I'm marking this as Major. Though there is a workaround, this issue caused us no end of headaches around data integrity, which is pretty important to most projects.

            Assignee:
            Ramon Fernandez
            Reporter:
            Clark Freifeld
            Votes:
            0
            Watchers:
            3

              Created:
              Updated:
              Resolved: