[COMPASS-4163] Importing a large CSV file into Compass creates a document count mismatch Created: 24/Feb/20  Updated: 29/Oct/23  Resolved: 09/Mar/20

Status: Closed
Project: Compass
Component/s: Compass, Import/Export
Affects Version/s: 1.20.5
Fix Version/s: 1.21.0

Type: Bug Priority: Major - P3
Reporter: Felicia Hsieh Assignee: Lucas Hrabovsky (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

MacOS 10.14.6
Compass 1.20.5 Stable


Attachments: PNG File Screen Shot 2020-02-24 at 2.03.46 PM.png     PNG File Screen Shot 2020-02-24 at 2.05.46 PM.png     PNG File Screen Shot 2020-02-24 at 2.17.08 PM.png     PNG File Screen Shot 2020-02-24 at 2.17.25 PM.png     PNG File Screen Shot 2020-02-24 at 2.40.14 PM.png     PNG File Screenshot 2020-03-09 13.30.57.png     PNG File Screenshot 2020-03-09 13.31.19.png     PNG File Screenshot 2020-03-09 14.07.34.png    
Issue Links:
Problem/Incident
causes COMPASS-4198 Support headerless .csv imports Closed
Sprint: Iteration Yak

 Description   

Used Compass > Import Collection > CSV with a local copy of a very large (3.7GB) CSV file, http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv, linked from
https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads

The import errors out after about 20 seconds of attempting to complete. After clicking the "Import" button to continue, it seems to try again. The screenshot of the database's collections shows "0" documents, but inspecting the collection shows that 2.9M documents were actually saved.
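For reference, the mismatch is also visible from the mongo shell. A minimal sketch, using the database/collection names from the aggregation example in the comments below (testing / ppdComplete; substitute your own import target):

use testing;

// Fast, metadata-based count -- presumably what the Compass collection
// list displays; it can read "0" after an interrupted import.
db.ppdComplete.estimatedDocumentCount()

// Full-scan count -- reports the documents actually saved (~2.9M here).
db.ppdComplete.countDocuments({})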

Perhaps also give Compass users guidance on the maximum size of an import file that can be attempted.

 



 Comments   
Comment by Lucas Hrabovsky (Inactive) [ 09/Mar/20 ]

On current master, the import took ~30 min against localhost:

All 25M documents were created, but the source .csv is headerless, so the field names are munged into the values of the first row:
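Roughly, each imported document then looks like the sketch below: the field names are the first data row's values (the same strings the $project in workaround 2 references), and the placeholders stand in for values from subsequent rows:

// db.ppdComplete.findOne() returns something along these lines
// (values are placeholders, not real data):
{
  "{680A511A-C9CD-4E26-84EC-B6697A53CEF8}": "<row 2's transaction id>",
  "122000": "<row 2's price>",
  "1995-09-29 00:00": "<row 2's date>",
  "NG14 5DE": "<row 2's postcode>",
  "HILLCREST GARDENS": "<row 2's street>",
  ...
}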

I created COMPASS-4198 for triage to look into supporting headerless .csv in the UI.

Workarounds to avoid the wacky field names in the meantime:

1. Manually add the headers as the first line of that .csv before importing. My best guess at the field names, based on the docs:

"property_id","price","date","postal_code","field_1","field_2","field_3","paon","saon","street","locality","town_or_city","district","county","field_4","field_5"

2. Use the aggregation pipeline to clean it up (make a view or $out/$merge). Here are the field mappings you'll need for $project if you use the original .csv without adding headers:

use testing;

// Note: the $-prefixed strings below are field references. Because the
// .csv was imported without a header, the first row's values became the
// field names.
db.ppdComplete.aggregate([
  {
    $project: {
      price: { $toInt: "$122000" },  // prices import as strings; cast to int
      property_id: "${680A511A-C9CD-4E26-84EC-B6697A53CEF8}",
      date: "$1995-09-29 00:00",
      postal_code: "$NG14 5DE",
      paon: "$14",
      street: "$HILLCREST GARDENS",
      locality: "$BURTON JOYCE",
      town_or_city: "$NOTTINGHAM",
      district: "$GEDLING",
      county: "$NOTTINGHAMSHIRE"
    }
  },
  // Materialize the cleaned-up documents into a new collection.
  { $out: "ppdComplete_wrangled" }
])
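If you'd rather not duplicate 25M documents on disk, the same $project stage can back a read-only view instead of an $out collection. A sketch reusing the mappings above (the view name is an assumption):

// Queries against ppdComplete_clean see the friendly field names while
// the raw import stays untouched.
db.createView('ppdComplete_clean', 'ppdComplete', [
  {
    $project: {
      price: { $toInt: "$122000" },
      property_id: "${680A511A-C9CD-4E26-84EC-B6697A53CEF8}",
      date: "$1995-09-29 00:00",
      postal_code: "$NG14 5DE",
      paon: "$14",
      street: "$HILLCREST GARDENS",
      locality: "$BURTON JOYCE",
      town_or_city: "$NOTTINGHAM",
      district: "$GEDLING",
      county: "$NOTTINGHAMSHIRE"
    }
  }
])

db.ppdComplete_clean.findOne()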

Perhaps also give Compass users guidance on the maximum size of an import file that can be attempted.

We currently have no maximum size constraint on the client side.
