[SERVER-14913] mongoimport imports CSV incorrectly in the presence of an even number of escaped quotes Created: 15/Aug/14  Updated: 10/Dec/14  Resolved: 28/Aug/14

Status: Closed
Project: Core Server
Component/s: Tools
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Andrew Erlichson Assignee: Matt Kangas
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to DOCS-3969 Need warning that our CSV format foll... Closed
Operating System: ALL
Steps To Reproduce:

Create a file called bad.csv

bad.csv

"first","last"
"joe","smith"
"bad","guy\""
"evil","monster"
"sam\"","mill"

Now import it

desktop:MongoDB aje$ mongoimport --type csv -c bad --drop --headerline < bad.csv
connected to: 127.0.0.1
2014-08-15T11:09:02.225-0400 dropping: test.bad
2014-08-15T11:09:02.254-0400 imported 2 objects

Now let's look at the collection:

m101:PRIMARY> db.bad.find().pretty()
{
	"_id" : ObjectId("53ee228e35d4ea0c46429cac"),
	"first" : "joe",
	"last" : "smith"
}
{
	"_id" : ObjectId("53ee228e35d4ea0c46429cad"),
	"first" : "bad",
	"last" : "guy\\\"\n",
	"field2" : ",",
	"field3" : "\n",
	"field4" : "",
	"field5" : "mill"
}
m101:PRIMARY> 

There should be four documents, but there are only two.


 Description   

When a CSV file contains an even number of quotes escaped as \", the parser misjudges where quoted fields end and reads across line endings, coalescing multiple lines into a single document.

Wikipedia says that embedded quotes need to be encoded as "", so arguably the CSV file does not conform to Jimmy Wales's view of CSV. However, this particular encoding is the default used by MySQL, so we probably need it to work, or at least throw an error.
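
As an illustration of the "at least throw an error" idea, here is a minimal Python sketch of a pre-import check that flags lines containing a backslash followed by a double quote. It is not part of mongoimport; the function name find_backslash_quotes is hypothetical, and the check is only a heuristic, since a valid RFC-style field that legitimately ends in a backslash (for example "C:\") would also be flagged.

```python
import re

# A backslash immediately followed by a double quote, the escape style seen in bad.csv.
BACKSLASH_QUOTE = re.compile(r'\\"')


def find_backslash_quotes(lines):
    """Return the 1-based numbers of lines that contain a backslash-escaped quote.

    Hypothetical helper for illustration only; it does not parse CSV, it just
    scans each line for the two-character sequence backslash + double quote.
    """
    return [n for n, line in enumerate(lines, start=1) if BACKSLASH_QUOTE.search(line)]


# The contents of bad.csv from the steps to reproduce.
sample = (
    '"first","last"\n'
    '"joe","smith"\n'
    '"bad","guy\\""\n'
    '"evil","monster"\n'
    '"sam\\"","mill"\n'
)

suspect = find_backslash_quotes(sample.splitlines())
if suspect:
    print("backslash-escaped quotes on lines:", suspect)
```

Running a check like this before the import would surface the two problem lines instead of silently producing merged documents.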



 Comments   
Comment by Matt Kangas [ 28/Aug/14 ]

mongoimport's CSV parser conforms to RFC 4180, which specifies:

7. If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:

"aaa","b""bb","ccc"

Backslash is not a valid escape character per the RFC spec.

Of course, the root problem is that CSV was poorly specified for a long time, so considerable differences exist among implementations. If we should add an option to mongoimport to support MySQL's variant of CSV, please let me and mpobrien know.
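
Until such an option exists, one possible workaround is to re-encode the file before importing it. The sketch below uses Python's standard csv module to read backslash-escaped input and write it back with doubled quotes, the form RFC 4180 describes. The function name mysql_csv_to_rfc4180 is hypothetical, and the sketch assumes the backslash only ever escapes the character that follows it literally; other MySQL escape sequences such as \N or \n are not interpreted specially.

```python
import csv
import io


def mysql_csv_to_rfc4180(text):
    """Re-encode backslash-escaped CSV text so embedded quotes are doubled.

    Hypothetical helper for illustration; assumes a backslash simply makes the
    next character literal.
    """
    # Read side: treat backslash as the escape character instead of doubled quotes.
    reader = csv.reader(io.StringIO(text, newline=""), escapechar="\\", doublequote=False)
    out = io.StringIO(newline="")
    # Write side: the csv writer doubles embedded quotes by default.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


# The contents of bad.csv from the steps to reproduce.
sample = (
    '"first","last"\n'
    '"joe","smith"\n'
    '"bad","guy\\""\n'
    '"evil","monster"\n'
    '"sam\\"","mill"\n'
)

converted = mysql_csv_to_rfc4180(sample)
print(converted)
```

The converted text keeps one record per line, with the embedded quotes written as "" so that an RFC 4180 parser reads five rows (the header plus four records).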
