[SERVER-3276] mongoimport is trimming leading whitespace (including tabs) from every input record Created: 16/Jun/11  Updated: 12/Jul/16  Resolved: 27/Jun/11

Status: Closed
Project: Core Server
Component/s: Tools
Affects Version/s: 1.9.0
Fix Version/s: 1.9.1

Type: Bug Priority: Minor - P4
Reporter: richard bucker Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 0
Labels: mongoimport
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

n/a


Issue Links:
Related
related to TOOLS-61 mongoimport tab-delimited files retai... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

I have been able to confirm that this is likely a bug; however, as we all know, one programmer's bug is another's feature. That said, here are the particulars:

FILE: tools/import.cpp

While reading the input file, this piece of code (skip the JSON part) trims all of the leading whitespace from each record. However, if the input is a TSV, a leading tab gets gobbled up too, because isspace() treats a tab as whitespace. I don't think this was the intended behavior.

    // tools/import.cpp, lines 292-308
    if (_jsonArray) {
        while (buf[0] != '{' && buf[0] != '\0') {
            len++;
            buf++;
        }
        if (buf[0] == '\0')
            break;
    }
    else {
        while (isspace( buf[0] )) {
            len++;
            buf++;
        }
        if (buf[0] == '\0')
            continue;
        len += strlen( buf );
    }
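To make the effect concrete, here is a minimal standalone sketch. It is not taken from tools/import.cpp: trimLeading and splitTabs are hypothetical helpers, where trimLeading mirrors the isspace() loop in the non-JSON branch above and splitTabs is a naive tab splitter that keeps empty fields.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper mirroring the non-JSON branch quoted above:
// advance past every leading character that isspace() accepts.
const char* trimLeading(const char* buf) {
    assert(buf != nullptr);
    while (std::isspace(static_cast<unsigned char>(buf[0]))) {
        buf++;
    }
    return buf;
}

// Hypothetical helper: split a record on tabs, keeping empty fields.
std::vector<std::string> splitTabs(const std::string& record) {
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        std::size_t pos = record.find('\t', start);
        if (pos == std::string::npos) {
            fields.push_back(record.substr(start));
            break;
        }
        fields.push_back(record.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}
```

For the record "\tb\tc" (empty first field), splitTabs sees three fields before trimming and only two after trimLeading has run, so every remaining value shifts one column to the left.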

http://creativyst.com/Doc/Articles/CSV/CSV01.htm

I do not recall ever reading a formal spec for CSV, but this link is pretty good. In general, however, it's a design bug to trim the leading part of the record at this location in the code. Whitespace handling should sit as close as possible to the format-specific parser, so that each format can decide which characters are significant.

/r



 Comments   
Comment by auto [ 27/Jun/11 ]

Author: Spencer T Brody (stbrody) <spencer@10gen.com>

Message: Fix SERVER-3276 - mongoimport stripping leading tabs when importing TSV files
Branch: master
https://github.com/mongodb/mongo/commit/685203a89cc8f25d01bc197a2224c1ccaa6519f6

Comment by richard bucker [ 16/Jun/11 ]

Here is my corrected code with some inline comments.

  • It is OK to delete leading whitespace from a JSON document.
  • It is OK to delete whitespace adjacent to the separator.
  • Neither rule above says anything about leading spaces at the start of a record.
  • Since a horizontal tab (\t) is considered whitespace, it needs to be excluded from the scan in a TSV.

    // handle leading whitespace
    if (_jsonArray) {
        // it's json - ok to delete the leading WS
        while (buf[0] != '{' && buf[0] != '\0') {
            len++;
            buf++;
        }
        if (buf[0] == '\0')
            break;
    }
    else {
        // it's everything else, not OK to delete leading WS (certainly not tabs in a TSV)
        // only supposed to delete the WS adjacent to the sep.
        if (_type == TSV) {
            while (buf[0] != '\t' && isspace( buf[0] )) {
                len++;
                buf++;
            }
        }
        else {
            while (isspace( buf[0] )) {
                len++;
                buf++;
            }
        }
        if (buf[0] == '\0')
            continue;
        len += strlen( buf );
    }
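The same idea can be tried in isolation with the sketch below. It is only an illustration of the proposal in this comment, not necessarily what the linked commit does. The InputType enum and the skipLeading helper are hypothetical names; the real code only shows a comparison of _type against TSV.

```cpp
#include <cassert>
#include <cctype>

// Hypothetical stand-in for the importer's input type.
enum class InputType { JSON, CSV, TSV };

// Hypothetical helper: return a pointer past the leading whitespace that is
// safe to discard. For TSV input the scan stops at the first tab, because a
// tab is a field separator there and must be preserved.
const char* skipLeading(const char* buf, InputType type) {
    assert(buf != nullptr);
    while (std::isspace(static_cast<unsigned char>(buf[0]))) {
        if (type == InputType::TSV && buf[0] == '\t') {
            break;
        }
        buf++;
    }
    return buf;
}
```

With the input "  \tb", a TSV scan stops at the tab while a CSV scan continues to 'b'. The cast to unsigned char avoids undefined behavior when isspace() receives a negative char value.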
