[CDRIVER-652] Number formatting and whitespace in bson_as_json Created: 14/May/15  Updated: 30/Sep/19  Resolved: 29/May/15

Status: Closed
Project: C Driver
Component/s: json, libbson
Affects Version/s: None
Fix Version/s: 1.2-beta0

Type: Improvement
Priority: Major - P3
Reporter: Jeroen Ooms [X]
Assignee: A. Jesse Jiryu Davis
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to CDRIVER-2063 JSON export prints insignificant digi... Closed
related to CDRIVER-3377 Double value retrieved from bson is d... Closed

 Description   

I am implementing a client version of `mongoexport` in the R driver. It uses bson_as_json to convert BSON records to JSON lines, which are then streamed to a file or connection. It works, but the BSON-to-JSON conversion is suboptimal.

{ "_id" : { "$oid" : "5555102760f0cc03b65c8331" }, "Sepal.Length" : 5.100000, "Sepal.Width" : 3.500000, "Petal.Length" : 1.400000, "Petal.Width" : 0.200000, "Species" : "setosa" }
{ "_id" : { "$oid" : "5555102760f0cc03b65c8332" }, "Sepal.Length" : 4.900000, "Sepal.Width" : 3, "Petal.Length" : 1.400000, "Petal.Width" : 0.200000, "Species" : "setosa" }
{ "_id" : { "$oid" : "5555102760f0cc03b65c8333" }, "Sepal.Length" : 4.700000, "Sepal.Width" : 3.200000, "Petal.Length" : 1.300000, "Petal.Width" : 0.200000, "Species" : "setosa" }

There are at least two issues. First, the output contains unnecessary whitespace. A bigger issue is the number formatting: libbson seems to print doubles with a fixed number of decimal digits, which results in trailing zeros for most values and loss of precision for small numbers.
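The difference between the two formatting styles can be reproduced with plain printf-style formatting. The following is a minimal sketch, not libbson's code: the helper names `format_fixed`, `format_significant`, and `parse_back` are hypothetical, and the use of "%f" for the current behavior is an assumption based on the six-decimal output shown above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: six fixed decimals, matching the output shown above.
 * 5.1 becomes "5.100000" and 1e-7 becomes "0.000000". */
int format_fixed(char *buf, size_t len, double v)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%f", v);
}

/* Hypothetical helper: up to 15 significant digits, trailing zeros dropped.
 * 5.1 becomes "5.1" and 1e-7 becomes "1e-07". */
int format_significant(char *buf, size_t len, double v)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%.15g", v);
}

/* Parse the text back into a double, as a JSON consumer would. */
double parse_back(const char *text)
{
    return strtod(text, NULL);
}
```

With these helpers, formatting 1e-7 with `format_fixed` and reading it back with `parse_back` yields 0, while the `format_significant` text reads back as the original value.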

By comparison, the real `mongoexport` utility outputs this for the same data:

{"Petal.Length":1.4,"Petal.Width":0.2,"Sepal.Length":5.1,"Sepal.Width":3.5,"Species":"setosa","_id":{"$oid":"5555102760f0cc03b65c8331"}}
{"Petal.Length":1.4,"Petal.Width":0.2,"Sepal.Length":4.9,"Sepal.Width":NumberInt(3),"Species":"setosa","_id":{"$oid":"5555102760f0cc03b65c8332"}}
{"Petal.Length":1.3,"Petal.Width":0.2,"Sepal.Length":4.7,"Sepal.Width":3.2,"Species":"setosa","_id":{"$oid":"5555102760f0cc03b65c8333"}}

Ideally output from bson_as_json would be identical to mongoexport, but I understand yajl might have its limitations.



 Comments   
Comment by A. Jesse Jiryu Davis [ 11/Jul/16 ]

Thanks for the info! Right now, none of the drivers distinguish ints from floats when they export JSON. We may change our minds in the future.

Comment by Jeroen Ooms [X] [ 11/Jul/16 ]

This is an old issue, but I found that yajl also exports doubles with at least one decimal to distinguish them from integers. It does so simply by first printing the number, and then appending `.0` if the result consists only of the characters `-0123456789`:

https://github.com/mongodb/libbson/blob/e1c1516a64a39cc0cb8b9c31b14099a3c926cfb0/src/yajl/yajl_gen.c#L233-L236

This seems more reliable than fmod and also prevents the problems above when doubles get printed in scientific notation.
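The approach described above can be sketched roughly as follows. This is an illustration rather than the yajl source: the function name `format_double_json`, the "%.15g" format, and the buffer handling are assumptions, and non-finite values (which JSON cannot represent) are not handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: print v, then append ".0" when the text looks like
 * an integer. Returns the text length, or -1 if the buffer is too small. */
int format_double_json(char *buf, size_t len, double v)
{
    assert(buf != NULL && len > 0);

    int n = snprintf(buf, len, "%.15g", v);

    /* Always reserve room for a possible ".0" plus the terminating NUL. */
    if (n < 0 || (size_t)n + 3 > len) {
        return -1;
    }

    /* Only digits and a minus sign: no '.', 'e', or '+', so without a
     * suffix a reader could mistake the value for an integer. */
    if (strspn(buf, "-0123456789") == (size_t)n) {
        memcpy(buf + n, ".0", 3); /* copies '.', '0', and the NUL */
        n += 2;
    }
    return n;
}
```

Because the check runs on the printed text, a value that comes out in scientific notation such as 1e+25 is left alone, whereas a check on the numeric value (for example with fmod) would need a separate rule for that case.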

Comment by Githook User [ 11/Jan/16 ]

Author: A. Jesse Jiryu Davis (ajdavis, jesse@mongodb.com)

Message: CDRIVER-652 just use "%g" for floats in bson_as_json
Branch: 1.3.0-dev
https://github.com/mongodb/libbson/commit/2fd5fb266f6280855e80eea527e9b87007e98f1f

Comment by Githook User [ 11/Jan/16 ]

Author: A. Jesse Jiryu Davis (ajdavis, jesse@mongodb.com)

Message: CDRIVER-652 nicer float format in bson_as_json

Format with flexible precision after the decimal point, and for convenience
always include one digit after the point if the BSON type is double.
Branch: 1.3.0-dev
https://github.com/mongodb/libbson/commit/eeb6192ad8f83d20fbde2f2c515692649dd5dc76

Comment by Githook User [ 20/Oct/15 ]

Author: A. Jesse Jiryu Davis (ajdavis, jesse@mongodb.com)

Message: CDRIVER-652 just use "%g" for floats in bson_as_json
Branch: debian
https://github.com/mongodb/libbson/commit/2fd5fb266f6280855e80eea527e9b87007e98f1f

Comment by Githook User [ 20/Oct/15 ]

Author: A. Jesse Jiryu Davis (ajdavis, jesse@mongodb.com)

Message: CDRIVER-652 nicer float format in bson_as_json

Format with flexible precision after the decimal point, and for convenience
always include one digit after the point if the BSON type is double.
Branch: debian
https://github.com/mongodb/libbson/commit/eeb6192ad8f83d20fbde2f2c515692649dd5dc76

Comment by Githook User [ 07/Oct/15 ]

Author: A. Jesse Jiryu Davis (ajdavis, jesse@mongodb.com)

Message: CDRIVER-652 "%g" for floats in bson_as_json in 1.2.x
Branch: master
https://github.com/mongodb/libbson/commit/9eb14d12c8a3495e0c99f2fb6238167f050df559

Comment by Githook User [ 01/Oct/15 ]

Author: A. Jesse Jiryu Davis (ajdavis, jesse@mongodb.com)

Message: CDRIVER-652 just use "%g" for floats in bson_as_json
Branch: 1.2.0-dev
https://github.com/mongodb/libbson/commit/2fd5fb266f6280855e80eea527e9b87007e98f1f

Comment by Githook User [ 01/Oct/15 ]

Author: A. Jesse Jiryu Davis (ajdavis, jesse@mongodb.com)

Message: CDRIVER-652 nicer float format in bson_as_json

Format with flexible precision after the decimal point, and for convenience
always include one digit after the point if the BSON type is double.
Branch: 1.2.0-dev
https://github.com/mongodb/libbson/commit/eeb6192ad8f83d20fbde2f2c515692649dd5dc76

Comment by A. Jesse Jiryu Davis [ 29/May/15 ]

1.2.0: https://github.com/mongodb/libbson/commit/9eb14d12c8a3495e0c99f2fb6238167f050df559

1.1.7: https://github.com/mongodb/libbson/commit/2fd5fb266f6280855e80eea527e9b87007e98f1f

Comment by Jeroen Ooms [X] [ 29/May/15 ]

I hope that means

%.15g

for all doubles

Comment by A. Jesse Jiryu Davis [ 29/May/15 ]

I now agree with you, and additionally think it should not be a goal to distinguish ints and doubles. I'll do the simplest thing possible that conforms with the JSON spec.

Comment by Jeroen Ooms [X] [ 28/May/15 ]

Actually I take that back: I don't think forcing decimal notation for doubles is a good idea. It leads to really poor formatting for large numbers. For example, with the current implementation, the numbers 10^25 through 10^30 are printed with a lot of non-significant noise:

{ "_id" : { "$oid" : "556790b7ba7cb2118851dbe8" }, "x" : 10000000000000000905969664.0 }
{ "_id" : { "$oid" : "556790b7ba7cb2118851dbe9" }, "x" : 100000000000000004764729344.0 }
{ "_id" : { "$oid" : "556790b7ba7cb2118851dbea" }, "x" : 1000000000000000013287555072.0 }
{ "_id" : { "$oid" : "556790b7ba7cb2118851dbeb" }, "x" : 9999999999999999583119736832.0 }
{ "_id" : { "$oid" : "556790b7ba7cb2118851dbec" }, "x" : 99999999999999991433150857216.0 }
{ "_id" : { "$oid" : "556790b7ba7cb2118851dbed" }, "x" : 1000000000000000019884624838656.0 }

With scientific notation you only get to see the actual significant digits, which is a more accurate representation of the value:

{ "_id" : { "$oid" : "556790b7ba7cb2118851dbe8" }, "x" : 1e+25 }
{ "_id" : { "$oid" : "556790b7ba7cb2118851dbe9" }, "x" : 1e+26 }
{ "_id" : { "$oid" : "556790b7ba7cb2118851dbea" }, "x" : 1e+27 }
{ "_id" : { "$oid" : "556790b7ba7cb2118851dbeb" }, "x" : 1e+28 }
{ "_id" : { "$oid" : "556790b7ba7cb2118851dbec" }, "x" : 1e+29 }
{ "_id" : { "$oid" : "556790b7ba7cb2118851dbed" }, "x" : 1e+30 }

I think it's better to stick with scientific notation for all numbers, which works for both very large and very small numbers, and only prints actual signal from the number.
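The point about noise digits can be checked with a small experiment. This is a sketch under assumptions, not the libbson implementation: the helper names are hypothetical, "%.1f" stands in for the forced decimal notation, and "%.15g" stands in for the proposed format.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: forced decimal notation with one fractional digit.
 * For 1e25 this prints 26 integer digits followed by ".0". */
int format_decimal(char *buf, size_t len, double v)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%.1f", v);
}

/* Hypothetical helper: general notation with up to 15 significant digits.
 * For 1e25 this prints "1e+25". */
int format_general(char *buf, size_t len, double v)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%.15g", v);
}

/* Returns 1 if the text parses back to exactly the expected double. */
int parses_to(const char *text, double expected)
{
    return strtod(text, NULL) == expected;
}
```

Both strings for 1e25 parse back to the same double, so the extra digits in the decimal form carry no additional information about the stored value.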

Comment by Jeroen Ooms [X] [ 28/May/15 ]

Looks good. I haven't tested this yet, but if you make yajl parse whole numbers into integers then you can probably roundtrip numbers without loss of type, which would be very nice.

Comment by A. Jesse Jiryu Davis [ 28/May/15 ]

Thanks for the fix. Additionally I added, for convenience, something mongoexport doesn't do: libbson's bson_as_json formats BSON doubles like "1.0" and integers like "1". JSON doesn't distinguish between them (all numbers are floats in JSON) and mongoexport doesn't, either. Thoughts?

https://github.com/mongodb/libbson/commit/eeb6192a

Comment by Jeroen Ooms [X] [ 28/May/15 ]

FYI the related issue with integer formatting has been resolved in mongoexport: https://jira.mongodb.org/browse/TOOLS-741

Comment by Jeroen Ooms [X] [ 16/May/15 ]

Yes that makes sense. I don't mind too much about whitespace, it's just a bit of overhead but not as big of a problem as the number formatting.

Comment by A. Jesse Jiryu Davis [ 16/May/15 ]

Seems wise. Rather than changing how whitespace is displayed – someone else's code might rely on the way it's displayed now – I'd prefer to fix the bug and stop there. In the future I may add APIs to override options on the JSON formatter, not just whitespace but also indentation.

Comment by Jeroen Ooms [X] [ 15/May/15 ]

I was able to fix the number formatting problem: https://github.com/mongodb/libbson/pull/127. It is a simple fix that changes the number formatting for real numbers to a sensible default. I hope you can find a minute to review it.

Taking out the whitespace seems quite easy as well but it requires a lot of small changes and is probably a bit more controversial, so I'll leave that alone for now.

Comment by A. Jesse Jiryu Davis [ 14/May/15 ]

Thanks Jeroen, I'd like to add features to the JSON generator as well, but it's low-priority in the scheme of things. The integer formatting is certainly a problem that must be fixed eventually, however.

Comment by Jeroen Ooms [X] [ 14/May/15 ]

There is actually a bug in mongoexport's handling of integers, which is obviously not the desired behavior: https://jira.mongodb.org/browse/TOOLS-741
