[SERVER-37575] BinData compare uses the Base64 string not the raw bytes Created: 11/Oct/18  Updated: 07/Apr/23  Resolved: 22/Oct/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.4.16
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Glen Miner Assignee: Asya Kamsky
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux x64


Operating System: ALL
Steps To Reproduce:

I can make 3 binary encoded IPv6 addresses like this:
 

> var a = BinData(0,"JA4A8gAxqTwciCuF5GGzAA==")
> var b = BinData(0,"JA4A8gAxqTwciCuF5GGzTw==")
> var c = BinData(0,"JA4A8gAxqTwciCuF5GGz/w==")

You can confirm these sort properly a < b < c in hex like this:

> a.hex()
240e00f20031a93c1c882b85e461b300
> b.hex()
240e00f20031a93c1c882b85e461b34f
> c.hex()
240e00f20031a93c1c882b85e461b3ff

And of course as you would expect the shell agrees:

> a.hex() < b.hex()
true
> b.hex() < c.hex()
true
> a.hex() < c.hex()
true

However if I compare the objects themselves I get a peculiar answer:

> a < b
true
> b < c
false
> a < c
false

After some digging this peculiar result seemed to match the string-compare of the base-64 encoded data:

> a.toString() < b.toString()
true
> b.toString() < c.toString()
false
> a.toString() < c.toString()
false
 

Which tragically is inconsistent with the "byte-wise lexicographic sort" implied by the documentation. Anecdotal evidence suggests that server-side sorting has the same behavior.

Participants:

 Description   

https://docs.mongodb.com/manual/reference/bson-type-comparison-order/#bindata

Says MongoDB sorts BinData in the following order:

.... Finally, by the data, performing a byte-by-byte comparison.
 
We had engineered this around the expectation that this order would compare the bytes in the raw data however we have discovered to our dismay that it appears to be comparing the Base64-encoded strings instead which is not the same order as comparing the bytes of the raw data.
Note that this this doesn't just apply to the shell; I was led to this conclusion when queries were not returning data I was expecting and assume I could demonstrate the same behavior with a server-side query as well.

This is actually a serious bug because it means you can't properly sequence binary data (like IPv6 addresses) in a way that supports range compares but I have no expectation that this will actually be fixed –  I suspect instead that this will turn into a documentation bug that warns others off from using this datatype.

 



 Comments   
Comment by Glen Miner [ 31/Oct/18 ]

Thank you for investigating! I'm not sure if it's worth making a new bug or migrating this one for the shell compare operators – I don't care about that personally but it might be a trap people fall into like I did. 

I retraced my steps and I was seeing this with aggregate commands matching on $gte $lte bounds but I cannot reproduce this with the test data above. It's possible that a TTL index was removing items that made aggregates come back empty but I kind of doubt it. I will re-open when if I see it again.

Thanks again!

Comment by Asya Kamsky [ 22/Oct/18 ]

> Note that this this doesn't just apply to the shell;

This only seems to apply to the shell - when storing these values in the server, I can only get correct results, I cannot reproduce any query or aggregation where comparison happens as a string nor returns unexpected/incorrect result:

> db.binary.drop()
true
> db.binary.insert({a:a, b:b, c:c})
WriteResult({ "nInserted" : 1 })
> db.binary.aggregate({$addFields:{aLessB:{$lt:["$a","$b"]}, bLessC:{$lt:["$b", "$c"]}, aLessC:{$lt:["$a","$c"]}}}).pretty()
{
	"_id" : ObjectId("5bce4aac64bac589b170f3d0"),
	"a" : BinData(0,"JA4A8gAxqTwciCuF5GGzAA=="),
	"b" : BinData(0,"JA4A8gAxqTwciCuF5GGzTw=="),
	"c" : BinData(0,"JA4A8gAxqTwciCuF5GGz/w=="),
	"aLessB" : true,
	"bLessC" : true,
	"aLessC" : true
}
> db.binary.drop()
true
> db.binary.insert({ip:a})
WriteResult({ "nInserted" : 1 })
> db.binary.insert({ip:b})
WriteResult({ "nInserted" : 1 })
> db.binary.insert({ip:c})
WriteResult({ "nInserted" : 1 })
> db.binary.find().sort({ip:1})
{ "_id" : ObjectId("5bce497264bac589b170f3cc"), "ip" : BinData(0,"JA4A8gAxqTwciCuF5GGzAA==") }
{ "_id" : ObjectId("5bce497464bac589b170f3cd"), "ip" : BinData(0,"JA4A8gAxqTwciCuF5GGzTw==") }
{ "_id" : ObjectId("5bce497764bac589b170f3ce"), "ip" : BinData(0,"JA4A8gAxqTwciCuF5GGz/w==") }
> db.binary.find().sort({ip:-1})
{ "_id" : ObjectId("5bce497764bac589b170f3ce"), "ip" : BinData(0,"JA4A8gAxqTwciCuF5GGz/w==") }
{ "_id" : ObjectId("5bce497464bac589b170f3cd"), "ip" : BinData(0,"JA4A8gAxqTwciCuF5GGzTw==") }
{ "_id" : ObjectId("5bce497264bac589b170f3cc"), "ip" : BinData(0,"JA4A8gAxqTwciCuF5GGzAA==") }
> db.binary.find({ip:{$lt:a}})
> db.binary.find({ip:{$lt:b}})
{ "_id" : ObjectId("5bce4c6f64bac589b170f3d1"), "ip" : BinData(0,"JA4A8gAxqTwciCuF5GGzAA==") }
> db.binary.find({ip:{$lt:c}})
{ "_id" : ObjectId("5bce4c6f64bac589b170f3d1"), "ip" : BinData(0,"JA4A8gAxqTwciCuF5GGzAA==") }
{ "_id" : ObjectId("5bce4c7264bac589b170f3d2"), "ip" : BinData(0,"JA4A8gAxqTwciCuF5GGzTw==") }
> db.binary.find({ip:{$lte:c}})
{ "_id" : ObjectId("5bce4c6f64bac589b170f3d1"), "ip" : BinData(0,"JA4A8gAxqTwciCuF5GGzAA==") }
{ "_id" : ObjectId("5bce4c7264bac589b170f3d2"), "ip" : BinData(0,"JA4A8gAxqTwciCuF5GGzTw==") }
{ "_id" : ObjectId("5bce4c7464bac589b170f3d3"), "ip" : BinData(0,"JA4A8gAxqTwciCuF5GGz/w==") }

I'm closing this ticket as "Cannot reproduce" but if you can find an example of data/query that returns the wrong results from the server please reopen and include any details you can about exact query so we can reproduce and fix it.

Generated at Thu Feb 08 04:46:23 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.