[SERVER-5030] Document equality should be independent of field insertion order Created: 21/Feb/12  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Querying
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor - P4
Reporter: Glenn Maynard Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 9
Labels: bson2, query_triage
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File server5030.js    
Issue Links:
Depends
Duplicate
is duplicated by SERVER-22817 Duplicate _id in the same collection ... Closed
is duplicated by SERVER-22529 search order for embedded documents, ... Closed
is duplicated by SERVER-30392 Query does not return a document when... Closed
Related
related to SERVER-30705 Concurrent updates and FCV change can... Closed
is related to SERVER-30197 $eq-ordered and $eq-unordered Backlog
Assigned Teams:
Query Execution
Participants:
Case:

 Description   

Equality of documents should not depend on the order of the fields. The following documents should be considered equal:

{a: 1, b: 2} {b: 2, a: 1}

The above documents would be considered the same document for purposes of unique indexes, and would be considered a match for searching.

This allows documents to map much more naturally to the native, built-in dictionary types in various languages. Most native dictionary types don't preserve order, and builtin dictionary types that support equality (Python, Ruby) don't compare insertion order. Ruby, which does explicitly preserve insertion order, also still does not consider insertion order for equality; {:a=>1, :b=>2} == {:b=>2, :a=>1} is true.

This would reduce the need to use inconvenient non-builtin dictionary types, and align documents to typical expectations of dictionary types.

Note: this is not a recommendation to stop preserving insertion order. Although I don't find that useful, it's not obviously harmful, and it should be possible to do both efficiently.

Additional discussion at http://groups.google.com/group/mongodb-user/browse_thread/thread/c908e572d6e89c49.



 Comments   
Comment by Daniel Sinclair [ 08/Oct/13 ]

I totally agree. Don't get me wrong, I don't want to have order dependent
code and right now it seems in conflict that the driver's don't preserve
order whereas the DB is order-important. As a conflict, it's either a
driver bug/oversight or a DB one.

Since it might hamper DB performance, and we can't guarantee order in the
languages, it sounds like most practical solution would be to consistently
order fields in all drivers and therefore by convention in the DB and
elsewhere.

Either way - you're absolutely right, it's a real issue that needs to be
addressed.

Comment by Glenn Maynard [ 07/Oct/13 ]

The issue isn't that it's completely impossible to preserve dictionary order. The problem is that it's easy to get lurking bugs in day-to-day user code. For example, it's easy to write

{"a": 1, "b": 2}

in Python, and without realizing it depend on the key order that Python's hashing happens to give you. You'll often get the same key order every time, so everything seems to work, but it's actually depending on the hash behavior of your current version of Python, so it'll break if Python changes its hash algorithm, and it won't work in any language except Python. I've done this myself without noticing what I was doing, even though I know about it, and it's very easy for this to slip through testing.

It's also painful to not be able to use dictionary literals in most languages.

It doesn't mean that it can't be done at all; drivers could easily guarantee a particular order in their BSON handling if that was wanted. The issue here is about sanity of user code, but the drivers can handle this however they want internally.

Comment by Daniel Sinclair [ 07/Oct/13 ]

Isn't the main issue that most languages don't preserve dictionary order?
If true, how would you manage that in the driver if field order remains
important? Presumably some mappings couldn't be supported?

Comment by Daniel Sinclair [ 05/Oct/13 ]

So can you ensure a particular field order from every language ? I'm
thinking specifically about dictionary mappings here

Comment by Eliot Horowitz (Inactive) [ 05/Oct/13 ]

Glenn, yes, unless in bson 2 we declare all bson document fields are sorted by key name.

Or say that bson 2 documents compare differently than 1 docs in this regard.

Comment by Glenn Maynard [ 05/Oct/13 ]

Just to be clear, James's idea of updating subdocuments in-place is what would require BSON format changes, not the change this bug is about.

Comment by Glenn Maynard [ 05/Oct/13 ]

James: That sounds like it would require significant changes to the BSON data format, so it doesn't sound like an optimization that would ever happen. (Even simpler fixes, like not storing the index of each item in BSON arrays, are hard to get any traction on, since changing the BSON format affects every driver and probably third-party software.)

Comment by James Blackburn [ 05/Oct/13 ]

In 2.6, we're actually changing the update code so that it never re-orders, which should guarantee that the order that documents come in stay in that order.
If we were going to change, to make things reasonably efficient, we'd probably have to sort all keys so that comparison are fast.
I'm not sure how people would feel about that, opinions here?

I've got one strong counter-argument to this: efficiency of small updates to large documents. As things currently stand updating a document dirties all the document's pages, even if only a single fixed-length field is changed. For arrays this is worse: ss the document size changes subsequent fields move. This means multi-document updates can cause significant unnecessary I/O.

If instead mutated fields, and variably-sized fields were stored/moved to the end of the document then the update code could be easily optimised to touch just the page(s) containing the mutated fields at the end of the document. If this happened dynamically during update then mongo would be much friendlier to the underlying storage.

Comment by Daniel Sinclair [ 04/Oct/13 ]

Actually, you're correct about JS - I misspoke, I guess I was just thinking
of the way my application code manipulates fields in collections. The order
isn't considered important enough to preserve and once it's been through a
few constructor and mapper functions I wouldn't like to guarantee that
fields were in a specific order either.

Your point about the drivers I agree with. Sorting is OK, but I'd much
rather NOT have order dependency, and I don't think it's currently enforced
anywhere except the DB, which is unfortunate.

Indexes though was how I first found this thread. I wanted to use an object
field index and I just wanted to make sure that it would work the way I
expected - it doesn't. I don't think you can make an exception for indexes,
they're would surely have to work the same way, no?

If we can't guarantee order of fields from the client, then the DB needs to
able to cope whether indexed or not. I appreciate however that it's
probably not as efficient to build an index that's order independent, which
is why I floated sorting the field order. In order to make indexes work
without a consistent, predictable, sorted field order would be to generate
the indexes with sorted fields and query indexed fields with sorted fields.
That sounds like it would suck, so the alternative would be to enforce a
predictable order always, and the only way I can see a driver can guarantee
that is by alphanumeric sort, unless there's something else I've not
considered?

Comment by Eliot Horowitz (Inactive) [ 04/Oct/13 ]

Glenn, correct.

JavaScript maintains order and has no notion of document equality.
That's why we originally went this way, to maintain compatibility with javascript.

Daniel, all the drivers support building order dependent documents, but agree its probably not ideal.

Its just not totally clear what the right path forward is.

Going to move this ticket to put it near the top of things to think about.

Comment by Glenn Maynard [ 04/Oct/13 ]

Sorting the keys is fine. In most languages, dictionaries are inherently unordered, so having keys come back in a different order is OK. I prefer this, actually, since it helps guarantee that I won't accidentally write code that depends on the ordering behavior of a particular language/driver.

For people depending on order-sensitivity or order-preservation, an option would be to make this a collection-level flag to enable and disable this. It would probably be OK if changing this flag after collection creation wasn't allowed, so supporting migrating the data wouldn't be required. For example, since this flag would change compound shard keys, this restriction means upgrading wouldn't cause lots of documents to be moved across shards. On upgrade, set this flag for all existing collections; users would need to copy data to a new collection if they want to migrate existing data.

system.indexes does depend on order-preservation, so that collection could use that flag, avoiding the need to restructure it.

> fundamentally broken. By not supporting equivalence the way JavaScript does

Actually, JavaScript is one of the only languages I've used that does always preserve dictionary order. Every other major language I've used doesn't unless you jump hoops. (I don't think JS has any concept of object equality.)

Comment by Daniel Sinclair [ 04/Oct/13 ]

Presumably you can have a "strict" or "fast" mode for those that don't want
to switch. Sorting would be valid, since there are no guarantees on order
already, but I think the impact of NOT having dictionary equivalence is
fundamentally broken. By not supporting equivalence the way JavaScript does
and many other languages/drivers is surely more of a risk. It means that
any slight change in layout of a language level dictionary could cause hard
to find, catastrophic but possibly intermittent errors in almost any query.
I'm using the C# driver and to be honest I'm astonished I've not run into
this before now.

If we can't guarantee on the order of fields out of the driver (and I don't
think we should BTW, JavaScript doesn't expect that) then surely it's a
house of cards waiting to fall over since you can't safely construct any
nested object query reliably at all?

If you were going to keep to an order, then I'd have said sorted
alphabetically would be a good option rather than the order presented, but
you'd have to enforce that in all the drivers since the language may not be
able to consider that. Sorting fields alphanumerically is a good option if
you want to hash them.

I'd vote firstly for equivalence

{a:1,b:2}

===

{b:2,a:1}

, failing that,
driver enforced alphanumerical, but I don't think it can remain as it is. I
think we gotta just take the hit soon before people start running into
scary production issues.

Comment by Eliot Horowitz (Inactive) [ 04/Oct/13 ]

We've gone back and forth on this one a lot over the last couple of years.

Many strong opinions on both sides.

In 2.6, we're actually changing the update code so that it never re-orders, which should guarantee that the order that documents come in stay in that order.

If we were going to change, to make things reasonably efficient, we'd probably have to sort all keys so that comparison are fast.
I'm not sure how people would feel about that, opinions here?

Also not sure how to migrate if we did want to change.

Comment by Daniel Sinclair [ 04/Oct/13 ]

I also think this is a problem;

db.xxx.drop();
db.xxx.insert({id: {bar:2, foo:1}});
db.xxx.insert({id: {foo:1, bar:2}}); // should be equivalent
db.xxx.find({ id: {bar:2, foo:1}}); // only returns the first one

"MongoDB does not make guarantees regarding the order of fields in a BSON document. Drivers and MongoDB will reorder the fields of a documents upon insertion and following updates."
http://docs.mongodb.org/manual/core/document/

Which is clearly only true for top-level fields. However, id is also considered a BsonDocument.

Comment by Glenn Maynard [ 27/Sep/12 ]

It happens everywhere subdocuments are compared against each other; from what I recall it's actually just a byte-for-byte comparison of the BSON data, without even looking at its contents.

Comment by Caleb Jones [ 27/Sep/12 ]

Note that this occurs with $addToSet:

> db.test.insert({ "_id" : 1, "array" : [ {"f1" : 1, "f2", 2} ] })
> db.test.find()
{ "_id" : 1, "array" : [ {"f1" : 1, "f2", 2} ] }
> db.test.update({ "_id" : 1 }, { "$addToSet" : { "array" : { "f2" : 2, "f1", 1 } } })
> db.test.find()
{ "_id" : 1, "array" : [ {"f1" : 1, "f2", 2}, { "f2" : 2, "f1", 1 } ] }

Comment by Glenn Maynard [ 18/Sep/12 ]

(Why is a bugtracker hideously image-izing smileys? Who seriously thought that was a good idea?)

Comment by Glenn Maynard [ 18/Sep/12 ]

Because it doesn't solve the problem.

Comment by Julien [ 18/Sep/12 ]

Hum ... when you say : "For example, if you insert a record manually in the shell, you need to remember to apply the same ordering" it's true, with AND without my solution ...

Comment by Glenn Maynard [ 18/Sep/12 ]

That's brittle: you have to make sure you apply something like this in every place you access the database. For example, if you insert a record manually in the shell, you need to remember to apply the same ordering. If you forget, or get it wrong (eg. if the sort order isn't identical for Unicode characters), you get subtle breakage. Avoiding brittleness is a major reason to want this in the first place.

This also doesn't work when setting subkeys of an object, rather than replacing the whole object.

The only way I can think of to fix this properly is a collection-level flag.

Comment by Julien [ 18/Sep/12 ]

Or a SON ctor from a dict, something like :

 
from operator import itemgetter
 
def dict2SON(values):               
    if not isinstance(values, dict):
        return values                        
    return SON(sorted([(key, dict2SON(value))
                       for key, value     
                       in values.items()],
                      key=itemgetter(0)))
 

Does it seems a good idea ?

Comment by Julien [ 18/Sep/12 ]

And what about providing a example code of an "Ordering key SON manipulator" in drivers ?
It's not intrusive, and solve the problem for those who want to use it.

Comment by Glenn Maynard [ 13/Sep/12 ]

It would have been good to do that from the start, but unfortunately, MongoDB currently guarantees preservation of dictionary order. Applications probably depends on that, so normalizing dictionaries would probably cause backwards-compatibility issues.

Comment by Julien [ 13/Sep/12 ]

I got an idea about this problem, what about ordering by key, server side, at insert time ? It would hide the "order sensitive" problem, leaving a natural, order insensitive database, without dropping any performance nor touching any loc of the storage engine ...

Generated at Thu Feb 08 03:07:41 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.