[SERVER-2001] option to hash shard key Created: 25/Oct/10 Updated: 23/Nov/17 Resolved: 17/Oct/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 2.3.0 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Eliot Horowitz (Inactive) | Assignee: | Kevin Matulef |
| Resolution: | Done | Votes: | 39 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Participants: | |
| Comments |
| Comment by auto [ 20/Dec/12 ] |
|
Author: Kevin Matulef <matulef@10gen.com> (2012-12-20T20:59:05Z) Message: |
| Comment by auto [ 16/Oct/12 ] |
|
Author: Kevin Matulef <matulef@10gen.com> (2012-10-16T09:01:41-07:00) Message: |
| Comment by auto [ 16/Oct/12 ] |
|
Author: Kevin Matulef <matulef@gmail.com> (2012-10-16T03:08:17-07:00) Message: Queries on mongod must filter out documents that do not currently belong |
| Comment by auto [ 16/Oct/12 ] |
|
Author: Kevin Matulef <matulef@gmail.com> (2012-10-15T21:06:26-07:00) Message: Hashed index keys are expected to have empty field names. A previous commit |
| Comment by auto [ 16/Oct/12 ] |
|
Author: Kevin Matulef <matulef@gmail.com> (2012-10-16T00:26:20-07:00) Message: For ops that occur during a migration, mongod uses the isInRange |
| Comment by auto [ 15/Oct/12 ] |
|
Author: Kevin Matulef <matulef@10gen.com> (2012-10-15T13:12:18-07:00) Message: In sharding, given a key expression like {a : 1} or {a : -1} we |
| Comment by auto [ 15/Oct/12 ] |
|
Author: Kevin Matulef <matulef@gmail.com> (2012-10-14T23:25:12-07:00) Message: The KeyPattern class is an abstraction for defining more general |
| Comment by auto [ 15/Oct/12 ] |
|
Author: Kevin Matulef <matulef@gmail.com> (2012-10-14T20:51:00-07:00) Message: |
| Comment by auto [ 10/Oct/12 ] |
|
Author: Kevin Matulef <matulef@10gen.com> (2012-10-10T15:10:13-07:00) Message: This changes the top-level shard key validation to allow shard keys. It also adds some helper functions for |
| Comment by auto [ 10/Oct/12 ] |
|
Author: Kevin Matulef <matulef@10gen.com> (2012-10-10T14:59:42-07:00) Message: To allow more complex index types, modifiedRangeBound needs to make sense that don't have 1 or -1 for the |
| Comment by auto [ 10/Oct/12 ] |
|
Author: Kevin Matulef <matulef@10gen.com> (2012-10-10T12:50:46-07:00) Message: Validation of proper shard keys is now done at the top-level, so there will not have numeric field vals. |
| Comment by auto [ 04/Oct/12 ] |
|
Author: Kevin Matulef <matulef@gmail.com> (2012-10-04T14:40:47-07:00) Message: |
| Comment by Remon van Vliet [ 04/Oct/12 ] |
|
@Kevin Okay, makes sense. And yes both are debatable (the second issue is flat out broken really |
| Comment by Kevin Matulef [ 03/Oct/12 ] |
|
@remonvv, 3 and 3.0 currently compare as equal in our query language (try db.foo.insert({a: NumberLong(3)}) and db.foo.find({a: 3.0})), so we need to respect that (one could argue about whether it should be this way, but that's the way it is). Floating point issues were discussed extensively internally. Indeed, there are some tricky issues there. Floating point values that are too large to be represented as 64-bit ints will not be supported. {a:1, b:2} and {b:2, a:1} currently hash to different things. This is also consistent with our query language (try db.foo.insert({t: {a:1, b:2}}) and db.foo.find({t: {b:2, a:1}})). Again, debatable, but we're aiming for consistency. Thanks for the pointer to the incremental version of MurmurHash2! Will check it out. |
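For readers following along, a quick shell illustration of the equality semantics described above (the collection name `foo` is just an example, as in the comment):

```
// Numeric types are squashed in the query language, so a hashed key must
// hash NumberLong(3) and 3.0 to the same value.
db.foo.insert({a: NumberLong(3)})
db.foo.find({a: 3.0})            // matches the document inserted above

// Sub-document field order, however, is significant, so {a:1, b:2} and
// {b:2, a:1} are different values and may hash differently.
db.foo.insert({t: {a: 1, b: 2}})
db.foo.find({t: {b: 2, a: 1}})   // matches nothing
```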
| Comment by Remon van Vliet [ 03/Oct/12 ] |
|
@Kevin. The most common incremental implementation is MurmurHash2A: http://code.google.com/p/pyfasthash/source/browse/trunk/src/MurmurHash/MurmurHash2A.cpp. Whether or not it's relevant depends on how many cycles are burned on hashing in your implementation. I do not really understand why 3 and 3.0 should evaluate to the same hash. Wouldn't it be better to simply make sure the specification of BSON documents, where that is relevant, includes the type? Apart from the binary incompatibility there are other issues. A lot of (higher) integer numbers cannot accurately be represented by IEEE 754 floating point values, so such equality comparisons would have to include some sort of arbitrary epsilon range check to get properly squashed. On a related note, will {a:1, b:2} and {b:2, a:1} evaluate to the same hash as well? I suppose that issue is a bit broader than just |
| Comment by Kevin Matulef [ 02/Oct/12 ] |
|
The reason for the incremental computation is that there are certain values, such as the integer 3 and the floating point 3.0, that need to compare as equal, even though the binary that describes them is not the same. These values can exist in sub-documents or arrays. Thus, to compute the hash of an arbitrary BSON element, we need to recursively apply the hash function to sub-documents, squashing certain types as we go (the non-incremental way to compute this would involve rewriting the whole document with the types squashed, and then applying the hash, but that's not optimal). If you want to give a pointer to a good, incremental implementation of Murmur, I'd be happy to take a look. |
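To make the squash-then-hash idea more concrete, here is a deliberately simplified shell sketch. It is not the server's algorithm: `hex_md5` (the legacy shell's built-in MD5 helper) stands in for the real 64-bit hash, and only NumberLong is canonicalized.

```
// Illustrative only: rewrite numeric wrappers into a canonical double,
// recurse into sub-documents and arrays, then hash the canonical form.
// The server does this incrementally rather than materializing the
// rewritten document.
function canonicalize(v) {
    if (v instanceof NumberLong) {
        return v.toNumber();                  // squash long -> double
    }
    if (Array.isArray(v)) {
        return v.map(canonicalize);
    }
    if (v !== null && typeof v === "object") {
        var out = {};
        Object.keys(v).forEach(function (k) { out[k] = canonicalize(v[k]); });
        return out;                           // field order is preserved
    }
    return v;
}

function toyHash(doc) {
    return hex_md5(JSON.stringify(canonicalize(doc)));
}

toyHash({a: NumberLong(3)}) == toyHash({a: 3.0});          // true: types squashed
toyHash({t: {a: 1, b: 2}}) == toyHash({t: {b: 2, a: 1}});  // false: order matters
```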
| Comment by Remon van Vliet [ 02/Oct/12 ] |
|
@Kevin. Sound reasoning, thanks for the explanation |
| Comment by Kevin Matulef [ 02/Oct/12 ] |
|
@remonvv, yes, in the case of a collision, it will scan through the docs to find the exact match. With 64-bit hashes though, the number of collisions should be extremely small. One could indeed use an index that stores the hash as well as the original value, but that has the drawback that the index is larger (it also adds more complexity to the code). We're working on a more general index language, so that you could eventually make the hashed index also a covered index. But for now I think just storing the hash is a reasonable default. The hashed index type is primarily for sharding, but could also be used to index longer values, like long strings, with a smaller index (if you don't care about range queries). Potentially with a hashed index type we could also optimize the data structure under the hood to improve performance for random inserts. MurmurHash3 was (and is being) considered, but one issue is that we need to be able to compute the hash incrementally. Last I checked, the main implementation of Murmur was not written that way. It should be possible to implement, but we need any hash library we use to be tested and bulletproof. |
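As a small illustration of the non-sharding use case mentioned above (a compact index over long values), assuming the {field: "hashed"} index syntax described elsewhere in this ticket; the collection and field names here are made up:

```
// Only the 64-bit hash of `url` is stored in the index, keeping it small
// even for very long strings. Equality queries can use it; range queries
// over `url` cannot.
db.pages.ensureIndex({url: "hashed"})
db.pages.insert({url: "http://example.com/some/very/long/path", visits: 1})
db.pages.find({url: "http://example.com/some/very/long/path"})  // can use the hashed index
db.pages.find({url: {$gte: "http://a", $lt: "http://z"}})       // cannot; needs another index or a scan
```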
| Comment by Remon van Vliet [ 01/Oct/12 ] |
|
@Kevin, I understand, so the assumption is that the cardinality of the hashed field is so high that once you've done a lookup based on the hash value, the number of candidate documents matching that hash value is usually 1 or close to it? In other words, say I have documents A and B with a field F. If A.F and B.F are not equal but the hash values of A.F and B.F are equal, will it then scan through A and B to find the exact match? I was assuming the implementation would do something like (under the hood) decorate the index with the hashed field (e.g. {username:1} -> {_h:1, username:1}) and the documents with a _h value, and then proceed as usual. This, along with pre-splitting the chunks into ordered ranges of the 64-bit long hash, is what I currently do in our mapping library for our products at least, and it works pretty well and doesn't involve new special index types. There are (significant) downsides to this of course, data and index size increases being the big one, so this approach sounds better from that perspective. Are there any use cases for this index type other than sharding and perhaps index size reduction? I'm assuming the typical query speed advantage between index {username:1} and {username:"hashed"} is not that significant and only really starts to take effect on larger field values. Also, on a related note, please consider using a higher-throughput hash than MD5, such as Murmur. This would avoid needlessly burning CPU cycles on a heavyweight hash. Thanks for the info! |
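For context, a rough shell sketch of the app-side workaround described above (the database, collection, field name `_h`, and split points are placeholders; the actual hash is computed in the application's mapping layer):

```
// Shard on an application-maintained hash field instead of a special index type.
sh.enableSharding("mydb")
db.users.ensureIndex({_h: 1})
sh.shardCollection("mydb.users", {_h: 1})

// Pre-split the 64-bit _h range into ordered chunks before loading data
// (example split points only; a real setup would generate many more).
sh.splitAt("mydb.users", {_h: NumberLong("-4611686018427387904")})
sh.splitAt("mydb.users", {_h: NumberLong("0")})
sh.splitAt("mydb.users", {_h: NumberLong("4611686018427387904")})

// The application computes _h = hash(username) before every write and lookup.
db.users.insert({_h: NumberLong("1234567890123456789"), username: "alice"})
db.users.find({_h: NumberLong("1234567890123456789"), username: "alice"})
```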
| Comment by Kevin Matulef [ 28/Sep/12 ] |
|
@remonvv, the feature will not involve storing an extra "_hash" field in the documents, but it will involve indexing documents by the hash of a given field. In order to shard on the hashed value of the "username" field, for instance, you will first define a hashed index on the username field (e.g. db.coll.ensureIndex({"username" : "hashed"})), which supports equality queries on username. You can then declare { "username" : "hashed" } as your shard key. Equality queries for a particular username will be routed to the appropriate shard, while queries for a range of usernames will be scatter-gather. Under the hood, the hash values will be represented as 64-bit ints, and the range of possible 64-bit ints will be pre-split into chunks. However, the user shouldn't need to worry about the details of the implementation, or the actual hash values themselves. |
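Pulling that description together into one hedged end-to-end example (the database and collection names are illustrative; the commands follow the syntax given in the comment above):

```
// Create the hashed index, then use it as the shard key.
sh.enableSharding("mydb")
db.coll.ensureIndex({username: "hashed"})
sh.shardCollection("mydb.coll", {username: "hashed"})

// Equality on username is routed to the shard owning hash(username):
db.coll.find({username: "alice"})

// A range of usernames cannot be answered from the hash, so it is scatter-gather:
db.coll.find({username: {$gte: "a", $lt: "m"}})
```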
| Comment by Remon van Vliet [ 28/Sep/12 ] |
|
@Kevin, I see you've done some early work on this. Can you give us an update on the intended implementation? The commits I'm seeing for this issue seem to focus on implementing a special index type that supports hashing. I'm wondering how this ties in with |
| Comment by auto [ 18/May/12 ] |
|
Author: Kevin Matulef (matulef) <matulef@10gen.com> Message: |
| Comment by auto [ 09/May/12 ] |
|
Author: Kevin Matulef (matulef) <matulef@gmail.com> Message: |
| Comment by P Eger [ 13/Apr/12 ] |
|
Is this definitely going to make 2.1.X? |
| Comment by Andrew Armstrong [ 15/Mar/12 ] |
|
An interesting (off topic) feature from picking a shard key (such as Customer Id) would be the ability for MongoDB to later know that specific customer data is always on the same shard. So in theory later on, even in sharded clusters, MongoDB could do projections (even join-like behaviour) when querying customer-specific data, such as joining the Customer (with a specific Id) to their Purchase records etc. |
| Comment by Kevin Matulef [ 15/Mar/12 ] |
|
@Keaton, indeed, that's the idea, to allow users to specify a particular field, and have mongo shard automatically based on a hash of that field. It could be the _id field, which is unique to every document, and thus would distribute all the documents essentially randomly, or it could be another field like "username," which would spread out the users, but keep documents corresponding to the same user together. Depending on which field you hash, this could reduce or even eliminate the need for rebalancing (at a cost of losing some data locality). In your case, I think this is exactly what you want. This feature is in the works; it's just taking a while since it touches a lot of the code. |
| Comment by Keaton Adams [ 14/Mar/12 ] |
|
I have strong interest in using MongoDB for a highly active production data store, but am concerned about the rebalancing work of the current sharding process. What I use today is a hash distribution on a key value to auto load-balance the data upon the initial write, instead of having to regularly re-balance the cluster as data is loaded.

In this system data is copied in 24x7, in batches of 25K records at a time, with 50-60 concurrent writers. Total rows loaded is in the 200 to 250 million per day range, while users are running queries on the dataset. The current product is MVCC so writers don't block readers, and the table data is partitioned by a character string made up of a server name, timestamp and other identifiers to provide a high level of uniqueness with a very even distribution across the nodes in the cluster. The data is written to append-only compressed table partitions and is never updated after the initial write. To drop an entire "day" (200-250 million records), I simply drop the partition from the parent table, a maintenance operation that takes just a few seconds to complete.

The product I currently use supports both hash and random distribution, but the hash key is more useful because users can provide the key value on a search to very quickly pull back a targeted row.

Hash Distribution - With hash distribution, one or more table columns is used as the distribution key for the table. The distribution key is used by a hashing algorithm to assign each row to a particular segment. Keys of the same value will always hash to the same segment. Choosing a unique distribution key, such as a primary key, will ensure the most even data distribution. Hash distribution is the default distribution policy for a table. If a DISTRIBUTED clause is not supplied, then either the PRIMARY KEY (if the table has one) or the first column of the table will be used as the table distribution key.

Thanks. |
| Comment by Remon van Vliet [ 04/Nov/11 ] |
|
Having uniqueness be a mandatory property of hash values for the purpose mentioned here seems very limiting to me. Hashing in this way is only really useful for scenarios where people have pre-split/moved their data across a fixed number of available shards. As such, the hash should only have requirements that help facilitate determining the appropriate shard. Uniqueness is not mandatory or even helpful for this (other than the fact that it can then be used as the _id value). Generating a stable, cluster-unique hash based on some other piece of data that may not be unique across that same dataset is challenging at best anyway. My suggestion would be to make the feature more restrictive by limiting it to a fixed, reserved field name and type (a 32-bit signed integer seems perfectly fine to me; you need a predefined lower and upper bound for pre-splitting). For example, this is what we do: { ... }, where the _hash is generated by a hashing algorithm with one of the document fields as its input. I actually have a document with syntax and implementation suggestions if someone's interested. With this setup "_hash" is always your shard key, so it simplifies things quite a bit on the shard administration end of things too. By the way, "<1 day" seems like an overly optimistic estimate if you ask me |
| Comment by Dwight Merriman [ 02/Oct/11 ] |
|
If the hash is not guaranteed unique (probably the case), it may make sense to use just a 32-bit value field to store the hash value. I suppose another approach is to make it an MD5 and put it in _id and shard on _id. That may be most efficient, as an _id index will be around regardless, but it only works if the shard key is unique. |
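A small hedged sketch of that alternative, using the shell's built-in hex_md5 helper (the names are illustrative, and whether to truncate to 32 bits or keep the full digest is left open, as in the comment):

```
// Store the MD5 of the natural key in _id and shard on _id.
// As noted above, this only works when the hashed key is unique.
var username = "alice";
db.users.insert({_id: hex_md5(username), username: username})

sh.enableSharding("mydb")
sh.shardCollection("mydb.users", {_id: 1})

// Point lookups recompute the hash on the client side:
db.users.find({_id: hex_md5("alice")})
```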
| Comment by Grégoire Seux [ 22/Jun/11 ] |
|
Seems like a good idea: we implemented a hashed shard key on the client side, and the operation rate skyrocketed. |
| Comment by Remon van Vliet [ 04/May/11 ] |
|
Great, works for me. I'm assuming version 1 will guarantee near-perfect distribution for shard keys that are, or are partially composed of, type ObjectId? That seems the most common use case. |
| Comment by Eliot Horowitz (Inactive) [ 01/Apr/11 ] |
|
All information will be posted here. I don't think version 1 will have different hashing choices, just a good general one that produces well-distributed values. |
| Comment by Remon van Vliet [ 01/Apr/11 ] |
|
Is there a place where I can find information regarding the planned implementation/discussion of this feature? The lack of hashing-based read/write distribution amongst available shards is a huge issue for us now. We're actually considering implementing an app-side layer to do this, but that obviously has a number of serious drawbacks. Also, will we be able to select various hashing strategies? Apart from the ones already mentioned (or at least linear hashing), I think libketama's continuum strategy might be a good fit as well in certain situations (http://amarok.kde.org/blog/archives/363-libketama-a-consistent-hashing-algo-for-memcache-clients.html) |
| Comment by Andrew Armstrong [ 07/Mar/11 ] |
|
I think it's worth investigating an option to use linear hashing (http://dev.mysql.com/doc/refman/5.1/en/partitioning-linear-hash.html) or other consistent hashing (http://wiki.membase.org/display/membase/vBuckets) so that adding/removing nodes in the cluster does not necessarily impact how much rebalancing needs to take place. |