[SERVER-10220] Support hashed fields in compound indexes and compound shard keys Created: 16/Jul/13  Updated: 06/Dec/22  Resolved: 23/Jan/20

Status: Closed
Project: Core Server
Component/s: Index Maintenance, Sharding
Affects Version/s: None
Fix Version/s: 4.3.3

Type: Improvement Priority: Major - P3
Reporter: daniel.roberts@10gen.com Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Done Votes: 42
Labels: indexing, sharding
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-11338 GridFS hash base sharding support Closed
is duplicated by SERVER-32657 Sharding GridFS has write bottleneck Closed
Assigned Teams:
Sharding
Backwards Compatibility: Fully Compatible

 Description   

Provide the ability to include hashed fields in compound indexes.

For example:

db.collection.ensureIndex({a : 'hashed', b : 1})

Required for compound shard keys where one of the fields needs to be hashed for even distribution across the cluster.
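
For illustration only, the requested compound hashed index and shard key would be declared roughly as follows ('mydb.collection' is a placeholder namespace):

db.collection.createIndex({a : 'hashed', b : 1})
sh.shardCollection('mydb.collection', {a : 'hashed', b : 1})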



 Comments   
Comment by Craig Homa [ 23/Jan/20 ]

Hey louisa.berger, this was done as part of the Compound Hashed Shard Key epic (PM-241), which will be included in 4.4. Please let the Query team know if you have any questions.

CC bernard.gorman

Comment by Louisa Berger [ 23/Jan/20 ]

craig.homa Is this planning to be included in 4.4?

Comment by John Page [ 15/Mar/18 ]

This would also benefit from allowing you to severely restrict the number of bits/range of values in the hash, so that

{userid:"Hashed:500"}

would allow the hash to fall between 0 and 499.

This avoids the issue of random values in B-trees blowing out your I/O and also allows active management of chunk moves when provisioning new servers.
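
(The "Hashed:500" syntax above is a proposal in this comment, not an implemented feature. A rough application-side sketch of the same effect, using hypothetical collection and field names, is to precompute a bounded bucket value at write time and shard on it with a ranged key:)

function bucketOf(str, buckets) {
    var h = 0;
    for (var i = 0; i < str.length; i++) {
        h = (h * 31 + str.charCodeAt(i)) | 0;  // simple 32-bit rolling hash
    }
    return Math.abs(h) % buckets;              // buckets = 500 gives values 0..499
}
db.events.insertOne({userid: "u123", bucket: bucketOf("u123", 500)});
sh.shardCollection("mydb.events", {bucket: 1, userid: 1});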

Comment by Adam Flynn [ 29/Jun/15 ]

This feature is high on our wishlist as well. We have a number of collections that naturally shard by a key like user ID (ObjectId). In many of these collections, the number of documents per user is typically small but technically unbounded (often monotonically growing). The largest/oldest users in these cases can create jumbo chunks.

To prevent these rare jumbo chunk cases, we need to add more granularity to the shard key, say _id. But since most writes happen for new users, we need user_id to be hashed for even write distribution. So, our ideal shard key would be {user_id: "hashed", _id: "hashed"} or {user_id: "hashed", _id: 1} (limiting compound indexes to a single hashed key would be fine in this use case, since user_id has enough cardinality that _id won't materially impact write distribution).

Right now, our workaround options are:

  1. Use {user_id: "hashed"} and hope we don't see jumbo chunks, which is obviously dangerous.
  2. Use {user_id: 1, _id: 1} and try to manage the hot chunks, which ranges from a small annoyance to unfeasible depending on write volume & distribution.
  3. Use {hashed_user_id: 1, _id: 1} and store a hash in the document, which turns user_id queries into scatter/gathers (expensive at 40+ shards) or requires hashed_user_id in every query spec (annoying to clutter the application with this, especially in cases fetching multiple users).

Putting this feature in MongoDB would let us side-step a lot of jumbo chunk problems without a lot of application overhead or write distribution issues.
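
For reference, the compound hashed shard key support that eventually shipped in 4.4 covers exactly this shape (a single hashed field per compound key). With hypothetical namespace and field names, the ideal key described above would be declared roughly as:

sh.shardCollection("app.user_events", {user_id: "hashed", _id: 1})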

Comment by Gagan Jain [ 01/Jun/15 ]

Hi Mongo team,

Any ETA on this?

Thanks & regards,
Gagan

Comment by Nic Cottrell (Personal) [ 18/Aug/14 ]

I'd love to have this feature. I have a collection which is a corpus of extracted sentences. It has a "t" field containing long text (>512 characters), often Arabic etc., so it is too long for a normal index (given the new 1024-byte hard limit on index keys), and also an "lc" (language code) field. It would save a lot of BSON processing if I could have an index on

{t:"hashed", lc:1}
