[DOCS-235] Shard Key Selection/Design Tutorial Created: 19/Jun/12  Updated: 30/Oct/23  Resolved: 28/Jan/16

Status: Closed
Project: Documentation
Component/s: manual
Affects Version/s: None
Fix Version/s: Server_Docs_20231030

Type: Task Priority: Major - P3
Reporter: William Zola Assignee: Unassigned
Resolution: Done Votes: 0
Labels: sharding
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:
Days since reply: 8 years, 2 weeks, 6 days ago

 Description   

Create a document in /tutorial/choose-a-shard-key

Link to it on sharding page and the tutorial.txt page

This should be the core place for information on sharding.



 Comments   
Comment by Allison Reinheimer Moore [ 28/Jan/16 ]

It would appear that this was done a while back!

https://docs.mongodb.org/manual/tutorial/choose-a-shard-key/

Comment by William Zola [ 19/Jun/12 ]

Wups! Not familiar with creating a ticket in Jira. Here's the text section again:

== Common Shard Key Design Patterns

Here are three common design patterns that folks often use when picking a shard key:

A) Hashed tag value

This pattern is used when you're only concerned with optimizing write performance. You take a known-unique property of the document it, run it through a hash function (MD5 is commonly used), store that as a property of the document, and use it as the shard key.

The main advantages of this pattern are guaranteed good Write Distribution and the ease of coding it. The main disadvantage of this pattern is that it guarantees that you will have poor Read Isolation on any range queries.

B) Common Search Key and Timestamp

In this pattern, you use a compound shard key. The first portion is a search key that your application will often use, and the second portion is a timestamp. Using the email messages example from above, your shard key might be something like "key:

{ email_id: 1, timestamp: -1 }

".

This pattern can often balance between good write performance and read performance, especially if data is coming in in timestamp order.

If there are enough distinct values for the low-cardinality key (4x the number of anticipated shards is a reasonable lower bound) then you will most likely get good write distribution. In addition, the presence of the timestamp will allow MongoDB to split the chunks with fine granularity if necessary. Finally, the presence of the timestamp will help Read Isolation – emails to a single user within a specific time range are likely to be in a single chunk, and therefore on a single shard.

The main disadvantage of this pattern is that you will have Scatter/Gather queries if you don't query by the common search key. For example, "find all emails sent two days ago by all users" will result in a scatter/gather query (since the email ID isn't part of the query).

C) Coarsely Increasing Counter and Common Search Key

This pattern also uses a compound shard key, but in a slightly different way. In this pattern, you use a coarsely-increasing counter as the first portion of the shard key. For example, instead of a timestamp with millisecond granularity, you might choose only the date (year-month-day) portion of the timestamp. If you're getting many inserts within a single day, this portion of the shard key will group them all into a single bucket.

For the second portion of the shard key, you'd use a frequently-used search key, as in the above pattern. Applying this pattern to the email example, you'd get something like "key:

{ts_day: -1, email_id: 1}

"

This pattern also balances between read performance and write performance but does so in a different way. Using the coarsely-increasing counter tends to group new documents into a single shard, but adding the search key at the end allows them to split into separate chunks and be distributed onto multiple shards. This allows for good Read Isolation, when your most common range query is by timestamp, but also allows the write load to be distributed if a single shard tends to get 'hot'.

In general, this pattern slightly optimizes read performance over write performance. The main disadvantage of this pattern is that you're likely to get a 'hot' shard for a while when the coarsely increasing counter rolls over to the next unit – for example, when the 'ts_day' shifts at midnight.

Generated at Thu Feb 08 07:38:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.