[SERVER-2065] Sharding on arrays Created: 05/Nov/10 Updated: 06/Dec/22 Resolved: 02/Dec/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | features we're not sure of |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | John Crenshaw | Assignee: | [DO NOT USE] Backlog - Sharding Team |
| Resolution: | Won't Do | Votes: | 2 |
| Labels: | tommaso-triage |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Sharding |
| Participants: | |
| Description |
|
Consider the following: a `conversation` collection in which each conversation document stores its participating users in an array. Displaying a list of a single user's conversations, with no further restrictions, is impossible without querying every shard. Even restructuring the collection doesn't fix the problem.

This can be worked around at the application level by creating and maintaining a separate collection with one entry per user, but that isn't very elegant. If the user needs to be able to arbitrarily filter their list of conversations, it gets worse, because most or all of the data needs to be available at query time, and therefore a large amount of data needs to be duplicated per user. When duplicating at the code level, it is necessary to create one duplicate per entry, regardless of whether the entries actually get placed on different shards, because they MAY get placed on different shards, or may be rebalanced to another shard later.

Sharding based on array contents would still require duplication sometimes, but it could be greatly reduced, and may not require any duplication at all if all entries in the array resolve to the same shard.

The logical implementation for this is actually fairly straightforward:

- Insert:
- Update:
- Delete:

I don't know for sure whether or not this would complicate re-balancing, but I don't think so. Unless I've missed something, you SHOULD be able to treat each array value as effectively distinct for this purpose: when you split the chunk, just split the records as needed. The catch here is that the actual gains may be somewhat unpredictable, especially when the split was triggered by high disk space use. In any case, it couldn't be any worse than having to duplicate everything all the time, even when it isn't needed.
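To make the scenario concrete, here is a minimal mongo shell sketch of the kind of layout described above and of the application-level workaround. The collection, database, and field names (`conversations`, `users`, `user_conversations`, `subject`) are illustrative assumptions, not taken from the original report:

```js
// Illustrative sketch only -- collection, database, and field names are
// assumptions, not taken from the original report.

// One document per conversation; the participants live in an array.
var convId = ObjectId();
db.conversations.insertOne({
    _id: convId,
    users: [101, 202, 303],              // participating user ids
    subject: "Example conversation",
    updated: new Date()
});

// The shard key cannot include the array field `users`, so listing one
// user's conversations is a scatter-gather query against every shard:
db.conversations.find({ users: 101 });

// Application-level workaround from the description: maintain a separate
// collection with one entry per (user, conversation) pair, sharded on `user`.
// (Assumes sharding is already enabled for the `app` database.)
sh.shardCollection("app.user_conversations", { user: 1 });

// Any field the user may filter on has to be duplicated into every entry,
// once per user in the array:
db.user_conversations.insertMany([
    { user: 101, conversation: convId, subject: "Example conversation", updated: new Date() },
    { user: 202, conversation: convId, subject: "Example conversation", updated: new Date() },
    { user: 303, conversation: convId, subject: "Example conversation", updated: new Date() }
]);

// A per-user listing (with arbitrary extra filters) is now a targeted,
// single-shard query:
db.user_conversations.find({ user: 101, subject: /Example/ });
```

Under the feature requested here, the first query could instead be routed by the array values themselves, and (per the description) a document would only need to be duplicated across shards when its array entries resolve to different shards.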
| Comments |
| Comment by Eliot Horowitz (Inactive) [ 28/Jan/12 ] |
|
@julian - you would never get more than 1 copy of a document |
| Comment by Julian Coleman [ 27/Jan/12 ] |
|
How would this affect find()? Would more than one result be returned when there is a match on more than one element in the array ($in)? Currently I duplicate entries for documents that are 'imitating' sharding on an array, but I have to identify the duplicates when searching the collection and remove them at the 'code level' (at least until distinct() is a cursor). Even if the gains in disk space are marginal, this could be a good idea to simplify searching as well. |
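For illustration, the client-side de-duplication described in this comment might look like the following mongo shell sketch; the `user_conversations` collection and `conversation` field are assumed names carried over from the sketch under the description, not details of Julian's actual setup:

```js
// Hypothetical de-duplication step -- names are assumptions.
// Duplicated entries for the same conversation carry the same `conversation`
// id, so collapse on that id after the $in query fans out to the shards.
var seen = {};
var unique = [];
db.user_conversations.find({ user: { $in: [101, 202] } }).forEach(function (doc) {
    var key = doc.conversation.toString();
    if (!seen[key]) {
        seen[key] = true;
        unique.push(doc);
    }
});
// `unique` now holds one entry per distinct conversation.
```

Per Eliot's reply above, native array sharding would make this step unnecessary, since a query would never return more than one copy of a document.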
| Comment by Mathias Stearn [ 16/Aug/11 ] |
|
If we decide to support this in the future, we will need to undo the restrictions in |