[SERVER-2065] Sharding on arrays Created: 05/Nov/10  Updated: 06/Dec/22  Resolved: 02/Dec/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: features we're not sure of

Type: New Feature Priority: Major - P3
Reporter: John Crenshaw Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Won't Do Votes: 2
Labels: tommaso-triage
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-3586 Array as shard key value should be pr... Closed
Assigned Teams:
Sharding
Participants:

 Description   

Consider the following:

conversation
{
users: [id1, id2, id3, ...],
...other conversation data...
}

Displaying a list of a single user's conversations with no further restrictions is impossible without querying every shard. Even restructuring the collection doesn't fix the problem.

This can be worked around at the application level by creating and maintaining a separate collection with one entry per user, but that isn't very elegant. If the user needs to be able to arbitrarily filter their list of conversations, this gets worse, because most or all of the data needs to be available at query time, and therefore a large amount of data needs to be duplicated per user.
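A minimal sketch of that application-level workaround, using a plain in-process Map to stand in for the separate per-user collection (in MongoDB this would be its own collection sharded on the user id; the names here are illustrative, not from the ticket):

```javascript
// Simulated per-user lookup structure: userId -> conversation ids.
// In a real deployment this would be a second collection, and every
// conversation write would have to update it as well.
const conversationsByUser = new Map();

function indexConversation(conv) {
  for (const userId of conv.users) {
    if (!conversationsByUser.has(userId)) conversationsByUser.set(userId, []);
    // One duplicated entry per participating user, whether or not the
    // users would have landed on different shards.
    conversationsByUser.get(userId).push(conv._id);
  }
}

indexConversation({ _id: "c1", users: ["u1", "u2"] });
indexConversation({ _id: "c2", users: ["u1"] });
console.log(conversationsByUser.get("u1")); // ["c1", "c2"]
```

This illustrates the duplication cost the reporter objects to: every conversation produces one index entry per participant, unconditionally.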

When duplicating at the code level, it is necessary to create one duplicate per entry, regardless of whether they actually get placed on different shards, because they MAY get placed on different shards, or may get rebalanced to another shard later. Sharding based on array contents would still require duplication sometimes, but it could be greatly reduced, and may not require any duplication at all if all entries in the array resolve to the same shard.

The logical implementation for this is actually fairly straightforward:

Insert:
1. Look at the elements in the array and determine which shards are within range of any of the elements.
2. Insert the record on each of those shards.

Update:
1. Look up the complete sharded array from any copy of the record using the provided shard key.
2. If the sharded array is being modified, determine whether the list of shards it resides on will change, and remove from or insert into those shards as needed.
3. Update the record on all shards.

Delete:
1. Look up the complete sharded array from any copy of the record using the provided shard key.
2. Remove the record from all shards it resides on.
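The routing step shared by all three operations can be sketched as follows. This is a toy model, not MongoDB internals: the shard ranges and helper names are hypothetical, and values are plain numbers rather than real shard-key values.

```javascript
// Illustrative chunk ranges: each shard owns a half-open key range.
const shards = [
  { name: "shard0", min: 0, max: 100 },
  { name: "shard1", min: 100, max: 200 },
];

// Which shards own at least one element of the array shard key?
// Insert writes one copy of the document to each shard returned here;
// delete removes the copy from each; update diffs the old and new
// target sets and inserts/removes on the shards that changed.
function targetShards(arrayKey) {
  const owners = new Set();
  for (const value of arrayKey) {
    for (const s of shards) {
      if (value >= s.min && value < s.max) owners.add(s.name);
    }
  }
  return [...owners];
}

console.log(targetShards([5, 7, 42])); // all elements on shard0 -> one copy
console.log(targetShards([5, 150]));   // spans both shards -> two copies
```

The second call shows the ticket's point about reduced duplication: a copy is needed per *owning shard*, not per array element, so an array whose elements all resolve to one shard needs no duplication at all.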

I don't know for sure whether this would complicate re-balancing, but I don't think so. Unless I've missed something, you SHOULD be able to treat each value as effectively distinct for this purpose: when you split the chunk, just split the records as needed. The catch is that the actual gains may be somewhat unpredictable, especially when the split was triggered by high disk usage. In any case, it couldn't be worse than having to duplicate everything all the time, even when it isn't needed.



 Comments   
Comment by Eliot Horowitz (Inactive) [ 28/Jan/12 ]

@julian - you would never get more than 1 copy of a document

Comment by Julian Coleman [ 27/Jan/12 ]

How would this affect find()? Would more than one result be returned when there is a match on more than one element in the array ($in)?

Currently I duplicate entries for documents that are 'imitating' sharding on an array, but I have to identify duplicates when searching the collection and remove them at the 'code level' (at least until distinct() is a cursor).
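The code-level deduplication described here can be sketched like this (hypothetical helper; assumes each copy of a document carries the same _id, as in the emulation above):

```javascript
// When the same document was written to several shards, a scattered
// query can return it once per shard; collapse duplicates by _id,
// keeping the first copy seen.
function dedupeById(results) {
  const seen = new Set();
  return results.filter((doc) => {
    if (seen.has(doc._id)) return false;
    seen.add(doc._id);
    return true;
  });
}

const merged = dedupeById([{ _id: "c1" }, { _id: "c2" }, { _id: "c1" }]);
console.log(merged.map((d) => d._id)); // ["c1", "c2"]
```

(Note Eliot's reply above: under the proposed feature the server would do this internally, so find() would still return at most one copy of each document.)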

Even if the gains in disk space are marginal, this could be a good idea to simplify searching as well.

Comment by Mathias Stearn [ 16/Aug/11 ]

If we decide to support this in the future, we will need to undo the restrictions in SERVER-3586.
