• Sharding

      Consider the following:

      conversation
      {
      users: [id1, id1, id3, ...],
      ...other conversation data...
      }

      Displaying a list of a single user's conversations with no further restrictions is impossible without querying every shard. Even restructuring the collection doesn't fix the problem.

      This can be worked around at the application level by creating and maintaining a separate collection with one entry per user, but that isn't very elegant. If the user needs to be able to arbitrarily filter their list of conversations, this gets worse, because most or all of the data needs to be available at query time, and therefore a large amount of data needs to be duplicated per user.

      When duplicating at the code level, it is necessary to create one duplicate per entry, regardless of whether they actually get placed on different shards, because they MAY get placed on different shards, or may get rebalanced to another shard later. Sharding based on array contents would still require duplication sometimes, but it could be greatly reduced, and may not require any duplication at all if all entries in the array resolve to the same shard.

      The logical implementation for this is actually fairly straightforward:

      Insert:
      1. look at the elements in the array and determine which shards are within range of any of the elements
      2. Insert the record on each shard

      Update:
      1. look up the complete sharded array from any copy of the record using the provided shard key
      2. If the sharded array is being modified, determine whether the list of shards it resides on will change, and remove from or insert to those shards as needed.
      3. Update the record on all shards

      Delete:
      1. look up the complete sharded array from any copy of the record using the provided shard key
      2. remove from all shards it resides on

      I don't know for sure whether or not this would complicate re-balancing, but I don't think so. Unless I've missed something you SHOULD be able to treat each value as effectively distinct for this. When you split the chunk, just split the records as needed. The catch here is that the actual gains may be somewhat unpredictable, especially when the split was inspired by high disk space use. In any case though, it couldn't be any worse than having to duplicate everything all the time, even when it isn't needed.

            Assignee:
            backlog-server-sharding [DO NOT USE] Backlog - Sharding Team
            Reporter:
            bugslayer John Crenshaw
            Votes:
            2 Vote for this issue
            Watchers:
            3 Start watching this issue

              Created:
              Updated:
              Resolved: