Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-11991

Exclude cluster keys from shard keys in queries



    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Won't Fix
    • Affects Version/s: 2.4.8
    • Fix Version/s: None
    • Component/s: Sharding
    • Labels:
    • Operating System:


      Consider this use case:

      We have a global cluster with region-tagged shards (think NY and LA). We'd like to shard our data by region and another key so it's distributed across the shards in the local region. I believe we have a scheme to do this, but it's complicated by some of our optimized queries like this one:

      # Python
      # want to query distinct ages for a given group
      coll.ensure_index([('group', 1), ('age', 1)])
      query = {"group": group, "age": age}
      # Exclude the _id from the query to load the result from the index.
      fields = {"_id": 0, "age": 1}
      ages = coll.find(query, fields=fields).distinct("age")

      As you can see, this query is highly optimized to load the results from the covered index. Because the index is present on group and age, this query works great in a non-sharded environment.

      However, if we enable sharding on "region" and "group", we'll need to include the 'region' in the query to signal to mongos to use the local, fast shard(s). But if 'region' is in the query that goes to a shard, it will invalidate the use of the covered index (because region is not in that index).

      One could make sure the shard key is in every index, but that seems like a hack, and is conflating the primary purpose of the index with a secondary purpose (the 'region' key is irrelevant to the query once the shard is selected).

      Ideally, one would like a way to have a way to query in this way:

      use this query for selecting the shard:

              {"region": "NY", "group": "bankers", "age": {"$lt": 30}}

      but then use this query on each shard:

              {"group": "bankers", "age": {"$lt": 30}}

      I believe it would be sufficient in all cases to simply allow removing select shard keys from the query at mongos before it's relayed to the shards (in this case, the 'region' key).

      I propose the addition of a parameter to the find operation to provide keys to be used for routing only, something like:

      coll.find(query, shard_hint={'region': 'NY'})

      When a query like this is encountered, MongoDB will extend the query with the shard_hint when doing routing in mongos, but will use the simple 'query' in mongod.

      This implementation would be backward-compatible, and the mongos routing would still honor the query when doing routing. It would be an error to supply different values for the same key in both the query and shard_hint.




            backlog-server-sharding Backlog - Sharding Team
            jason.coombs@yougov.com Jason R. Coombs
            5 Vote for this issue
            11 Start watching this issue