[SERVER-42237] Allow shard key to be prefix of multikey index Created: 15/Jul/19  Updated: 06/Dec/22  Resolved: 28/Jun/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Renato Riccio Assignee: [DO NOT USE] Backlog - Sharding EMEA
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-20857 dataSize command requires shard key, ... Closed
Assigned Teams:
Sharding EMEA
Sprint: Sharding 2019-08-26
Participants:

 Description   

Currently it is not possible to use a prefix of a multikey index as a shard key.

For this reason, even if the shard key is {a:1} and we already have the index {a:1, b:1}, which is considered multikey ONLY because the field b is an array, we must also have an index on {a:1} or a non-multikey index where a is a prefix.

Allowing {a:1, b:1} to be used for sharding would save RAM, disk space, and write overhead.
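The restriction can be illustrated with a small sketch. The following Python snippet is a hypothetical model, not MongoDB's actual validation code (all function names here are invented for illustration); it captures the rule described above: an index can back the shard key only if the shard key is a prefix of it and the index is not multikey.

```python
# Minimal sketch (NOT MongoDB's actual code) of the shard key index
# selection rule described above: an index supports a shard key only if
# the shard key pattern is a prefix of the index key pattern and the
# index is not multikey.

def is_prefix(shard_key, index_key):
    """True if shard_key's fields are an ordered prefix of index_key's."""
    return list(index_key)[:len(shard_key)] == list(shard_key)

def find_shard_key_index(shard_key, indexes):
    """Return the first index usable for the shard key, or None.

    `indexes` is a list of (key_pattern, is_multikey) pairs, where a
    key pattern is a dict of field -> direction (insertion-ordered).
    """
    for key_pattern, is_multikey in indexes:
        if is_prefix(shard_key, key_pattern) and not is_multikey:
            return key_pattern
    return None

indexes = [
    ({"a": 1, "b": 1}, True),   # multikey only because b is an array
]
# With only the multikey index available, sharding on {a: 1} is rejected:
print(find_shard_key_index({"a": 1}, indexes))  # None

# Adding a plain {a: 1} index satisfies the requirement:
indexes.append(({"a": 1}, False))
print(find_shard_key_index({"a": 1}, indexes))  # {'a': 1}
```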



 Comments   
Comment by Pierlauro Sciarelli [ 28/Jun/22 ]

Thanks to max.hirschhorn@mongodb.com for pointing out that the query subsystem is responsible for deduplicating by RecordId when the IndexScan is scanning over a multikey index. The bug mentioned above does not exist; closing the ticket.

Comment by Pierlauro Sciarelli [ 07/Jun/22 ]

Reopening the ticket: although it is not allowed to shard a collection that has only multikey indexes, the following faulty flow exists:

  1. shardCollection succeeds because there is a proper index
  2. The original shard key index is dropped, and only a multikey index with the shard key as prefix remains available
  3. From now on, some sharding machinery (e.g. auto-splitter and migration cloning) starts using the multikey index

(check usages of requireSingleKey)

A shard key index scan has to skip over these extra index entries and de-duplicate them to ensure that each relevant document isn't handled repeatedly.

Provided that this observation is still true, it would mean we have a bug, since we are not handling this case in the code.
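The flow above hinges on how a scan over a multikey index behaves. Below is a minimal Python sketch (hypothetical; the server's actual index scan is C++ inside the query subsystem) of why de-duplication by RecordId is needed: a document whose array field has N elements contributes N index entries, so a range scan on the shard key prefix can yield the same document several times.

```python
# Sketch (hypothetical, not the server's code) of why a scan over a
# multikey {a:1, b:1} index must de-duplicate by RecordId: a document
# whose "b" field is an array gets one index entry per array element.

def multikey_entries(doc_store):
    """Build (a_value, record_id) index entries for an {a:1, b:1} index."""
    entries = []
    for record_id, doc in doc_store.items():
        b = doc.get("b")
        values = b if isinstance(b, list) else [b]
        for _ in values:               # one entry per array element
            entries.append((doc["a"], record_id))
    return sorted(entries)

def scan_shard_key(entries, a_value):
    """Scan entries matching a_value, de-duplicating by RecordId."""
    seen, out = set(), []
    for a, rid in entries:
        if a == a_value and rid not in seen:
            seen.add(rid)
            out.append(rid)
    return out

docs = {1: {"a": 5, "b": [10, 20, 30]}, 2: {"a": 5, "b": 7}}
entries = multikey_entries(docs)
print(len(entries))                 # 4 entries for only 2 documents
print(scan_shard_key(entries, 5))   # [1, 2] -- each document once
```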

Comment by Ratika Gandhi [ 05/Sep/19 ]

Good idea, but there are other higher-priority items on the backlog to tackle.

Comment by Kaloian Manassiev [ 02/Aug/19 ]

To be discussed at the next eng/product sync meeting.

Comment by Kevin Pulo [ 30/Jul/19 ]

Historically, this restriction exists because in older versions the entire index was marked multikey, and it wasn't possible to tell whether that was because of field a or field b. I believe that is no longer the case in current versions, so this should be possible.

To expand on this a little, there are two issues here.

  1. A prefix index on fields A + B (even when B is non-multikey) will be less efficient than an exact match shard key index on just A, because the index entries include values for the other fields (B), making them bigger. In extreme cases this can make the index much bigger, and can slow down the shard key index scan. The index might also have more contention if the B values are updated often.
  2. Multikey indexes make this problem worse, because now each document (with a single A value) has multiple "irrelevant" index entries as a result of the array values on the B field. A shard key index scan has to skip over these extra index entries, and de-duplicate them to ensure that each relevant document isn't handled repeatedly. I believe the query subsystem should handle this, but it's extra work, making the scan slower. The index will also be bigger as a result of these extra index entries. The exact match shard key index on just A would not have these extra index entries (in addition to not having any of the B values at all, as above).
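The entry-count blowup in point 2 can be quantified with a toy count (illustrative Python with invented numbers, not a measurement of real index sizes):

```python
# Toy illustration of point 2: a multikey {a:1, b:1} index stores one
# entry per element of b, so with 4-element arrays it holds 4x as many
# entries as an exact-match {a:1} shard key index over the same data.

def index_entry_count(docs, multikey_field=None):
    """Count index entries; a multikey field adds one entry per element."""
    total = 0
    for doc in docs:
        v = doc.get(multikey_field)
        total += len(v) if isinstance(v, list) else 1
    return total

docs = [{"a": i, "b": [0, 1, 2, 3]} for i in range(1000)]
print(index_entry_count(docs))        # {a:1} index: 1000 entries
print(index_entry_count(docs, "b"))   # {a:1, b:1} index: 4000 entries
```

Each of those extra entries also carries a b value, so the per-entry size grows on top of the entry count.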

Another aspect is that it's currently possible to get into this state by sharding with a non-multikey A + B index, which then later becomes multikey on B. The inverse situation of sharding with an A + (multikey)B index is not possible, despite being equivalent. Similarly, it's possible to shard with an A index and A + (multikey)B, and then drop the A index.

The issue here is that these combinations of factors, mixed with other relevant factors (the rest of the query workload; index, data, and storage sizes; etc.), mean it may be hard to know the exact circumstances under which having only the A + (multikey)B index would be an overall win compared to having both the A + (multikey)B index and the solo A index (and vice versa, where it would be an overall loss). That doesn't necessarily mean we shouldn't do it; I just want to make sure we're aware of the issues here.

Generated at Thu Feb 08 04:59:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.