Core Server / SERVER-89845

Collection validation can interact poorly with multikey data containing equivalent values

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: None
    • Component/s: None
    • Assigned Teams: Storage Execution
    • Operating System: ALL
    • Sprint: Execution Team 2024-05-27

      Consider an index on {a: 1} and document {a: [NumberInt(5), NumberLong(5)]}. The two values for a are equivalent but have distinct types. This means that the two index keys generated for this document will be identical but with different type bits. But, since there can only be one index key (ignoring type bits) per record, one of these keys gets discarded. Thus we're left with just one index key with the type information from just one of the values – the type information from the other value gets lost. Query-wise this is okay since we don't allow multikey indexes to serve covered query plans. But it does have some implications when it comes to collection validation, in particular for data which was inserted on v4.4 or earlier and validated on 5.0+.
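
      For illustration, a minimal mongosh sketch of the data shape described above (the collection name is arbitrary):

          // Hypothetical reproduction of the data shape described above.
          // The two array elements compare as equal but have different BSON
          // types (int32 vs int64), so the index keys generated for them are
          // identical apart from their type bits, and only one key is stored.
          db.c.createIndex({a: 1})
          db.c.insertOne({a: [NumberInt(5), NumberLong(5)]})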

      One implication is that collection validation may need to run both phases when data of this form is present. (Usually, when there is no index corruption, only the first phase is required.) This is because when we hash the index keys, we do so with the type bits included. The implication is that collection validation ends up relying on the same type bits being preserved every time index keys are generated for a given document. (In particular, this changed in SERVER-47349 between v4.4 and v5.0, but more generally this is not a property which should be relied upon.) Connecting this back to the example above, say the document's index keys preserved the type for NumberLong(5) upon insertion, but collection validation preserves the type for NumberInt(5) when it regenerates the index keys. The first phase of collection validation will then report index inconsistencies since the keys hash differently, requiring the second phase to run. This is not reported to the user since it gets reconciled during the second phase, but it does require validation to do extra work.
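
      As a sketch of what this looks like from the outside, running plain validation against such a collection reports no problems, but both phases may have run (the exact output fields mentioned in the comment are illustrative):

          // Validation regenerates the index keys for each document and hashes
          // them, type bits included, into buckets. If the regenerated type bits
          // differ from those stored at insertion time, the first-phase hashes
          // disagree and the second phase has to run to reconcile them.
          db.c.validate()
          // Expected outcome: the result still reports valid: true with no
          // errors, since the mismatch is reconciled during the second phase.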

      The second implication is a caveat to the above in which we do end up falsely reporting an index inconsistency to the user. This requires us to fall into the case where we zero out buckets to remain under the memory limit. If that happens, the aforementioned reconciliation during the second phase cannot occur, since one of the incorrectly-non-zero buckets has been cleared. Thus, the remaining non-zero bucket will end up getting reported as either an extra or a missing index entry. Note that this was previously exacerbated by SERVER-86407, since we would incorrectly fall into this case much more easily than we should have.
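
      As an illustration only, the bucket-zeroing case can be made more likely by lowering validation's memory budget via the maxValidateMemoryUsageMB server parameter (the value shown is arbitrary, and the exact threshold behavior is an internal detail):

          // Hypothetically force validation to hit its memory limit sooner so
          // that first-phase buckets get zeroed out; the surviving non-zero
          // bucket can then be falsely reported as an extra or missing index
          // entry, as described above.
          db.adminCommand({setParameter: 1, maxValidateMemoryUsageMB: 1})
          db.c.validate()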

            Assignee: Unassigned
            Reporter: Gregory Noma (gregory.noma@mongodb.com)
            Votes: 0
            Watchers: 10
