Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-98022

Improve affected index determination algorithm

    • Type: Icon: Improvement Improvement
    • Resolution: Unresolved
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • None
    • Query Execution

      While working on SERVER-76390 we realized that we can likely optimize the algorithm to determine which secondary indexes are affected by an update.

      The indexes to update are currently determined by this method

      IndexSet IndexUpdateIdentifier::determineAffectedIndexes(const Diff& diff) const:
      

      That method takes an document Diff and walks through all contained changes recursively.

      For each change found, the following method is called:

      void IndexUpdateIdentifier::determineAffectedIndexes(const FieldRef& path,
                                                           IndexSet& indexesToUpdate) const;

      This method walks over all paths backed by any index:

          for (const auto& indexedPath : _canonicalPathsToIndexSets) {
              if ((indexesToUpdate & indexedPath.second) == indexedPath.second) {
                  // Already handled all the indexes for which we have canonical paths.
                  // This is an important speedup.
                  continue;
              }
              if (FieldRef::pathOverlaps(path, indexedPath.first)) {
                  dassert(indexesToUpdate.size() == indexedPath.second.size());
                  indexesToUpdate |= indexedPath.second;
              }
          }

      This can be quite expensive for large document Diffs objects.

      I think this change would benefit when the changes are on the first level fields too by replacing iteration over all indexes with a dictionary lookup.

      For example, consider the case in which the document Diff contains changes to top-level fields, e.g.

      {"foo":1,"bar":2,"baz":3}

      For every top-level field in the document diff, we are currently iterating over all available indexes. We could instead do a targeted lookup for indexes covering the respective field, which would avoid iterating over non-matching indexes.

      As another example, consider the case in which the document Diff contains changes to multiple nested fields, e.g.

      {"name":{"first":..., "middle":..., "last":...}}

      In this case we have 3 nested fields underneath the top-level `name` field.

      Currently we need to loop over all available indexes for each nested field, even if there is no index on `name` or any of its nested fields. We likely waste CPU cycles here.
       

      It could be more efficient to store the index field information in a hierarchical structure, so that when iterating over a document diff, we

      • can perform a quick lookup for indexes covering the top-level field from the document diff
      • only recurse deeper in case into nested field diffs in case we found a match on the level above

      For example, if we the index field information would be stored in a way that allows us to check if there is an index on the `name` field or any of its nested fields, then we only need to recurse into the diffs for "name" if such index exists. If no such index exists, there is no need to recurse, and we can skip the checks for all the nested fields. Potentially we will need to allow a quick look for each part of an index field path:

      For example, let's assume there are the following indexes:

      {"a":1}
      {"foo.bar":1}
      {"foo.baz":1}
      {"baz.qux":1}

      Let's assume there is the following document diff:

      {"b":2,"a":3","d":{"foo":1,"bar":2},"foo":{"bar":3}}

      We could now store the index field information as follows:

      "a"            => {"a":1}
      "foo" => "foo" => {"foo.bar":1}
      "foo" => "baz" => {"foo.baz":1}
      "baz" => "qux" => {"baz.qux":1}

      When processing the document diff, we can act as follows:

      {"b":2}
      // check if top-level field "b" is indexed. As it isn't, we can skip this field.
      
      {"a":3}
      // check if top-level field "a" is indexed. It is, so we add index {"a":1} to the result.
      
      {"d":{...}}
      // check if top-level field "d" is indexed. As it isn't, we can skip this field and all its nested fields ("foo", "bar").
      
      {"foo":{"bar":3}}
      // check if top-level field "foo" is indexed. It is, so we recurse into the diff and check if the nested field "foo.bar" is indexed. It is, so we add index {"foo.bar":1} to the result.

      In order to facilitate that, we will need a recursive data structure sorted or hashed by field name parts. Which exact data structure to use for this will need to be evaluated.

      Note: if there are text indexes present, we can not perform this optimization. The reason is that the index determination algorithm for text indexes is different to the algorithm for regular indexes.

            Assignee:
            Unassigned Unassigned
            Reporter:
            jan.steemann@mongodb.com Jan Steemann
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated: