-
Type: Improvement
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: Aggregation Framework
-
None
-
Query Integration
-
Fully Compatible
Background
In addressing a legitimate performance concern (excessive RTTI calls), SERVER-90980 introduced the DocumentSourceType enum class, which is a centralized list of all possible DocumentSource child classes, along with the pure virtual DocumentSource::getType() member function (which returns DocumentSourceType).
Problem
Unfortunately, this approach is a modularity anti-pattern(*), and a backwards step for a more modular core server, because it tightly couples the base DocumentSource with the FULL SET of all its child classes. (Normally the coupling is only between each individual child class and its base class, ie. the child classes remain independent of one other.) Having such a fixed list of possibilities defeats many of the advantages of using an object-oriented pattern in the first place.
We should be trying to remove these fixed lists from the server codebase wherever possible, and avoid adding any new occurrences.
Specifically:
- It's now harder to add new DocumentSources, because doing so requires also updating the DocumentSourceType enum in document_source.h. In particular:
- Doing so concurrently by multiple engineers can cause merge conflicts in or around the enum, decreasing developer productivity. Rebasing of feature branches, and backports, are similarly impacted — code changes that are logically independent (eg. the addition of two completely unrelated DocumentSources), and which therefore should rebase / cherry-pick cleanly, instead end up requiring manual effort.
- Adding a new DocumentSource child class should only require compiling that new class, and then relinking the binaries. However, updating the enum in document_source.h means that all DocumentSources will have to be recompiled, along with a significant fraction of the rest of the server (aside from child classes, document_source.h is included by 36 other cpp files and 31 other header files).
This results in increased actual development costs, both in developer time (because local incremental builds take longer than they should), and in build farm CPU compute time / cost. The single common source location also negatively impacts build artefact caching, which means that even "full" builds (eg. patch builds and commit queue builds, as opposed to local incremental developer workstation builds) will be slower as a result.
This could be improved by moving the enum into its own header file (and having just a forward declaration in document_source.h), but the enum will still couple all the DocumentSources together, requiring all DocumentSources to be recompiled whenever the set of DocumentSources changes. So this would only be a partial solution, and it would be better to properly fix both this aspect, as well as those above / below.
- It's now harder to do experimental / exploratory work on the server, where additional experimental aggregation stages are temporarily added to the server. Previously, this was possible just by linking in the extra DocumentSource child classes at link time (since DocumentSource registration is done dynamically at server startup). With DocumentSourceType, adding even a trivial agg stage now requires either a full custom build, or else mucking around with divergent DocumentSourceType enum class definitions, and/or dodgy static_cast<DocumentSourceType>s.
- Doing so concurrently by multiple engineers can cause merge conflicts in or around the enum, decreasing developer productivity. Rebasing of feature branches, and backports, are similarly impacted — code changes that are logically independent (eg. the addition of two completely unrelated DocumentSources), and which therefore should rebase / cherry-pick cleanly, instead end up requiring manual effort.
- It's now harder to reuse a subset of DocumentSources, or the DocumentSource framework itself, outside the server. This is a problem because it encourages continued proliferation of competing alternate aggregation execution engines, which is detrimental to a consistent user experience.
- There are no automated tests to check that each DocumentSourceFoo::getType() returns the correct enum value. Since new DocumentSources are often added by copy-pasting from a similar pre-existing DocumentSource, this becomes yet another piece of boilerplate that will cause subtle bugs if it is accidentally not updated. (This even happened during the
SERVER-90980code review.)
Solutions
A unique numeric id could be statically assigned to each DocumentSource child class. However, even if there was no single centralized list of ids, there would still be conflicts when independent DocumentSources are added concurrently by different developers. It would also still be difficult or impossible to choose guaranteed-unique ids for any experimental "out of tree" DocumentSource classes.
Since the purpose is to be able to quickly and easily determine the type of a given arbitrary DocumentSource object (without using dynamic_cast), the actual specific id values don't matter. They don't need to be permanently assigned or globally unique. Rather, they just need to be unique within a given server process, throughout its lifetime. This suggests using a dynamic id allocation method, similar to the dynamic registration system already in place for DocumentSources.
Such a system is straightforward and not overly complex. The required boilerplate is similar, and doesn't require changes beyond the DocumentSource child class's own h and cpp files. Since the class ids are dynamically assigned, and so not constexpr, they can't be used as case targets inside switch statements. However, the switch statements can easily be converted into if-elseif ladders containing only integer comparisons, which are only negligibly more expensive than a compile-time jump-table. (The important thing is that there remains only 1 virtual method dynamic dispatch, and no RTTI.)
- is related to
-
SERVER-99422 Generalize DocumentSource::Id and ALLOCATE_DOCUMENT_SOURCE_ID
- Backlog