- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Labels: None
Statistics are one of the main memory expenses for data handles. Under the dsrc_stats heading in dist/stat_data.py there are roughly 100 stats, which I believe cost 8 bytes each. Of these, about 70 look to be btree related, 10 are for LSM, and 23 are for cursor ops. Tiered storage may need its own set of statistics. When we consider having many thousands of MongoDB collections, each with multiple indexes, these numbers add up. And given its life cycle, a data handle may persist for some time after activity in a collection goes dormant.
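To make the scale concrete, here is a back-of-the-envelope sketch in C. The helper name and the workload numbers are hypothetical; the 8 bytes per stat is the guess stated above, and 103 is simply the sum of the rough per-group counts (70 + 10 + 23).

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a WiredTiger function): estimate the bytes
 * spent on per-handle statistics, assuming one data handle for each
 * collection plus one for each of its indexes.
 */
static uint64_t
stats_footprint_bytes(uint64_t collections, uint64_t indexes_per_collection,
    uint64_t stats_per_handle, uint64_t bytes_per_stat)
{
    /* One handle for the collection itself plus one per index. */
    uint64_t handles = collections * (1 + indexes_per_collection);

    return (handles * stats_per_handle * bytes_per_stat);
}
```

Under those assumptions, 10,000 collections with 3 indexes each gives 40,000 handles, and `stats_footprint_bytes(10000, 3, 103, 8)` comes to 32,960,000 bytes, or roughly 33 MB of counters.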
We might consider some ways to restructure this. One idea is to store stats not directly in the data handle, but in the "associated" data structure: put btree statistics in the btree, LSM statistics in the LSM struct, and so on. The cursor stats could perhaps (or perhaps not) stay in the dhandle. That in itself is not a huge win, but consider that btrees may be closed somewhat in advance of their associated dhandle, so the memory could be released sooner. We'd need to do some perf analysis to see if this is worth it.
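A minimal sketch of this first idea, in C. All struct and function names here are hypothetical stand-ins rather than the real WiredTiger types, and the counter counts are the rough figures quoted above. The point is only that stats owned by the btree go away when the btree is closed, while the handle keeps its cursor stats.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stat groups, sized from the rough counts in this ticket. */
struct btree_stats { int64_t counters[70]; };
struct cursor_stats { int64_t counters[23]; };

/* Hypothetical btree: it owns its own statistics. */
struct btree {
    struct btree_stats stats;
    /* ... other btree state would live here ... */
};

/* Hypothetical data handle: only cursor stats stay with the handle. */
struct data_handle {
    struct btree *btree; /* NULL once the tree has been closed. */
    struct cursor_stats cursor_stats;
};

static int
dhandle_open_btree(struct data_handle *dh)
{
    dh->btree = calloc(1, sizeof(*dh->btree));
    return (dh->btree == NULL ? -1 : 0);
}

/* Closing the btree releases its stats; the handle itself lives on. */
static void
dhandle_close_btree(struct data_handle *dh)
{
    free(dh->btree);
    dh->btree = NULL;
}

/* Bytes of statistics currently held on behalf of this handle. */
static size_t
dhandle_stats_bytes(const struct data_handle *dh)
{
    return (sizeof(dh->cursor_stats) +
        (dh->btree != NULL ? sizeof(dh->btree->stats) : 0));
}
```

In this sketch a dormant handle whose btree has been closed holds only the 23 cursor counters. Whether that saving matters in practice depends on how long handles outlive their btrees, which is what the perf analysis would need to show.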
Another thought is to replace what is currently a fixed array of stats with a small bitmap and a pointer to an array. The bitmap says which groups of stats are present in the array: LSM? ColumnStore? Tiered? Compression? The array is then sized accordingly. Via some fancy indirection, statistics accesses could be nearly as cheap as they are now. This idea is appealing as a smaller project, as the changes could be quite isolated. Most MDB Btrees would save about 15 entries (column store and LSM not needed), and it would give us more freedom to add statistics for other projects.
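The bitmap-plus-pointer layout might look something like the C sketch below. Everything in it is an assumption for illustration: the group names, the per-group sizes (chosen so that column store plus LSM add up to the 15 entries mentioned above), and the use of a "discard" slot so that updates to an absent group need no error handling at the call site.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stat groups; a real grouping would come from dist/stat_data.py. */
enum stat_group { GRP_BTREE, GRP_CURSOR, GRP_LSM, GRP_COLSTORE, GRP_TIERED, GRP_COUNT };

/* Illustrative counters per group, not the real values. */
static const uint16_t group_size[GRP_COUNT] = {65, 23, 10, 5, 8};

struct compact_stats {
    uint32_t present;         /* Bit g set: group g is stored in values. */
    uint16_t base[GRP_COUNT]; /* Start offset of each present group. */
    size_t count;             /* Number of counters allocated. */
    int64_t *values;          /* Counters for the present groups only. */
    int64_t discard;          /* Sink for updates to absent groups. */
};

/* Size the array for the requested groups and record each group's offset. */
static int
compact_stats_init(struct compact_stats *s, uint32_t present)
{
    size_t total = 0;

    for (int g = 0; g < GRP_COUNT; g++) {
        s->base[g] = (uint16_t)total;
        if (present & (1u << g))
            total += group_size[g];
    }
    s->present = present;
    s->count = total;
    s->discard = 0;
    /* Allocate at least one slot so an empty bitmap still yields a pointer. */
    s->values = calloc(total == 0 ? 1 : total, sizeof(int64_t));
    return (s->values == NULL ? -1 : 0);
}

/*
 * Return the slot for counter idx of group g: one bit test and one add
 * on top of a plain array access. Absent groups map to the discard slot,
 * so callers can increment unconditionally; readers should treat absent
 * groups as zero rather than reading the discard slot.
 */
static int64_t *
compact_stats_slot(struct compact_stats *s, enum stat_group g, uint16_t idx)
{
    assert(idx < group_size[g]);
    if (!(s->present & (1u << g)))
        return (&s->discard);
    return (&s->values[s->base[g] + idx]);
}
```

With only `GRP_BTREE` and `GRP_CURSOR` set, this sketch allocates 88 counters instead of 111, at the cost of a few dozen bytes of bookkeeping per handle. Sharing the offset table between handles with the same bitmap would reduce that overhead further, but that is a design choice this sketch does not explore.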