[SERVER-37293] Refactor Sorter so that DocumentSourceGroup can use it instead of SortedFileWriter implementation Created: 24/Sep/18  Updated: 21/Jun/19  Resolved: 21/Jun/19

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Dianna Hohensee (Inactive) Assignee: Daniel Solnik (Inactive)
Resolution: Won't Do Votes: 0
Labels: neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Gantt Dependency
has to be done after SERVER-17010 Reduce file handle usage in File base... Closed
Sprint: Execution Team 2019-06-17, Execution Team 2019-07-01
Participants:

 Description   

document_source_group.h/cpp skips the Sorter interface and re-implements some of the logic in order to use the SortedFileWriter class directly. This appear to have been done to avoid the sorting phase of Sorter because DocumentSourceGroup has already sorted data.

Therefore, we should expose an option on the Sorter to handle pre-sorted data, and remove the duplicate code from document_source_group.h/cpp. Additionally, document_source_group.h/cpp forces sorter.h to make internal structs public that would otherwise be private to sorter.cpp because it needs to use them directly. FileDeleter and SorterFileInfo (added by SERVER-17010) should be moved into the cpp file along with the interface change.



 Comments   
Comment by Dianna Hohensee (Inactive) [ 21/Jun/19 ]

This change would require significant query code changes, which also could potentially affect performance, so we are not going to do it. DocumentSourceGroup maintains a a private GroupMap (_groups) that DocumentSourceGroup::initialize() fills from a pSource and periodically spills to disk when it reaches a certain size. DocumentSourceGroup::spill() performs its own stable_sort on the contents of _group (that looks like it is sorted by an id) and then translates the _group data structure entries into a format for disk. Then DocumentSourceGroup::spill() uses a SortedFileWriter and SortedFileWriter::addAlreadySorted() to spill to disk. It would be difficult to move this functionality into the Sorter interface, and may hurt performance, as Dan described in more detail above.

Comment by Daniel Solnik (Inactive) [ 19/Jun/19 ]

It seems like the DocumentSourceGroup has two separate codepaths for depending on whether or not the groups were spilled to disk.
When the data is spilled to disk it is serialized (the accumulators put into their value form) and so it needs to be deserialized later.
However, if the data is never spilled to disk it is never serialized and so there is no need to deserialize it.
This can be seen in DocumentSourceGroup::spill() (where the data is serialized into value form) and in DocumentSourceGroup::getNext() (which uses a different iteration method depending on whether or not it needs to deserialize, either getNextSpilled or getNextStandard).
This class executes different code in the case when sorting spills to disk and when sorting can be done in memory, which the current Sorter interface doesn’t support.

One way to use the sorter interface in DocumentSourceGroup is by always serializing the accumulators and then adding the (id, accumulators) and using the sorter to sort by id and then deserializing the results from the sorter. However, this change would result in a possible perf hit when a spill to disk is not necessary (when the groups can be kept in memory) as currently this serialization is not done if the groups can be kept in memory. 

Generated at Thu Feb 08 04:45:34 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.