Some areas for improvement are:
- minimize code duplication, e.g. in the creation of the SamplingEstimatorImpl instances
- make the document generation process more robust, e.g. by allowing the caller to specify the document schemas of the collections under test
- fix typos, e.g. iniitalizeSamplingAlgoBasedOnChunk()
- use parse and fromjson to construct the MatchExpressions
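
As a rough illustration of the schema-driven document generation suggested above, here is a minimal, self-contained sketch. It is not based on the existing test code: `FieldType`, `FieldSpec`, `Document` and `generateDocuments()` are hypothetical names invented for this example, and a real version would presumably produce BSON documents rather than a `std::map`.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <random>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical types for illustration only; none of these exist in the codebase.
enum class FieldType { kInt, kDouble, kString };

struct FieldSpec {
    std::string name;
    FieldType type;
    int64_t minValue;  // lower bound for numbers, minimum length for strings
    int64_t maxValue;  // upper bound for numbers, maximum length for strings
};

using Value = std::variant<int64_t, double, std::string>;
using Document = std::map<std::string, Value>;

// Generates 'count' documents that conform to the caller-supplied schema.
// A fixed seed keeps the generated test data reproducible across runs.
std::vector<Document> generateDocuments(const std::vector<FieldSpec>& schema,
                                        std::size_t count,
                                        uint32_t seed) {
    std::mt19937 rng(seed);
    std::vector<Document> docs;
    docs.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        Document doc;
        for (const auto& field : schema) {
            std::uniform_int_distribution<int64_t> intDist(field.minValue, field.maxValue);
            switch (field.type) {
                case FieldType::kInt:
                    doc[field.name] = Value(intDist(rng));
                    break;
                case FieldType::kDouble: {
                    std::uniform_real_distribution<double> realDist(
                        static_cast<double>(field.minValue),
                        static_cast<double>(field.maxValue));
                    doc[field.name] = Value(realDist(rng));
                    break;
                }
                case FieldType::kString:
                    // Random length within the bounds; the content is a fixed filler character.
                    doc[field.name] =
                        Value(std::string(static_cast<std::size_t>(intDist(rng)), 'a'));
                    break;
            }
        }
        docs.push_back(std::move(doc));
    }
    return docs;
}
```

A test could then describe each collection declaratively, e.g. `generateDocuments({{"a", FieldType::kInt, 1, 10}}, 100, 42)`, instead of hard-coding the document shape inside the generator.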