Investigate/improve sampling algorithm

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Query Execution

      The "equal step" sampling algorithm introduced by SERVER-114237 does not perform well for very skewed Timestamp distributions.

      For example, the existing test TruncatesAreOnlyAfterAllDurableReplicatedTruncates in src/mongo/db/change_stream_pre_images_remover_test.cpp fails when the "equal step" sampling algorithm is enabled.

      The reason is that the test creates several documents with Timestamp values close to Timestamp(1, 0), and then bumps the majority-committed Timestamp to Timestamp(4294969, 2) and creates another document.
      It then expects the pre-images collection to be cleared after a few invocations of the pre-images removal job.

      The test passes when the "equal step" sampling algorithm is not enabled. In that case the entire pre-images collection is scanned, so the information about the Timestamp distribution inside the collection is 100% accurate.
      With the "equal step" algorithm enabled, the sampling only finds the two documents with the lowest and highest Timestamps, which are Timestamp(1, 2001) and Timestamp(4294969, 3). The documents in between (which are all at the very low end of the range) are not found.
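
      The effect can be reproduced outside the server with a small simulation. The sketch below is not the implementation from SERVER-114237. It assumes that "equal step" means probing the Timestamp range at evenly spaced points and taking the first document at or after each probe. The helper names (ts, equal_step_sample), the probe count of 10, and the count of 100 clustered documents are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left


def ts(secs, inc):
    """Pack a (seconds, increment) pair into one 64-bit integer key."""
    return (secs << 32) | inc


def equal_step_sample(sorted_keys, num_probes):
    """Assumed model of equal-step sampling (requires num_probes >= 2).

    Probes the key range at evenly spaced points and returns the first
    key at or after each probe, with consecutive duplicates dropped.
    """
    lo, hi = sorted_keys[0], sorted_keys[-1]
    found = []
    for i in range(num_probes):
        probe = lo + (hi - lo) * i // (num_probes - 1)
        key = sorted_keys[bisect_left(sorted_keys, probe)]
        if not found or found[-1] != key:
            found.append(key)
    return found


# Skewed distribution resembling the test: a cluster of documents just
# above Timestamp(1, 2001) and a single document at Timestamp(4294969, 3).
# The cluster size of 100 is an assumption; the ticket only says "several".
keys = [ts(1, 2001 + i) for i in range(100)] + [ts(4294969, 3)]
samples = equal_step_sample(keys, num_probes=10)
```

      Under these assumptions every probe after the first lands far beyond the cluster, so samples contains only the lowest and the highest key, matching the behavior described above.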

            Assignee:
            Unassigned
            Reporter:
            Jan Steemann
