[Join Optimization] Non-deterministic join plan selection due to use of oscillating onDiskSizeBytes in cost model


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 9.0.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Query Optimization
    • Fully Compatible
    • ALL
    • 200

      The join optimization cost model uses onDiskSizeBytes, computed as 

      recordStore->storageSize() - recordStore->freeStorageSize()

      to estimate the number of on-disk pages for each collection. These values come from WiredTiger statistics (WT_STAT_DSRC_BLOCK_SIZE minus WT_STAT_DSRC_BLOCK_REUSE_BYTES). Even when restoring identical data via mongorestore --maintainInsertionOrder, WiredTiger's page layout is non-deterministic for several reasons:

      • Page split points depend on cache pressure and eviction timing
      • Checkpoint timing relative to insert batches is wall-clock driven
      • Block allocation depends on the path-dependent free extent list

      This causes onDiskSizeBytes to vary by small amounts between otherwise identical runs even on the same machine.
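
      As a rough illustration of how a small byte-level difference becomes a different page estimate, here is a minimal sketch. The StoreSizes struct, the helper names, and the 32 KiB page size are hypothetical stand-ins for this example, not the server's actual types or configuration.

```cpp
#include <cstdint>

// Hypothetical stand-in for the two sizes the ticket reads from the record store.
struct StoreSizes {
    std::int64_t storageSizeBytes;      // value of recordStore->storageSize()
    std::int64_t freeStorageSizeBytes;  // value of recordStore->freeStorageSize()
};

// onDiskSizeBytes as described in the ticket: allocated bytes minus reusable bytes.
inline std::int64_t onDiskSizeBytes(const StoreSizes& s) {
    return s.storageSizeBytes - s.freeStorageSizeBytes;
}

// Unquantized page estimate: every byte of drift moves the result.
inline double rawPageEstimate(const StoreSizes& s, std::int64_t pageSizeBytes) {
    return static_cast<double>(onDiskSizeBytes(s)) /
           static_cast<double>(pageSizeBytes);
}
```

      With an assumed 32 KiB page size, two restores whose storage sizes differ by 256 KiB produce estimates of 3072 and 3080 pages for what is logically the same collection.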

      We accounted for this variation as part of SERVER-122592 in the estimation of the number of pages in the WiredTiger cache by rounding onDiskSizeBytes / pageSizeBytes to the nearest power of 2^(1/4), giving a maximum of 9.5% error.
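
      The quantization step can be sketched as follows. This is an illustration only: the function name quantizePages, the choice to round in log space, and the handling of non-positive input are assumptions, and the actual SERVER-122592 implementation may differ.

```cpp
#include <cmath>

// Round a positive page count to the nearest power of 2^(1/4).
// "Nearest" is measured in log space here (an assumption of this sketch).
inline double quantizePages(double pages) {
    if (!(pages > 0.0)) {
        return 0.0;  // assumption: non-positive or NaN input maps to 0
    }
    const double stepsPerDoubling = 4.0;  // 2^(1/4) means four buckets per doubling
    const double exponent = std::round(std::log2(pages) * stepsPerDoubling);
    return std::exp2(exponent / stepsPerDoubling);
}
```

      Under these assumptions, raw estimates of 3072 and 3080 pages both land in the 2^(46/4) bucket, so downstream costs see the same value. Two runs that straddle a bucket boundary would still quantize differently, so this narrows the window for divergence rather than closing it.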

      However, three call sites in join_cost_estimator_impl.cpp (costCollScanFragment(), costIndexScanFragment(), and costINLJ()) computed onDiskSizeBytes / pageSizeBytes directly, without quantization. These unquantized values fed into the Mackert-Lohman random I/O estimator, so the cost model produced slightly different cost estimates across runs. When the cost estimates for competing join strategies are close, this variation is enough to flip the winning plan.

            Assignee:
            Ben Shteinfeld
            Reporter:
            Ben Shteinfeld