- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Query Optimization
- Fully Compatible
- ALL
- 200
The join optimization cost model estimates the number of on-disk pages for each collection from onDiskSizeBytes, computed as
recordStore->storageSize() - recordStore->freeStorageSize()
These values come from WiredTiger statistics (WT_STAT_DSRC_BLOCK_SIZE minus WT_STAT_DSRC_BLOCK_REUSE_BYTES). Even when identical data is restored via mongorestore --maintainInsertionOrder, WiredTiger's page layout is non-deterministic for several reasons:
- Page split points depend on cache pressure and eviction timing
- Checkpoint timing relative to insert batches is wall-clock driven
- Block allocation depends on the path-dependent free extent list
This causes onDiskSizeBytes to vary by small amounts between otherwise identical runs even on the same machine.
SERVER-122592 accounted for this variation when estimating the number of pages in the WiredTiger cache: it rounds onDiskSizeBytes / pageSizeBytes to the nearest power of 2^(1/4), giving a maximum error of 9.5%.
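The rounding step can be sketched roughly as follows. This is a minimal illustration, not the server's code: the helper name quantizePageCount, the clamp to one page, and the choice to pick the nearer neighbouring power by absolute distance are all assumptions, and the actual SERVER-122592 implementation may differ.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: snap a raw page count to a neighbouring power of
// 2^(1/4), so that small run-to-run differences in on-disk size usually
// map to the same value.
double quantizePageCount(double onDiskSizeBytes, double pageSizeBytes) {
    assert(pageSizeBytes > 0.0);
    const double rawPages = onDiskSizeBytes / pageSizeBytes;
    if (rawPages <= 1.0) {
        return 1.0;  // Assumption: anything up to one page counts as one page.
    }
    // Exponent (in quarter steps) of the largest power of 2^(1/4) <= rawPages.
    const double k = std::floor(4.0 * std::log2(rawPages));
    const double lo = std::exp2(k / 4.0);
    const double hi = std::exp2((k + 1.0) / 4.0);
    // Assumption: "nearest" means nearest by absolute distance.
    return (rawPages - lo <= hi - rawPages) ? lo : hi;
}
```

With a 4096-byte page, sizes of 1000 and 1003 pages both quantize to 1024 under this sketch. Inputs that straddle a rounding boundary can still land on different values, so quantization narrows the variation rather than removing it entirely.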
However, three call sites in join_cost_estimator_impl.cpp (costCollScanFragment(), costIndexScanFragment(), and costINLJ()) computed onDiskSizeBytes / pageSizeBytes directly, without quantization. These unquantized values fed into the Mackert-Lohman random I/O estimator, causing the cost model to produce slightly different cost estimates across runs. When the cost estimates for competing join strategies are close, this variation is enough to flip the winning plan.
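The sensitivity to the raw page count can be illustrated with a small sketch of the Mackert-Lohman approximation. The formula below follows the form documented in PostgreSQL's planner (index_pages_fetched in costsize.c); whether join_cost_estimator_impl.cpp uses exactly this form is an assumption, and the function and parameter names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical sketch of the Mackert-Lohman page-fetch estimate.
//   tuplesFetched (Ns): number of tuples fetched through the index
//   tablePages    (T) : number of pages in the collection
//   bufferPages   (b) : number of cache pages available
double mackertLohmanPagesFetched(double tuplesFetched,
                                 double tablePages,
                                 double bufferPages) {
    assert(tablePages > 0.0 && bufferPages > 0.0);
    const double Ns = tuplesFetched;
    const double T = tablePages;
    const double b = bufferPages;
    if (T <= b) {
        // Whole collection fits in cache: each page is read at most once.
        return std::min(2.0 * T * Ns / (2.0 * T + Ns), T);
    }
    // Point beyond which the cache is full and re-reads begin.
    const double lim = 2.0 * T * b / (2.0 * T - b);
    if (Ns <= lim) {
        return 2.0 * T * Ns / (2.0 * T + Ns);
    }
    return b + (Ns - lim) * (T - b) / T;
}
```

For example, with 500 tuples fetched and 2000 cache pages, a collection of 1000 pages yields an estimate of 400 page fetches in this sketch, while 1003 pages yields a slightly larger value. A difference of that size in the input is the kind of run-to-run variation described above.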
- is related to: SERVER-122592 Improve estimation of WT number of pages in buffer (Closed)