Type: Task
Resolution: Done
Affects Version/s: None
Component/s: None
Our default Bloom filter settings are 8 bits per record with 4 hashes. This has a false positive rate of ~2.4%.
For example, with an LSM tree consisting of 10 chunks, every 1000 LSM searches will do ~240 unnecessary btree searches (24 per chunk). For an out-of-cache workload, this is an overhead of ~24%: assuming all internal pages are in cache and none of the leaf pages are, we read ~1240 leaf pages when we only need 1000.
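For anyone who wants to check these numbers, here is a minimal sketch using the standard Bloom filter approximation p = (1 - e^(-k/b))^k, where b is bits per record and k is the number of hashes. The chunk and search counts are just the example values from above, nothing WiredTiger-specific:
{code:c}
#include <math.h>
#include <stdio.h>

/* Standard Bloom filter false positive approximation:
 * p = (1 - e^(-k/b))^k for b bits per record and k hashes. */
static double bloom_fp(double bits_per_record, double hashes)
{
    return pow(1.0 - exp(-hashes / bits_per_record), hashes);
}

int main(void)
{
    double p = bloom_fp(8, 4);          /* current defaults */
    int chunks = 10, searches = 1000;   /* example values from above */

    printf("false positive rate: %.4f\n", p);        /* ~0.0240 */
    printf("unnecessary btree searches: ~%.0f\n",
        p * chunks * searches);                      /* ~240 */
    return 0;
}
{code}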
If we changed the defaults to 16 bits per record with 8 hashes, the false positive rate would drop to 0.057%, so the same 1000 LSM searches would only do ~6 unnecessary btree searches, and the I/O overhead would drop to ~0.6%.
The tradeoff is that we would double the space required for Bloom filters and double the computation for each Bloom filter lookup. (We could do fractionally better with 12 hashes than 8, but that is 50% more computation to take the overhead from 0.6% to 0.5%.)
For reference, the riak tests currently use 28 bits per record with 19 hashes, which has a false positive rate of ~1.44e-06.
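Running the other configurations mentioned here through the same approximation reproduces the figures quoted above (again, this is just the textbook formula, not anything measured from the tree):
{code:c}
#include <math.h>
#include <stdio.h>

/* Same approximation as above, evaluated for each configuration
 * discussed in this ticket. */
static double bloom_fp(double bits_per_record, double hashes)
{
    return pow(1.0 - exp(-hashes / bits_per_record), hashes);
}

int main(void)
{
    const struct { double bits, hashes; const char *desc; } cfgs[] = {
        { 8,  4,  "current default" },    /* ~2.40e-02 */
        { 16, 8,  "proposed default" },   /* ~5.75e-04 */
        { 16, 12, "8 -> 12 hashes" },     /* ~4.66e-04 */
        { 28, 19, "riak tests" },         /* ~1.44e-06 */
    };
    size_t i;

    for (i = 0; i < sizeof(cfgs) / sizeof(cfgs[0]); i++)
        printf("%2.0f/%-2.0f (%s): p = %.2e\n",
            cfgs[i].bits, cfgs[i].hashes, cfgs[i].desc,
            bloom_fp(cfgs[i].bits, cfgs[i].hashes));
    return 0;
}
{code}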
The change is trivial to make, but what are your thoughts on the implications of doubling the default space required for Bloom filters?
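As a side note, whatever we pick for the defaults, applications can already override them per-tree at create time. Here is a sketch of what that looks like through the C API; the bloom_bit_count / bloom_hash_count configuration keys and the lsm=() grouping are my reading of the docs, so verify the exact names against the current configuration reference before relying on this:
{code:c}
#include <stdlib.h>
#include <wiredtiger.h>

int main(void)
{
    WT_CONNECTION *conn;
    WT_SESSION *session;

    /* "WT_HOME" is a placeholder home directory for this example. */
    if (wiredtiger_open("WT_HOME", NULL, "create", &conn) != 0)
        return EXIT_FAILURE;
    conn->open_session(conn, NULL, NULL, &session);

    /* Override the Bloom defaults for one LSM tree: 16 bits per
     * record, 8 hashes. Key names assumed from the docs. */
    session->create(session, "table:example",
        "type=lsm,key_format=S,value_format=S,"
        "lsm=(bloom_bit_count=16,bloom_hash_count=8)");

    conn->close(conn, NULL);
    return EXIT_SUCCESS;
}
{code}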