Core Server / SERVER-54722

Improve Sorter checksumming

    • Type: Improvement
    • Resolution: Won't Do
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Performance
    • Labels: None
    • Storage Execution

      Right now we checksum each KV pair independently using murmur3 and then, after finishing a file, validate that the checksums match. There are a few issues with this:

      • Murmur3 is a mediocre hash function at this point, both for performance and for error detection
        • Fix: use crc32c (from WiredTiger)
      • Hashing small chunks of data is slower than hashing big chunks
        • Fix: We are already producing buffers of data for compression purposes. We should do the checksumming on the big buffer either before or after compression. Doing it before compression verifies that decompression produced the right result, but doing it after compression both checksums less data and avoids sending garbage into the decompressor. Since we trust Snappy to decompress correctly when fed good input, I think checksumming after compression makes sense.
      • We wait until we finish whole files to check the checksums. This (1) wastes work we could have avoided by aborting earlier; (2) risks sending garbage data to consumers who aren't prepared for it; and (3) assumes we will actually reach the end of the file, which consumers like TopKSorter are unlikely to do.
        • Fix: Check the checksum immediately after reading a chunk from the file (and after decompression, if the checksum was computed prior to compression).
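The proposed scheme can be sketched roughly as follows. This is a minimal sketch, not the actual Sorter code: the chunk framing and function names are hypothetical, `zlib` stands in for Snappy, and Python's `zlib.crc32` stands in for crc32c (a different CRC polynomial). The point is the shape of the fix: checksum the whole compressed buffer on write, and verify each chunk immediately after reading it, before decompression.

```python
import struct
import zlib

# Hypothetical chunk frame: 4-byte compressed length, 4-byte checksum,
# then the compressed payload.
CHUNK_HEADER = struct.Struct("<II")

def write_chunk(out: bytearray, kv_buffer: bytes) -> None:
    """Compress a whole buffer of KV pairs, then checksum the
    *compressed* bytes, so a reader can reject a corrupt chunk
    before feeding it to the decompressor."""
    compressed = zlib.compress(kv_buffer)  # stand-in for Snappy
    out += CHUNK_HEADER.pack(len(compressed), zlib.crc32(compressed))
    out += compressed

def read_chunks(data: bytes):
    """Yield decompressed chunks, verifying each checksum immediately
    after the read; a mismatch aborts right away instead of waiting
    until end-of-file."""
    offset = 0
    while offset < len(data):
        length, checksum = CHUNK_HEADER.unpack_from(data, offset)
        offset += CHUNK_HEADER.size
        payload = data[offset:offset + length]
        offset += length
        if zlib.crc32(payload) != checksum:
            raise ValueError("chunk checksum mismatch")
        yield zlib.decompress(payload)
```

Because the checksum covers the compressed bytes, a partial consumer like TopKSorter validates exactly the chunks it reads and no more, and corruption is caught per chunk rather than per file.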

            Assignee:
            backlog-server-execution [DO NOT USE] Backlog - Storage Execution Team
            Reporter:
            mathias@mongodb.com Mathias Stearn
