[SERVER-77705] Replace MurmurHash3 with crc32c in sorter.cpp Created: 01/Jun/23 Updated: 14/Dec/23 Resolved: 07/Dec/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 7.3.0-rc0 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Dan Larkin-York | Assignee: | Ivan Fefer |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||
| Assigned Teams: |
Query Execution
|
||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||
| Sprint: | QE 2023-11-13 | ||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||
| Linked BF Score: | 35 | ||||||||||||||||||||||||
| Description |
|
MurmurHash3 is a pretty old hash function that's slow and produces low-quality hashes compared to modern alternatives like absl::Hash (CityHash). We want to replace any uses of MurmurHash3 which are for in-memory use only, as we expect this to yield quick perf wins. Hash values that are persisted to disk across restarts or sent across the network between servers are likely unsafe to change, and should be annotated with a clear explanation for why the usage requires a stable hash computation. |
| Comments |
| Comment by Githook User [ 14/Dec/23 ] |
|
Author: {'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}Message:
GitOrigin-RevId: 944882fabecca51362327959f828332c72060155 |
| Comment by Billy Donahue [ 12/Dec/23 ] |
|
Hi, this broke the macOS unit test build. Thx! I think you'll find that the problem is an overload ambiguity when you give size_t to a function with integer overloads and there's no exact match. size_t != uint64_t . |
| Comment by Githook User [ 07/Dec/23 ] |
|
Author: {'name': 'Ivan Fefer', 'email': 'ivan.fefer@mongodb.com', 'username': 'Fefer-Ivan'}Message: GitOrigin-RevId: f699cb2bdaebacae374d315ec5e15e5b3bb57332 |
| Comment by Ivan Fefer [ 31/Oct/23 ] |
|
Custom tests of crc32c shows that this hash is 7-9% faster than CityHash and this is the hash to use for checksum validation. However, to properly use it, we need WiredTiger to expose API for piecewise/seeded computation of crc32c: For now, I will keep CityHash and continue working on supporting and testing versioning mechanism for compatibility. We can swap the hash right before committing to master. |
| Comment by Ivan Fefer [ 30/Oct/23 ] |
|
Good news is that the file format is the same - we still write 64-bit integer to represent check sum, so we only need a new field that represents the version. We can default this field to the old hash and have the sorter to use the new version only under a feature flag. The caveat is that there is a validator on the field that asserts that it is >= 0 (which will not be the case with a full 8 byte cache). > When downgrading the FCV, we can just cancel any active index builds. Do you know if this behavior is already present (I know there are some FCV change hooks) or we have to add it? |
| Comment by Louis Williams [ 30/Oct/23 ] |
Yeah, this was essentially the anticipated problem. We could backport a change to previous versions, but that would impose a downgrade floor, which we generally don't like doing. Or at least we don't want to do that unless it is absolutely necessary. If we think we can get a reasonable perf speedup, we can make this change across two releases: we commit something in 8.0.0 to recognize the new sorter format, and in 9.0 we can change the sorter to use the new format without worrying about downgrade. But I don't think this change warrants a downgrade floor. We could also consider FCV-gating this new file format. If we have an 8.0 binary with FCV 8.0, we'll use the new format and new hash function. When downgrading the FCV, we can just cancel any active index builds. If we have an 8.0 binary with the last-stable or last-continous FCV, we'll use the older hash function and old sorter file format. In this way, we could commit something once to 8.0.
Your results are promising, I was just reiteration Dan's observation that we shouldn't use hash functions that are seeded with random values. |