[SERVER-41905] Add in-memory checksumming to external sorter Created: 25/Jun/19 Updated: 29/Oct/23 Resolved: 24/Jul/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.14 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Louis Williams | Assignee: | Gabriel Villasana (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Execution Team 2019-07-01, Execution Team 2019-07-15, Execution Team 2019-07-29 |
| Participants: |
| Description |
|
When spilling data to disk in the external sorter, we need to verify that the data read back from disk is intact. The simplest implementation would be a modular sum, which is insensitive to the ordering of the data, but a position-dependent checksum would be better if possible. |
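To illustrate the trade-off described above, here is a minimal sketch comparing a modular sum with a position-dependent checksum. It is not the server's implementation: `modular_sum` and `crc32c` are hypothetical helper names, and the CRC32C here is a slow bitwise software version (reflected Castagnoli polynomial 0x82F63B78) written only for illustration.

```python
def modular_sum(data: bytes) -> int:
    """Sum of all bytes modulo 2**32; the result ignores byte order."""
    return sum(data) & 0xFFFFFFFF


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC32C (Castagnoli); illustrative only, not hardware accelerated."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; fold in the polynomial when the low bit is set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


# Two hypothetical spilled keys, written in one order and read back in the other.
first, second = b"key-0001", b"key-0002"
written = first + second
reordered = second + first

# The modular sum cannot tell the two orderings apart.
sums_match = modular_sum(written) == modular_sum(reordered)
# The CRC depends on position, so the reordering changes it.
crcs_match = crc32c(written) == crc32c(reordered)
```

In this sketch `sums_match` is true and `crcs_match` is false, which is the weakness of the modular sum that the description points out.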
| Comments |
| Comment by Githook User [ 24/Jul/19 ] |
|
Author: Gabe Villasana (username: villagab4, email: villagab4@gmail.com) Message: |
| Comment by Geert Bosch [ 08/Jul/19 ] |
|
Correction, we already do Snappy compression here. |
| Comment by Geert Bosch [ 26/Jun/19 ] |
|
Thinking about this more, we probably should hold off on using any kind of compression for the time being. Just checksum. As a follow-up to this project we should first consider using prefix compression: we're writing out non-trivial numbers of KeyString values in sorted order, so there are going to be repeated prefixes. Then it might still be worth using a streaming compressor as a follow-up to that. |
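As a rough sketch of the prefix-compression idea mentioned above: each key in a sorted run is stored as the length of the prefix it shares with the previous key plus the remaining suffix. The function names and the sample keys are hypothetical, and real KeyString values are binary and would need their own framing.

```python
def prefix_compress(sorted_keys):
    """Encode each key as (shared_prefix_length, suffix) relative to the previous key."""
    entries = []
    prev = b""
    for key in sorted_keys:
        shared = 0
        limit = min(len(prev), len(key))
        while shared < limit and prev[shared] == key[shared]:
            shared += 1
        entries.append((shared, key[shared:]))
        prev = key
    return entries


def prefix_decompress(entries):
    """Rebuild the original keys by reusing the prefix of the previously decoded key."""
    keys = []
    prev = b""
    for shared, suffix in entries:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys


# Hypothetical sorted keys with long shared prefixes.
keys = [b"user:alice", b"user:alicia", b"user:bob"]
entries = prefix_compress(keys)
restored = prefix_decompress(entries)
```

Because decoding depends on the previous key, this scheme only works when the data is read back in the order it was written, which sorted spill files already require.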
| Comment by Louis Williams [ 26/Jun/19 ] |
|
geert.bosch the modular sum idea originated because we were unsure how complex it would be to read the data back from disk in the same order in which it was written. We wouldn't want to add an additional pass over the data on disk. However, from my understanding of how the sorter works now, I don't think it will be very difficult. |
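The point about avoiding an extra pass can be sketched as follows: if records are read back in the order they were written, a running checksum can be updated during the write and recomputed during the normal read. This is an illustration under assumptions, not the sorter's code: `spill` and `read_back` are hypothetical names, the length-prefixed framing is invented for the example, and `zlib.crc32` (plain CRC-32, not CRC32C) stands in for the checksum.

```python
import io
import struct
import zlib


def spill(records, out):
    """Write length-prefixed records and return the running checksum of the bytes written."""
    crc = 0
    for rec in records:
        framed = struct.pack("<I", len(rec)) + rec
        out.write(framed)
        crc = zlib.crc32(framed, crc)  # update the checksum as data is written
    return crc


def read_back(inp):
    """Read records in write order, recomputing the checksum in the same pass."""
    crc = 0
    records = []
    while True:
        header = inp.read(4)
        if len(header) < 4:
            break
        (length,) = struct.unpack("<I", header)
        body = inp.read(length)
        crc = zlib.crc32(header + body, crc)
        records.append(body)
    return records, crc


buf = io.BytesIO()
expected = spill([b"alpha", b"beta", b"gamma"], buf)

# Intact read: the checksum computed while reading matches the one from writing.
records, actual = read_back(io.BytesIO(buf.getvalue()))
intact_ok = actual == expected

# Simulated corruption: flip the first payload byte (offset 4, after the length header).
damaged = bytearray(buf.getvalue())
damaged[4] ^= 0xFF
_, damaged_crc = read_back(io.BytesIO(bytes(damaged)))
corruption_detected = damaged_crc != expected
```

The checksum is computed over the same byte stream in both directions, so verification costs no additional pass over the data on disk.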
| Comment by Geert Bosch [ 26/Jun/19 ] |
|
Please use something halfway decent; never use a modular sum for anything. These days we have fast hardware-accelerated CRC32C, see op_msg.cpp and wiredtiger_crc32c_func. Or use another hash function that we're already using, such as wiredtiger_crc32c_func. Should we actually be using a compressor here and have it guarantee some consistency? |