[SERVER-41905] Add in-memory checksumming to external sorter Created: 25/Jun/19  Updated: 29/Oct/23  Resolved: 24/Jul/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.1.14

Type: Task Priority: Major - P3
Reporter: Louis Williams Assignee: Gabriel Villasana (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Sprint: Execution Team 2019-07-01, Execution Team 2019-07-15, Execution Team 2019-07-29
Participants:

 Description   

When spilling data to disk in the external sorter, we need to verify that the data read back from disk is intact.

In its simplest form, this could be implemented as a modular sum, which would ignore the ordering of the data, but a position-dependent checksum would be preferable if possible.
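To illustrate the difference between the two options, here is a minimal Python sketch. It is not taken from the server code: `modular_sum` is a hypothetical helper, and `zlib.crc32` (standard CRC-32) stands in for whichever position-dependent checksum is eventually chosen.

```python
import zlib


def modular_sum(data: bytes, modulus: int = 2**32) -> int:
    """Hypothetical helper: sum of all bytes modulo `modulus` (order-insensitive)."""
    return sum(data) % modulus


original = b"key1key2key3"
reordered = b"key3key2key1"  # same bytes, different order

# The modular sum cannot tell the two buffers apart.
same_sum = modular_sum(original) == modular_sum(reordered)

# CRC-32 depends on byte positions, so the reordering changes the checksum.
same_crc = zlib.crc32(original) == zlib.crc32(reordered)

print("modular sums equal:", same_sum)
print("CRC-32 values equal:", same_crc)
```

A modular sum would therefore miss any corruption that only reorders the spilled data, which is the weakness mentioned above.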



 Comments   
Comment by Githook User [ 24/Jul/19 ]

Author: Gabe Villasana (username: villagab4, email: villagab4@gmail.com)

Message: SERVER-41905 Add in-memory checksumming to external sorter
Branch: master
https://github.com/mongodb/mongo/commit/f50faf0265a66ee7811d01701ce1c9fb91bdce90

Comment by Geert Bosch [ 08/Jul/19 ]

Correction: we already do Snappy compression here.

Comment by Geert Bosch [ 26/Jun/19 ]

Thinking about this more, we should probably hold off on using any kind of compression for the time being. Just checksum. As a follow-up to this project, we should first consider using prefix compression: we're writing out non-trivial numbers of KeyString values in sorted order, so there are going to be repeated prefixes. After that, it might still be worth using a streaming compressor as a further follow-up.
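As a rough illustration of the prefix compression idea, the following Python sketch stores each key as the length of the prefix it shares with the previous key plus the remaining suffix. The function names and sample keys are hypothetical, and plain byte strings stand in for KeyString values; this is not how the sorter encodes data today.

```python
def prefix_compress(sorted_keys):
    """Encode each key as (shared_prefix_length, suffix) relative to the previous key."""
    entries = []
    prev = b""
    for key in sorted_keys:
        shared = 0
        limit = min(len(prev), len(key))
        while shared < limit and prev[shared] == key[shared]:
            shared += 1
        entries.append((shared, key[shared:]))
        prev = key
    return entries


def prefix_decompress(entries):
    """Rebuild the original keys from (shared_prefix_length, suffix) pairs."""
    keys = []
    prev = b""
    for shared, suffix in entries:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys


keys = [b"user:alice", b"user:alicia", b"user:bob"]
compressed = prefix_compress(keys)
restored = prefix_decompress(compressed)
```

Because the keys are written in sorted order, neighbouring keys tend to share long prefixes, so most entries carry only a short suffix.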

Comment by Louis Williams [ 26/Jun/19 ]

geert.bosch, the modular sum idea arose because we were unsure how complex it would be to read the data back from disk in the same order in which it was written. We wouldn't want to add an additional pass over the data on disk. However, from my understanding of how the sorter works now, I don't think it will be very difficult.
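A minimal Python sketch of checksumming without an extra pass, assuming records are read back in the order they were written: the checksum is updated in memory as each record is spilled, then recomputed incrementally during the normal read. `ChecksummingWriter`, `read_and_verify`, and the length-prefixed record layout are hypothetical, and `zlib.crc32` is only a stand-in checksum.

```python
import io
import zlib


class ChecksummingWriter:
    """Writes length-prefixed records and keeps a running checksum in memory."""

    def __init__(self, out):
        self._out = out
        self.checksum = 0

    def write(self, record: bytes) -> None:
        # Update the running checksum with the payload as it is written.
        self.checksum = zlib.crc32(record, self.checksum)
        self._out.write(len(record).to_bytes(4, "little") + record)


def read_and_verify(inp, expected_checksum: int):
    """Reads records in write order, recomputing the checksum in the same pass."""
    checksum = 0
    records = []
    while True:
        header = inp.read(4)
        if not header:
            break
        record = inp.read(int.from_bytes(header, "little"))
        checksum = zlib.crc32(record, checksum)
        records.append(record)
    if checksum != expected_checksum:
        raise ValueError("checksum mismatch: spilled data is corrupt")
    return records


spill = io.BytesIO()  # stands in for the on-disk spill file
writer = ChecksummingWriter(spill)
for rec in (b"a", b"bb", b"ccc"):
    writer.write(rec)

spill.seek(0)
records = read_and_verify(spill, writer.checksum)
```

In this sketch only the record payloads are covered by the checksum; a real implementation might also cover the length headers.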

Comment by Geert Bosch [ 26/Jun/19 ]

Please use something halfway decent; never use a modular sum for anything. These days we have fast, hardware-accelerated CRC32C; see op_msg.cpp and wiredtiger_crc32c_func. Or use another hash function that we're already using, such as wiredtiger_crc32c_func. Should we actually be using a compressor here and have it guarantee some consistency?
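For reference, a bit-by-bit software sketch of CRC32C (the Castagnoli polynomial, reflected form 0x82F63B78) in Python. This is only an illustration of the checksum itself; it is far slower than the hardware-accelerated routines mentioned above and is not the code in op_msg.cpp or WiredTiger. The function name and signature are assumptions.

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32C; pass a previous result as `crc` to continue a running checksum."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; apply the reflected polynomial if the low bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


# Standard check value for the ASCII string "123456789".
check = crc32c(b"123456789")

# Feeding the data in two chunks gives the same result as one call.
chunked = crc32c(b"6789", crc32c(b"12345"))
```

The chunked form matters for a sorter because the checksum can be updated as each block is written or read, without buffering the whole spill.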
