[SERVER-61599] Evaluate prefix compression for clustered collections
Created: 18/Nov/21 | Updated: 07/Jan/22 | Resolved: 07/Jan/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Louis Williams | Assignee: | Daniel Gomez Ferro |
| Resolution: | Done | Votes: | 0 |
| Labels: | PM-2311-M2 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Sprint: | Execution Team 2021-12-13, Execution Team 2021-12-27, Execution Team 2022-01-10 | ||||||||
| Participants: | |||||||||
| Description |
|
We use WT prefix compression for indexes. Evaluate the tradeoff of storage savings and the performance impact of enabling this for clustered collections. |
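For readers unfamiliar with the technique, here is a minimal sketch of prefix compression over sorted keys: each key is stored as the number of leading bytes it shares with the previous key, plus the remaining suffix. This is only an illustration of the general idea, not WiredTiger's actual on-disk format; the function names and sample keys are hypothetical.

```python
from typing import List, Tuple


def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Return the number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_compress(sorted_keys: List[bytes]) -> List[Tuple[int, bytes]]:
    """Encode each key as (bytes shared with the previous key, suffix)."""
    entries = []
    prev = b""
    for key in sorted_keys:
        n = shared_prefix_len(prev, key)
        entries.append((n, key[n:]))
        prev = key
    return entries


def prefix_decompress(entries: List[Tuple[int, bytes]]) -> List[bytes]:
    """Rebuild full keys; each key depends on the one decoded before it."""
    keys = []
    prev = b""
    for n, suffix in entries:
        key = prev[:n] + suffix
        keys.append(key)
        prev = key
    return keys


# Hypothetical sample keys, already in sorted order.
keys = [b"sensor-0001", b"sensor-0002", b"sensor-0010"]
entries = prefix_compress(keys)
restored = prefix_decompress(entries)
```

The sketch also hints at where a performance cost can come from: a key cannot be read in isolation, because decoding it means rebuilding it from the keys that precede it.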
| Comments |
| Comment by Daniel Gomez Ferro [ 07/Jan/22 ] |
|
Prefix compression for clustered collections is not universally useful. Opened SERVER-62421 to enable it in conjunction with other configuration changes so that it only triggers when it's advantageous. |
| Comment by Daniel Gomez Ferro [ 07/Jan/22 ] |
|
louis.williams I couldn't find that information in the benchmark runs, so I recreated the TSBS datasets manually. The size difference as reported by db.system.buckets.point_data.stats() is small: TSBS With compression: 227610624. I also had some data for the YCSB datasets; in that case the size is larger with compression enabled, presumably because the shared prefix is small (just 4 bytes). YCSB With compression: 75890688 |
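To make the point about short shared prefixes concrete, the following rough sketch estimates the fraction of key bytes that prefix compression could remove at best. It is a toy model under stated assumptions: the keys are hypothetical, and it ignores per-entry bookkeeping, page layout, and block compression, so it does not explain the measured sizes above. It only shows that a 4-byte shared prefix leaves little room for savings.

```python
from typing import List


def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Return the number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def saved_fraction(sorted_keys: List[bytes]) -> float:
    """Fraction of total key bytes shared with the preceding key."""
    total = sum(len(k) for k in sorted_keys)
    if total == 0:
        return 0.0
    saved = sum(
        shared_prefix_len(prev, cur)
        for prev, cur in zip(sorted_keys, sorted_keys[1:])
    )
    return saved / total


# Hypothetical keys: adjacent keys share a long prefix.
long_prefix_keys = [
    b"host_0001/2021-12-27T00:00",
    b"host_0001/2021-12-27T00:10",
    b"host_0001/2021-12-27T00:20",
]

# Hypothetical keys: only the 4-byte "user" prefix is shared.
short_prefix_keys = [
    b"user10934820",
    b"user55210077",
    b"user87401263",
]

long_ratio = saved_fraction(long_prefix_keys)
short_ratio = saved_fraction(short_prefix_keys)
```

In this model the long-prefix keys share well over half of their bytes with their neighbours, while the short-prefix keys share under a quarter, so any fixed per-key cost of compression is much harder to recoup in the second case.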
| Comment by Louis Williams [ 04/Jan/22 ] |
|
daniel.gomezferro do these tests report the storage size? It would be useful information for weighing the time/space tradeoff here. |
| Comment by Daniel Gomez Ferro [ 27/Dec/21 ] |
|
I ran a perf build with prefix_compression enabled by default for clustered collections (including time series collections): https://spruce.mongodb.com/version/61c9a110562343577524cac7/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC The results for TSBS are a bit worse: Are there other workloads I could try this on? Should we enable it by default only for clustered collections but not for time series collections? |