[SERVER-61599] Evaluate prefix compression for clustered collections Created: 18/Nov/21  Updated: 07/Jan/22  Resolved: 07/Jan/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Louis Williams Assignee: Daniel Gomez Ferro
Resolution: Done Votes: 0
Labels: PM-2311-M2
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File tsbs_query.png    
Issue Links:
Related
is related to SERVER-62421 Enable prefix compression on clustere... Backlog
Sprint: Execution Team 2021-12-13, Execution Team 2021-12-27, Execution Team 2022-01-10
Participants:

 Description   

We use WT prefix compression for indexes. Evaluate the tradeoff of storage savings and the performance impact of enabling this for clustered collections.
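
For readers unfamiliar with the technique, the tradeoff can be sketched with a toy model. This is not WiredTiger's on-disk format: the one-byte per-key header, the helper names, and the sample keys are all assumptions made for illustration. It only shows that the savings depend on how much of each key is shared with the previous key in sort order.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def raw_size(keys) -> int:
    """Total bytes when every key is stored in full."""
    return sum(len(k) for k in keys)


def prefix_compressed_size(sorted_keys, per_key_overhead: int = 1) -> int:
    """Total bytes when each key stores only the suffix that differs from
    the previous key, plus a hypothetical fixed header per key."""
    total = 0
    prev = b""
    for key in sorted_keys:
        shared = common_prefix_len(prev, key)
        total += per_key_overhead + len(key) - shared
        prev = key
    return total


# Hypothetical keys with a long shared prefix (only the last digits vary).
long_prefix_keys = sorted(
    b"sensor-0001/2021-12-27T00:00:%02d" % i for i in range(60)
)

# Hypothetical keys that differ in the very first byte (nothing shared).
no_prefix_keys = sorted(bytes([i]) + b"payload" for i in range(60))

for name, keys in [("long prefix", long_prefix_keys), ("no prefix", no_prefix_keys)]:
    print(name, raw_size(keys), prefix_compressed_size(keys))
```

In this model the long-prefix keys shrink substantially, while the keys with no shared prefix grow by the per-key header, which is why the evaluation needs to look at real workloads.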



 Comments   
Comment by Daniel Gomez Ferro [ 07/Jan/22 ]

Prefix compression for clustered collections is not universally useful. Opened SERVER-62421 to enable it in conjunction with other configuration changes so that it only triggers when it's advantageous.

Comment by Daniel Gomez Ferro [ 07/Jan/22 ]

louis.williams I couldn't find that info on the benchmark runs, but I recreated the TSBS datasets manually. The size difference as reported by db.system.buckets.point_data.stats() is small:

TSBS

With compression: 227610624
Without compression: 228003840
Difference: 393216 (~400 KB, 0.17% reduction)

I also had some data for YCSB datasets. In that case the size is larger with compression enabled, presumably because the shared prefix is small (just 4 bytes).

YCSB

With compression: 75890688
Without compression: 68009984
Difference: -7880704 (~7.5 MB, 11.6% increase)
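
The differences and percentages above can be rechecked with a short script. This is a minimal sketch that assumes the figures are sizes in bytes and that the percentage is taken relative to the uncompressed size; the helper name is hypothetical.

```python
def size_change(with_compression: int, without_compression: int):
    """Return (bytes saved, percent saved) relative to the uncompressed size.
    Negative values mean the collection grew with compression enabled."""
    saved = without_compression - with_compression
    percent = 100.0 * saved / without_compression
    return saved, percent


# Figures copied from the comment above.
tsbs = size_change(227610624, 228003840)
ycsb = size_change(75890688, 68009984)

print("TSBS: %d bytes saved (%.2f%%)" % tsbs)
print("YCSB: %d bytes saved (%.2f%%)" % ycsb)
```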

Comment by Louis Williams [ 04/Jan/22 ]

daniel.gomezferro do these tests report the storage size? It would be useful information for weighing the time/space tradeoff here.

Comment by Daniel Gomez Ferro [ 27/Dec/21 ]

I ran a perf build with prefix_compression enabled by default for clustered collections (including time series collections): https://spruce.mongodb.com/version/61c9a110562343577524cac7/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC

The results for TSBS are a bit worse (see the attached tsbs_query.png).

Are there other workloads I could try this on? Should we enable it by default only for clustered collections but not for time series collections?
