[SERVER-18396] Expose LSM configuration options to the engine, index and collection creation commands
Created: 08/May/15  Updated: 07/Apr/23  Resolved: 06/Aug/19

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: None
Fix Version/s: None

Type: New Feature
Priority: Major - P3
Reporter: Daniel Pasette (Inactive)
Assignee: Backlog - Storage Engines Team
Resolution: Won't Fix
Votes: 18
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-23423 wiredtiger LSM engine support? Closed
Assigned Teams:
Storage Engines
 Description   

We first need to determine the exact settings that need to be exposed.

The settings also need to be available to wiredTigerCollectionConfigString and wiredTigerIndexConfigString.



 Comments   
Comment by Brian Lane [ 06/Aug/19 ]

I am closing this issue; we can revisit it if we create a project to expose LSM in the future.

-Brian

Comment by Юрий Соколов [ 13/Mar/19 ]

> Creating default settings for LSM trees that are good for a variety of users is difficult.

LSM theory keeps moving forward. It looks like researchers are now exploring adaptive algorithms: https://stratos.seas.harvard.edu/files/stratos/files/dostoevskykv.pdf

Comment by Alireza [ 15/Sep/17 ]

@Alexander Gorrod, Thanks for the fast response.

I think the reasons you mentioned will not be an issue for a team experienced with MongoDB. For example, in our use case we handle a very high TPS in Redis until the records stabilize (mostly 15 minutes or so after the transaction occurs). Then a background process inserts them into MongoDB. So what we have here is almost pure DB writes, with no fast reads or updates (most reads and updates occur early, while the records are still in Redis, not in MongoDB). In this scenario, it seems to me that LSM provides the best of all worlds. There is enough time for the background index creation and tree balancing to happen after documents are inserted, since they will rarely (if ever) be read or updated after the 15-minute interval, and we also get faster, linearly scalable insertion.

So all in all, it is definitely a design decision, but I guess it would be safe just to document the different scenarios for LSM and let users choose the settings that serve them best (WiredTiger has had LSM support since 2.3, if I'm not wrong, so one can already be familiar with the settings).

On your last point, though, about the number of files relative to the number of collections: I don't know, this one seems serious, but can it still be a problem for people who won't host that many collections in their databases?

All in all, I think this can be officially supported, with a huge warning that you should know what you're doing and that LSM fits your use case. An optional document on pitfalls and fallacies would also be great.

Comment by Alexander Gorrod [ 15/Sep/17 ]

SpiXy thanks for asking. We have not enabled official support for this feature for a number of reasons:

  • In our testing there are a very limited number of scenarios where MongoDB provides better characteristics when using LSM than the WiredTiger btree implementation. We want MongoDB users to have a good experience when trying out LSM.
  • Creating default settings for LSM trees that are good for a variety of users is difficult.
  • There is a higher threshold for managing LSM trees from an operational point of view, due to the additional files used. We are working on other projects to mitigate those requirements, and would like them to be completed prior to enabling LSM support.
Comment by Alireza [ 14/Sep/17 ]

Any timing for this feature?
It's been open for more than two years, but there is still no news on when we can expect this.

Comment by Michael Lee [ 20/Jul/17 ]

Typical LSM use case here - very high volume of inserts and occasional queries against that data.

Comment by Mike Cahill [X] [ 16/Jan/17 ]

Alexander -

Thanks for the quick reply!

It's a classic LSM use case - very high volume of inserts and a small number of retrievals against that data.

Comment by Alexander Gorrod [ 16/Jan/17 ]

Mcahill These JIRA tickets are the best place to express your interest. Can you give us a description of your use case and why you think LSM data storage format would help?

Comment by Mike Cahill [X] [ 13/Jan/17 ]

Is there a venue to discuss scheduling of this feature, or is voting the only mechanism? Thanks.

BTW, I'm not that Mike Cahill.

Comment by Alexander Gorrod [ 03/Apr/16 ]

flozano We do intend to implement a solution for that - the work is tracked in SERVER-17675.

Comment by Francisco Alejandro Lozano López [ 03/Apr/16 ]

Do you think you could consider making this version of WiredTiger "friendly" to a huge number (millions) of collections? In many cases, the collection is a very natural abstraction, but the current WiredTiger model of one file per index/collection doesn't allow taking advantage of it.

Comment by Alexander Gorrod [ 12/May/15 ]

Following are my recommendations for which WiredTiger LSM configuration options to expose, and a stab at choosing default values for them.

There are different places in the WiredTiger API that expose LSM tuning parameters. I've split them into two tables below.

For wiredtiger_open

Corresponds to MongoDB database create or open.

Setting | Exposed | Default | Reason
lsm_manager=(worker_thread_max) | Yes | 8 | Gives lower throughput applications a way to limit the amount of background maintenance that happens for LSM trees.
lsm_merge= | No | As per WiredTiger (true) | Disables background merging (compaction) for LSM trees. This is really a benchmark optimization, not recommended for regular users.
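
As an illustration of what exposing the engine-level setting could look like, here is a minimal sketch that renders nested options into WiredTiger's `key=(subkey=value)` configuration string syntax. The helper name `wt_config` is hypothetical and is not part of WiredTiger or MongoDB; the value 8 is simply the default proposed in the table above.

```python
def wt_config(options):
    """Render a (possibly nested) dict as a WiredTiger-style config string.

    Hypothetical helper for illustration only.
    """
    parts = []
    for key, value in options.items():
        if isinstance(value, dict):
            # Nested categories use parentheses: key=(sub=value,...)
            parts.append(f"{key}=({wt_config(value)})")
        elif isinstance(value, bool):
            # WiredTiger booleans are written in lowercase.
            parts.append(f"{key}={'true' if value else 'false'}")
        else:
            parts.append(f"{key}={value}")
    return ",".join(parts)


# Engine-level (wiredtiger_open) setting from the table above.
open_config = wt_config({"lsm_manager": {"worker_thread_max": 8}})
print(open_config)  # lsm_manager=(worker_thread_max=8)
```

A string like this is the kind of value that would be passed through an engine-level configuration string; whether and how MongoDB would validate it is exactly what this ticket leaves open.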

For WT_SESSION::create and WT_SESSION::reconfigure

Correspond to MongoDB collection or index creation and alter.

Setting | Exposed | Default | Reason
lsm=(bloom) | Yes | true | If query performance isn't relevant, bloom filters are wasted effort.
lsm=(bloom_config) | No | Empty | The default settings are the best choice.
lsm=(bloom_bit_count) | Yes | Per WiredTiger (16) | There is a trade-off between bloom filter effectiveness and space usage that will vary depending on user needs.
lsm=(bloom_hash_count) | Yes | Per WiredTiger (8) | As for bloom_bit_count.
lsm=(bloom_oldest) | Yes | Per WiredTiger (off) | Applications that query for items that aren't in the database benefit from having a bloom filter on the oldest chunk; applications that never query for items that aren't in the database don't benefit.
lsm=(chunk_count_limit) | No | 0 | This changes LSM to time out old data (implemented for possible future oplog optimization). It exposes a method of automatic data loss. If MongoDB wants to expose the functionality, I recommend completely separating it from exposing LSM.
lsm=(chunk_max) | Yes | Per WiredTiger (5GB) | Largest chunk that can be generated by a merge (compaction). Very large merges take a long time and a lot of resources. The value set here depends on the underlying hardware and the expected total table size.
lsm=(chunk_size) | Yes | Per WiredTiger (10MB) | The higher the insert throughput, the larger I generally recommend setting this value. On the other hand, an application that creates lots of LSM trees needs to set this size so that at least 3 x chunk_size x LSM tree count fits into cache.
lsm=(merge_max) | No | Per WiredTiger (15) | This is advanced tuning, more likely to lead to confusion than performance gains.
lsm=(merge_min) | No | Per WiredTiger (0) | As for merge_max.
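
Two of the rows above carry arithmetic that is easy to get wrong, so here is a rough sizing sketch. The cache bound is the table's own rule of thumb (3 x chunk_size x LSM tree count). The bloom estimate uses the textbook false-positive approximation for a bloom filter with a given number of bits per item and hash functions; that formula is an assumption brought in for illustration, not WiredTiger's own accounting. The function names and the figure of 200 trees are hypothetical.

```python
import math

MB = 1024 * 1024


def min_cache_bytes(chunk_size_bytes, lsm_tree_count):
    """Lower bound on cache size from the rule of thumb in the table:
    3 x chunk_size x LSM tree count."""
    return 3 * chunk_size_bytes * lsm_tree_count


def bloom_false_positive_rate(bits_per_item, hash_count):
    """Textbook bloom filter estimate: (1 - e^(-k/b))^k.

    b = bits per item (bloom_bit_count), k = hash functions (bloom_hash_count).
    An approximation assumed for illustration only.
    """
    return (1.0 - math.exp(-hash_count / bits_per_item)) ** hash_count


# Hypothetical deployment: 200 LSM trees at the default 10MB chunk_size.
needed = min_cache_bytes(10 * MB, 200)
print(f"cache needed: {needed // MB} MB")  # cache needed: 6000 MB

# Defaults from the table: 16 bits per item, 8 hashes.
print(f"estimated bloom false-positive rate: {bloom_false_positive_rate(16, 8):.4%}")
```

With the defaults, a few hundred LSM trees already call for several gigabytes of cache, which is why chunk_size has to be chosen together with the number of collections and indexes.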