[SERVER-17675] Support file per database for WiredTiger Created: 20/Mar/15 Updated: 22/Jun/22 Resolved: 10/Dec/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage, WiredTiger |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | Mark Callaghan | Assignee: | Brian Lane |
| Resolution: | Won't Do | Votes: | 37 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
When WiredTiger uses a file per collection or index, deployments with a huge number of collections end up with a huge number of files. I want an option to use a file per database for the WT B-tree, and an LSM tree per database for WT LSM. For motivation see |
| Comments |
| Comment by Brian Lane [ 10/Dec/20 ] |
|
I am closing this issue as won't do. After multiple discussions internally, this functionality will not make it to the storage engine roadmap anytime soon. In the short term, we are instead focusing our efforts on improving behaviors when there are a large number of collections. Recently we have completed and backported |
| Comment by Benety Goh [ 13/Oct/20 ] |
|
brian.lane, we discussed this during our morning standup. Do you have any thoughts on this ticket? |
| Comment by Peter Colclough [ 20/Aug/20 ] |
|
Yea.. I was playing 'devil's advocate' when I said Mongo may not be the data store... > One interesting solution we came up with was to have numerous mongo instances... <
Good idea though... at least it gets over the current issues for you... in a way.
|
| Comment by Chad Kreimendahl [ 20/Aug/20 ] |
|
Mongo is definitely the correct data store for us. We've tested what we need on numerous other solutions with horrid performance implications. There are a couple that could perform very close, maybe even slightly better, but then they're 20 times more difficult to program against (shout out to the C# team). There's no reason for us to separate each client onto their own mongo instance. We simply keep track of our instances until they get to about 60k tables, and then roll to the next cluster. We do not host in the cloud, so we have to bring them up each time. One interesting solution we came up with was to have numerous mongo instances running on each server. Since we can stuff the server with a terabyte of memory if we need to, and CPU is never a problem, it works very well. And since we almost never use secondaries, we can also throw several secondaries on each machine knowing they won't cause problems |
| Comment by Peter Colclough [ 19/Aug/20 ] |
|
Chad.. that's interesting. What you are describing is essentially something like 'wordpress', where the application and data are unique per customer. (Not actually like wordpress... but similar in that the data is always customer centric). A couple of points.
I think option (1) is probably better, as that would cover all your requirements, and you will just end up with standalone PSA mongo units per customer. I seriously don't think that any enhancement to WT (such as single files for indexes and/or collections) will get you any improvement in what you are doing, apart from the file handle issue.
|
| Comment by Chad Kreimendahl [ 19/Aug/20 ] |
We don't have a processing limitation. We have an "open filehandles" limitation, it would seem. Our load is rarely above 2.0 (on a 64 core machine) on even our most busy cluster. Our performance absolutely dominates our competition. Our major issue is startup time, and the occasional schema lock on any system with more than about 400,000 "tables". |
| Comment by Peter Colclough [ 19/Aug/20 ] |
|
Thanks all for the replies. As an insight, I am in the process of fixing a similar setup (in PSA mode) for a client, and it's not been easy... they were also about 40 versions behind current. As regards Chad Kreimendahl's example (thanks for this):
1. 'Dev' databases should not be on a production system. They tend to be used differently, and so cause issues with production just by being there. However this does not solve the collective issue... it just puts off the day when something breaks.
2. WT does work differently: essentially it tries to load all indexes into memory, keeping the file open for updates. It also loads as much recently accessed data as it has space for, for quicker access.
3. Let's take that example... and I am 'flying blind' here without the data, but here goes. If you add the 'database name' as a field to each collection, then all those separate databases can go into one DB. This maintains the 'one database per customer' constraint (without the customer being aware) while still enabling the collection to be sharded. It reduces the number of collections and indexes to 50 x 40 x 1 = 2000.
4. Sharding, if no 'obvious' choice exists, can use a simple hash of the _id field. This will produce cardinality and near-equal distribution (assuming the _id is an ObjectId). If using 4.4+, a compound hashed key can be used, covering multiple fields and giving greater cardinality, but that is a case-by-case decision. Sharding effectively produces parallel processing and splits the aggregation and sorting work away from the data nodes (for those in PSS/PSA mode), enabling a separation of concerns: more memory on data nodes can be handed to the core data, while the mongos units can be split between access points and deal with the sorting and further aggregation of the total result set. (A rough sketch of points 3 and 4 follows after this comment.)
The issue with a monolithic index or collection file: I stand to be corrected here, but I believe this will cause more issues (unseen but explainable) than it will solve. If we had a single index file per collection, it would reduce the file handles used, but it could lead to multiple cache swaps as space was made to deal with other queries. Generally only one index is used per query, and the profiler will tell you which one. I would also suspect it would lead to index file bloat, as deleted/updated items would not be handed back to the OS, just reused. This could lead to internal fragmentation of the indices if not handled correctly. Defragmentation would have to occur for efficient usage, and that would also take time; on a heavy-usage site this could be an issue. Essentially the same would occur with collection data, maybe less of an issue as this could be dealt with on disk, but it would still be an issue. We have seen this with MySQL and a single InnoDB file, where data and indexes were all in one file: the time came when it wouldn't load into memory, and things ground to a halt. It's worth also mentioning that index analysis has served me well: what seemed like a good index to create ended up not being used over time, so it could be removed. Yet again, in a monolithic file, dropping that index would cause a rebuild of the file, probably causing a lock on the collection while it happens.
Francisco, I thank you for the explanation; however, Mongo is not like other databases. The logic of splitting down to one database per customer sort of works with relational tech (30+ years old), but not with MongoDB, which attempts to be an 'in memory' DB without being completely in memory. Item (1) may be logical, but once you get to a certain point it becomes illogical, unless you set up multiple clusters. Even in the relational world it causes issues, different issues, but still centred on cache swapping. The major issue with Mongo is the attempt to keep all indexes open across the server and the resulting file handle issue; with 100kk tenants no DB that I am aware of will work in that scenario. Item (2), different collections: for sure, but similarly named collections should have at least some similar data. Data specific to that customer could simply be an attached object in that collection. Others that are created for a given customer only could still resort to the hashed _id shard key method. Item (3) is a concern, I agree. However a hash of the _id field, as a bare minimum, gets you the split of data across shards, assuming the _id is an ObjectId. Is there somewhere else we can have this discussion? Happy to help out further... and thanks for the background everyone.
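A minimal sketch of points 3 and 4 above, assuming a pymongo client pointed at a mongos; the database, collection, and tenant names ("saas", "orders", "acme") are hypothetical, not from the ticket:

```python
# Sketch only: collapse per-customer databases into one shared database by
# tagging each document with its tenant, then shard on a hashed _id.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a mongos in a sharded cluster
db = client["saas"]

# Enable sharding for the shared database and shard the collection on a hashed _id,
# which gives high cardinality and near-equal distribution for ObjectId _ids.
client.admin.command("enableSharding", "saas")
client.admin.command("shardCollection", "saas.orders", key={"_id": "hashed"})

# Instead of one database per customer, every document carries its tenant.
db.orders.insert_one({"tenant": "acme", "sku": "A-100", "qty": 3})

# Keep tenant-scoped queries cheap with a compound index led by the tenant field.
db.orders.create_index([("tenant", 1), ("sku", 1)])
print(db.orders.count_documents({"tenant": "acme"}))
```

This is only one way to realise the "tenant as a field" idea; whether it fits depends on how uniform the per-customer collections actually are.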
|
| Comment by Francisco Alejandro Lozano López [ 19/Aug/20 ] |
|
Thanks for your insight Peter. It's a "design issue" with regard to how MongoDB works internally with WT, but it's not a "design issue" if you take into account that:
If you had those requirements a few years ago and were to fulfil them with MongoDB, a similar design could probably come up - especially if you designed your system at a time when MMAPv1 mostly worked with this approach, and MongoRocks was a thing and totally worked with this approach. Now, both MMAPv1 and MongoRocks are mostly gone, and the remaining engine (WT) has been integrated in MongoDB in a way that works against this approach. This issue is about improving that situation. Now, if you were to fulfil these requirements with MongoDB today, you could probably come up with a different design, which would probably be much less intuitive. But even with a new clean design, you could fix items 1 and 2 above - but I can't think of a way to fix item 3 in a manageable way. |
| Comment by Nathan Thiya [ 19/Aug/20 ] |
|
Chad Kreimendahl summarized it really well. If you have 500 customers, 50 collections per database, and 10 indexes per collection: 500 x 50 x 10 = 250,000 files, whereas this could be just 500 files with one file per database (a quick back-of-the-envelope calculation follows below). If and when this is implemented, the developers have to keep performance in mind. We don't want to compromise performance for this feature. |
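The arithmetic above, written out; the file-per-database figure is the hypothetical layout this ticket asks for, not current behaviour:

```python
# Pure arithmetic for the example above: current file-per-collection-and-index
# layout vs. a hypothetical file-per-database layout.
dbs, colls_per_db, idxs_per_coll = 500, 50, 10

index_files = dbs * colls_per_db * idxs_per_coll   # 250,000 index files (the number cited above)
collection_files = dbs * colls_per_db              # plus 25,000 collection files
file_per_database = dbs                            # hypothetical layout: 500 files total

print(index_files + collection_files, "vs", file_per_database)
```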
| Comment by Chad Kreimendahl [ 19/Aug/20 ] |
|
What if you have tens of thousands of customers that need their database logically separated, which would involve separate mongo databases? And in each one of those databases the customer can configure any number of pieces of data that may contain millions of records or just one or two. Since each data type they can configure is unique, and some customers have hundreds of these configurations, you necessarily end up with millions of tables. Tables, in mongo terms, being a combination of collections and indexes. So let's shrink it a little bit. Let's say you have 500 customers. That gives you about 800 databases between development instances and other miscellaneous databases. And in each database you have 40 unique data types. Within those data types you have, say, 50 fields, all of which need to be indexed. 40 x 50 x 500 = 1,000,000 |
| Comment by Peter Colclough [ 18/Aug/20 ] |
|
Apologies in advance, having just found this ticket while researching for a client. Is this ticket about getting WT to fix a bunch of database design issues? I am amazed at figures like 'millions of databases/collections/indexes'. I totally understand that development constructs can end up in production, but I would be truly interested to understand how it gets to this state. I use MongoDB and Elasticsearch. I have built multi-terabyte (petabyte in one case) systems spanning multiple datacentres, as fully sharded systems. It is possible to shard any mongo system that I am aware of, which gets around 90% of the issues people seem to be having. From a practical perspective I would be interested in hearing of any systems that can't be sharded, and in seeing the reasons why. FTR, I am an independent consultant, not a Mongo employee... just intrigued to see how these systems have come about. Mainly also because asking Mongo to fix a DB to get over a 'localised issue' seems a bit unfair. |
| Comment by Francisco Alejandro Lozano López [ 15/Jun/20 ] |
|
Hi, I see the status of the issue changed to INVESTIGATING. Is there any chance this feature is being considered again? |
| Comment by Brian Lane [ 29/Jan/19 ] |
|
Hi flozano, This issue remains in our product backlog and is currently not scheduled for a particular release. There was some work done in this area, but it was deprioritized to complete other issues for the release of 4.0. We would also like you to be able to migrate to WiredTiger, and we understand your issue with such a large number of collections. -Brian |
| Comment by Chad Kreimendahl [ 24/Jan/19 ] |
|
Michael, our method of dropping databases without locking the whole system is to individually delete each collection until they are all deleted, wait a few seconds, then delete the database (a minimal sketch follows below). You end up with a significantly shorter delete cycle. The deletion process seems to take time proportional to the total number of collections + indexes, so 9 indexes in 1 collection takes as long to delete as 2 collections with 8 total indexes. Since we're significantly index-heavy, our drop was taking over 8 minutes on most databases. With our new drop method it takes more than 8 minutes to totally wipe a database, but you can run 2-3 drops concurrently without major impact. |
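A minimal pymongo sketch of the workaround described above; the connection string and the database name ("tenant_042") are hypothetical:

```python
# Sketch of the drop workaround: drop each collection individually,
# pause briefly, then drop the (now mostly empty) database.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["tenant_042"]

for name in db.list_collection_names():
    db.drop_collection(name)   # also removes that collection's index files

time.sleep(5)                  # wait a few seconds, as described above
client.drop_database("tenant_042")
```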
| Comment by Francisco Alejandro Lozano López [ 24/Jan/19 ] |
|
Is there any change in the status of this topic, asya? Or has file per database been discarded? |
| Comment by Francisco Alejandro Lozano López [ 24/Jan/19 ] |
|
With WT I doubt very much you can achieve 8 million collections, as you would have to handle millions of open files. I'd love to be corrected by the MongoDB team. |
| Comment by Vipin Garg [ 04/Jan/19 ] |
|
flozano, how is your system working, and what challenges are you facing in your environment? We are also going to have an environment with almost 8000 databases and 8 million WT tables in total, so we are looking for some recommendations.
|
| Comment by Michael [ 25/Jul/18 ] |
|
We have similar issues. In our test environment we have thousands of databases at any given time, each with a few hundred collections in them. We try to periodically clean up the databases, but it usually takes more than a minute to drop a single database. We're literally creating databases faster than they can be dropped. |
| Comment by Francisco Alejandro Lozano López [ 23/May/18 ] |
|
Basically we have:
Dealing with this layout in WiredTiger is impossible with the current file-per-collection/index approach. |
| Comment by Asya Kamsky [ 22/May/18 ] |
|
flozano can you provide some approximate sizes in terms of number of databases, collections and indexes in a typical (or maybe largest) of the replica sets you have?
|
| Comment by Francisco Alejandro Lozano López [ 16/May/18 ] |
|
Similar to the comment above, we already shard at app level, mostly per tenant (so the whole tenant sits in a single replica set), but in other cases also by other parameters. MongoDB-level sharding alone won't help here (we actually do MongoDB sharding in other use cases, but not here). The problem is the large number of collections on a single server (and the associated collection indexes), not necessarily from a single tenant. We don't know beforehand the collections that will be created, similar to above. This model worked relatively well with MMAPv1, and worked well with MongoRocks (there is a significant memory footprint that scales linearly with the number of collections, but we could live with it). However, it won't work with the current WiredTiger approach of file-per-collection-or-index. Decreasing the number of indexes will help but will not fix it in our case, as it's related more to the number of collections. We would love to be able to move to WiredTiger, as it clearly seems to be the way forward in the MongoDB ecosystem. Every other storage engine is being deprecated (PerconaFT, Rocks, MMAPv1), so some additional flexibility in the WiredTiger storage engine would be greatly appreciated, in order to better accommodate the models that were previously covered by the alternative engines.
|
| Comment by Chad Kreimendahl [ 16/May/18 ] |
|
For us, the issue is almost entirely caused by indexes. Here's stats for just 1 of our production databases:
On this particular system, we use what is currently the world's fastest SSD. The time for mongo to restart, on the rare occasion it has to, is just over 6 minutes. We've just recently migrated this specific database to 3.6. We're also noticing quite a few slow requests at the same time that flushes happen. This may also be tied to the number of files. Specifically, we can switch the master to a 3.4 database in the cluster and all the slow requests go away, but on 3.6 we get short bursts of slow requests (2-3 seconds when normally 20-30ms), all seeming to happen during large flushes. (New bug to come on this once we have more detail.)
I do think that if you have a way to reduce the number of indexes, that may solve a ton of our problem. The problem is files (to a very large degree), so anything that makes fewer of those is a positive. Sharding would be nearly impossible for us. We don't know the collections that will exist. Most collections have fewer than 1,000 records. Sharding is also nearly pointless, as we just distribute our customers over multiple replica sets. It's easier to do for several reasons: 1. backups; 1a. restores; 2. startup time; 3. management; 4. configuration. I would say, if you could auto-shard and scale like Elastic, we'd change our model almost immediately. We use Elastic for a bunch of internal stuff, and despite its limitations, the ability to scale and shard automatically is exactly what mongo needs. |
| Comment by Asya Kamsky [ 16/May/18 ] |
|
There has been a lot of work done to improve performance with WiredTiger and a large number of collections. Is the issue a large number of collections on a single server, or a large number of files (i.e. a large number of indexes)? We have a project in the works that may allow significantly decreasing the number of indexes in a collection. We do have a lot of customers that use a DB per tenant, but many of them shard once the number of tenants grows large, which allows distributing the databases across multiple shards, resulting in fewer collections on each server.
|
| Comment by Francisco Alejandro Lozano López [ 11/May/18 ] |
|
Thanks. The recent deprecation of MongoRocks by Percona [1] leaves applications which make heavy use of collections (e.g. multi-tenancy) in a really bad place, with no other option than to squeeze `ulimit` (a quick way to inspect the current limit is sketched below). Is there any chance MongoDB will reconsider this feature?
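For reference, one way to see how far the open-file limit has already been squeezed on the mongod host, using only the Python standard library (Unix only; this is a sketch for inspection, not part of any MongoDB tooling):

```python
# Inspect the soft/hard open-file limits of the current process (Unix only).
# Raising them for mongod is done via ulimit/systemd settings, not from Python.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limits: soft={soft}, hard={hard}")
```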
|
| Comment by Asya Kamsky [ 25/Apr/18 ] |
|
This feature is currently in the backlog and not scheduled for any specific future release. |
| Comment by Francisco Alejandro Lozano López [ 25/Apr/18 ] |
|
Is this still scheduled for 3.8? |
| Comment by Dieter Guendisch [ 25/Apr/18 ] |
|
Same issue for us: we have some 10k up to 100k tenants, each having 3-5 collections and a bunch of indexes, thus the current WiredTiger file approach doesn't work for us, and we can only stick with MMAPv1. |
| Comment by Chad Kreimendahl [ 19/Jul/17 ] |
|
This is actually an extreme issue for us. We have one replica set that currently has 480,309 files; 439,437 of those files are for indexes alone. If you could simply make it 1 file per collection's entire index-set, that would do an insane amount of good for us. We only have 40,824 collection files, which in theory would take us to roughly double that, ~82k files vs ~480k (a rough way to count these files is sketched below). It's all well and good to change ulimits for open file handles to > 500k, but it makes us approach practical limits for the OS / CPU. This specific cluster would be able to handle 5 times as many customers, without issue, with my described change. I understand it's not that easy, but I hope you can find a way to get it back into 3.6 or even 3.4.x. |
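A small sketch of how per-replica-set file counts like the ones above can be reproduced, assuming the default WiredTiger on-disk naming (collection-*.wt and index-*.wt, no directoryPerDB/directoryForIndexes) and a hypothetical dbPath:

```python
# Count WiredTiger data files in a mongod dbPath and split them into
# collection vs. index files. The dbPath below is hypothetical.
import pathlib

db_path = pathlib.Path("/var/lib/mongodb")

wt_files = list(db_path.rglob("*.wt"))
collection_files = sum(1 for p in wt_files if p.name.startswith("collection-"))
index_files = sum(1 for p in wt_files if p.name.startswith("index-"))

print(f"total .wt files: {len(wt_files)}")
print(f"collection files: {collection_files}, index files: {index_files}")
```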
| Comment by Chad Kreimendahl [ 10/Jul/17 ] |
|
Looks like that's a ways away, given no 3.6 yet? |
| Comment by Ramon Fernandez Marina [ 10/Jul/17 ] |
|
You are correct sallgeud, this work is now scheduled for MongoDB 3.8. Regards, |
| Comment by Chad Kreimendahl [ 06/Jul/17 ] |
|
Did this move out of the next version? |
| Comment by Francisco Alejandro Lozano López [ 30/May/17 ] |
|
That would be amazing. Really. |
| Comment by Asya Kamsky [ 30/May/17 ] |
|
At the moment we are planning to have (as an option) a file per DB that would contain all collections for that DB and a second file for the indexes. |
| Comment by Francisco Alejandro Lozano López [ 26/May/17 ] |
|
Asya, that would be great news for us too. Would you mind explaining a little about what model you will follow? File per DB? File per collection+index? |
| Comment by Asya Kamsky [ 25/May/17 ] |
|
We are in fact making progress towards having fewer files per database in Wired Tiger. Asya |
| Comment by Chad Kreimendahl [ 25/May/17 ] |
|
Since there appears to be motion on this, I'd like to add more detail. Our problem right now is almost entirely the number of files created to accommodate indexes. While one file per database might be nice for some, I'd imagine that some of the performance benefits of a unique file per collection would be lost. However, I would absolutely love to be able to reduce the index files from 1 per index to 1 per collection. I'm not saying merge the index file into the collection file... simply allow the option to have 1 file for each collection's entire set of indexes. We average 30+ indexes per collection, so this would be huge for us. |
| Comment by Chad Kreimendahl [ 02/Aug/16 ] |
|
I'd be interested in simply allowing one file per collection's index set. If we could get this configurable to do 1 file per collection and 1 file per collection's indexes, that would be excellent, and it would entirely remove any issues we currently have. |
| Comment by Daniel Doubrovkine [ 03/Jun/16 ] |
|
Coming from |