[SERVER-164] Option to store data compressed Created: 20/Jul/09  Updated: 08/Feb/23  Resolved: 05/Nov/14

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: 2.8.0-rc0

Type: New Feature Priority: Minor - P4
Reporter: Ask Bjørn Hansen Assignee: Eliot Horowitz (Inactive)
Resolution: Done Votes: 280
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-15953 Integrate WiredTiger storage engine Closed
Duplicate
is duplicated by CDRIVER-39 Need a new feature of "compression" i... Closed
is duplicated by SERVER-9782 Compress Oplog Closed
Related
related to SERVER-863 Tokenize the field names Closed
related to SERVER-15974 Vendorize zlib 1.2.8 Closed
Participants:

 Description   

Compression is now supported by the new WiredTiger storage engine.
Snappy disk compression is enabled by default, and additional compressors and per-collection options are configurable.
See SERVER-15953 for more information.
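
For reference, a minimal sketch of enabling a different block compressor on a single collection (the collection name is illustrative; a server-wide default can also be set via storage.wiredTiger.collectionConfig.blockCompressor):

  // Hedged example: create a collection that uses zlib instead of the default
  // snappy block compressor; requires a WiredTiger-backed mongod.
  db.createCollection("events", {
    storageEngine: { wiredTiger: { configString: "block_compressor=zlib" } }
  });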

WAS:
When storing textual data (and having more CPU than I/O capacity), it'd be nice to have an option to store the data gzip-compressed on disk.



 Comments   
Comment by Kay Agahd [ 28/Mar/14 ]

Using TokuMX 1.4 we get a 10x compression factor and 5x the write throughput.

Comment by Ben McCann [ 28/Mar/14 ]

Thanks for sharing! It's great to see numbers around it. To share some of our benchmarking, we've been testing TokuMX and have found that we get a 5x compression factor and 3x write throughput, which is a pretty awesome win.

Comment by James Blackburn [ 28/Mar/14 ]

We've done some benchmarking, yes.

We're running ZFS on Linux 0.6.2 on RHEL 6, and this setup is very new here. Throughput, with an I/O-bound workload, is nearly identical to a replica set backed by ext4. Though this should be taken with a pinch of salt as:

  1. the application we're running is fairly data intensive, so we already do in-app compression with lz4
  2. The databases are very new and the traffic is mostly write only

(1) gives us end-to-end I/O savings, including network traffic and memory load on the MongoDB servers. With (2) it's not clear how, or whether, performance will degrade as the DB ages and ZFS's copy-on-write nature leads to fragmentation. Given that MongoDB's workload is essentially random read I/O anyway, I'm hoping it won't be too bad, but time will tell.

As we already do compression in the app, ZFS gives us a compression factor of only ~1.1x on these MongoDB databases. For normal databases (e.g. the configdb) and home directories we get a 2x - 10x compression factor.

Edit: although the setup is new, we've put >8TB of data into it, and soak tested full I/O bound reads for a few days with nothing blowing up.

Comment by Ben McCann [ 28/Mar/14 ]

James, did you benchmark performance with compressed ZFS at all?

Comment by James Blackburn [ 28/Mar/14 ]

We've got a large MongoDB instance running on top of ZFS on linux using lz4 compression. It seems to work well - at least we haven't had any problems related to the filesystem.

Comment by Mete Kamil [ 14/Feb/14 ]

Any news on this as far as a compression option for MongoDB?

Comment by Raj Bakhru [ 23/Sep/13 ]

For those that need a solution sooner (as we did), TokuMX was a near drop-in replacement that dropped our data size to about 15% of what Mongo took to store it (and it's free/GPL2). We've been quite impressed by it. We had to alter/remove some map/reduce we used and cursor counts, but otherwise performance has been on par. Would suggest trying it out.

Comment by Eliot Horowitz (Inactive) [ 23/Sep/13 ]

Khalid: that's odd. Sounds more like a bug with NTFS than a "bad idea".
File system compression should be completely transparent to any users.

Roger: Mostly ZFS, on a variety of operating systems. ZFS on Linux, while not in the kernel, is not considered stable by its authors.

We definitely plan on building this into the server, but it will not be done until we build a new storage engine, which will come in one of the next few releases, though definitely not the next one.

Comment by Khalid Salomão [ 22/Sep/13 ]

Hi Eliot, I had a bad experience of corruption when I used filesystem compression on Mongodb data files on NTFS. After some investigation, I discovered several articles (from Microsoft and other sources) discouraging this kind of usage.

Is there any plan to implement this great feature?

Comment by Roger Binns [ 22/Sep/13 ]

@Eliot: for Linux, which filesystems are used? XFS and ext4 don't support compression, and btrfs does but files have to be COW. There are some reports of zfs working but it isn't formally supported on Linux.

Comment by Eliot Horowitz (Inactive) [ 22/Sep/13 ]

Khalid, many people use MongoDB on top of filesystems that use compression very successfully.

There are many good options, so this should be considered a very viable option.

Comment by Khalid Salomão [ 21/Sep/13 ]

Compression is a must! There are very good, open, and fast algorithms like LZ4 and Snappy.

August, the problem with filesystem compression on NTFS is that it should not be used with memory-mapped files, which is how mongo uses its data files... It could lead to corruption...

The best option would be to have this implemented in the MongoDB core.

Comment by august bering [ 21/Sep/13 ]

Depending on the file system you use, the space really used on the drive might be much less than the file size (I was taken by surprise by this on an ext3 fs), so make sure to see how much space is really used. The db files initially contain a lot of zero padding, which might be optimized away by the filesystem.
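
As a hedged aside, db.stats() can show the logical vs. allocated sizes before you compare against actual on-disk usage (e.g. du on the dbpath); the scale argument below reports the figures in megabytes:

  // Scale the reported sizes to megabytes. fileSize is the allocated size of
  // the data files, which can be much larger than the blocks actually used on
  // a filesystem that compresses or sparsely allocates the zero padding.
  var s = db.stats(1024 * 1024);
  print("dataSize (MB):    " + s.dataSize);
  print("storageSize (MB): " + s.storageSize);
  print("fileSize (MB):    " + s.fileSize);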

Anyway, see https://jira.mongodb.org/browse/SERVER-863, mentioned in the previous post, for a Java driver approach that could potentially save you a lot of space as well as bandwidth.

I also experimented with my own compressed filesystem, tailor made for mongodb: https://github.com/augustbering/ABFS
Of course, this is only relevant if you're on linux, but NTFS might have some compression scheme that plays well with large files, I don't know.

Comment by Stuart Johnson [ 21/Sep/13 ]

Me too. SSD drives are not all that big.

https://jira.mongodb.org/browse/SERVER-863 will also make a big impact on space for some.

Comment by Roger Binns [ 21/Sep/13 ]

It is the one glaring problem for me. My collections use almost twice the space of a JSON mongoexport of the same collections: about 410GB in mongo storage, 230GB as JSON, and 8GB as a 7zipped mongodump backup. Not only are the values compressible, but the same keys are used across all documents.

This amount of storage consumption means that far more has to be provisioned for mongo; it results in more I/O activity, longer backups, and reduced concurrency (more memory and I/O, etc.), i.e. it has knock-on effects on most other parts of mongo.

Comment by Pius Onobhayedo [ 20/Sep/13 ]

Any roadmap yet for built-in compression support? I think that it is long overdue.

Comment by Raj Bakhru [ 28/Jul/13 ]

We just tried out TokuMX, which modifies the storage mechanism of Mongo but externally appears the same as Mongo (so no application layer changes), and our database size on disk dropped from about 250gb to 25gb.

Anyone know of any Toku pitfalls? Otherwise would be great to see their work (open source) absorbed into Mongo.

http://www.tokutek.com/products/tokumx-for-mongodb/

Comment by Raj Bakhru [ 28/Jun/13 ]

Is there any update as to timeline for this feature? Or any implementation details (will there be field name tokenization, snappy or zlib value compression, etc?). It seems like the thread has gone quiet.

Also, I find it incredible that this is marked as 'Minor' priority. This is the biggest pitfall of Mongo right now.

Thanks!

Comment by Alex [ 08/Jun/13 ]

It would be nice to have column-based compression.

Comment by Daniel Petrak [ 06/May/13 ]

It could be nice to at least have a --compress option in mongodump.

Comment by Vackar Afzal [ 05/Apr/13 ]

Compression of arrays would also be amazing. I'm currently using MongoDB as a columnstore, with a document-per-column model. However, as we move into millions of values per column, we hit the 16MB document limit. I've found that compression (especially for text) can reduce sizes from ~200MB to about 5MB. My current solution is that when we hit the limit, we compress the chunk and store the compressed version as a blob. The ideal solution would be to have compression supported in the database, so I could take advantage of random access (in the database) to positions within the arrays themselves.
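
Not the poster's actual code, but a minimal sketch of that workaround using the Node.js driver and zlib (collection and field names are illustrative):

  const zlib = require('zlib');
  const { Binary } = require('mongodb');

  // Compress a column chunk client-side and store it as a single BinData blob.
  async function storeColumnChunk(db, columnId, values) {
    const blob = zlib.gzipSync(Buffer.from(JSON.stringify(values)));
    await db.collection('chunks').insertOne({
      columnId: columnId,
      count: values.length,
      data: new Binary(blob),
    });
  }

  // Fetch and decompress the chunk back into the original array of values.
  async function readColumnChunk(db, columnId) {
    const doc = await db.collection('chunks').findOne({ columnId: columnId });
    return JSON.parse(zlib.gunzipSync(doc.data.buffer).toString());
  }

The trade-off noted above still applies: random access to individual positions requires decompressing the whole chunk in the application.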

Comment by Stuart Johnson [ 29/Mar/13 ]

Tokenizing the field names, as in https://jira.mongodb.org/browse/SERVER-863,
is bound to have a positive impact in reducing storage space for most use cases. It also means you can be more descriptive in your field names without worrying about space. Vote for it.

Comment by Juho Mäkinen [ 20/Mar/13 ]

Even compression of just the document keys would be very useful. We originally tried to use key names as short as possible to save some space, but they make coding and understanding the data a lot harder. Now we just use descriptive keys and deal with the additional document size by paying bigger bills for disk space, because it results in fewer bugs.

Comment by Balthazar Rouberol [ 28/Feb/13 ]

Agreed, that'd be very useful.

Comment by Kevin J. Rice [ 25/Feb/13 ]

We're storing lists of floats. Actually, it's an array of tuples [ (timestamp1, value1), (timestamp2, value2), ...]. This takes 30 (!) bytes per datapoint when it could take 8 (4 bytes per value). It would be nice to have some compression that would handle this, since frequently value1, value2 are integers instead of floats and nearly the same value each time period. A text zip would not work for this (but would be great in general).

In BSON, it would seem to be a win to have the ability to pre-specify that you're encoding an array of numerics, in exchange for giving up the ability to throw a text value into the array. Attempting to add a non-numeric to this BSON object would result in a failure to add.

I'm willing to help code this if I could have a pointer into where the code is, how to compile and create unit tests, and how to integrate this code into Mongo's append-to-array $add/$pushAll operator.
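
A hedged sketch of what an application can already do along those lines: pack the (timestamp, value) pairs into a single BinData field instead of a BSON array of sub-arrays, which avoids the per-element type bytes, decimal-string index keys, and sub-array headers that account for the ~30 bytes per datapoint (this packing is purely illustrative, not a BSON feature):

  const { Binary } = require('mongodb');

  // 16 bytes per datapoint: two little-endian doubles, no per-element overhead.
  function packSeries(pairs) {
    const buf = Buffer.alloc(pairs.length * 16);
    pairs.forEach(([ts, value], i) => {
      buf.writeDoubleLE(ts, i * 16);
      buf.writeDoubleLE(value, i * 16 + 8);
    });
    return new Binary(buf);
  }

  function unpackSeries(bindata) {
    const buf = bindata.buffer;
    const pairs = [];
    for (let off = 0; off < buf.length; off += 16) {
      pairs.push([buf.readDoubleLE(off), buf.readDoubleLE(off + 8)]);
    }
    return pairs;
  }

The downside is the one raised in the comment: the server can no longer reach into the packed field with query operators or $push, so appends have to rewrite the blob.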

Comment by homerl [ 25/Feb/13 ]

Agreed, very useful.

Comment by Khalid Salomão [ 03/Dec/12 ]

Compression would be very good to reduce storage cost and improve IO performance.

LZ4 is also a good bet. Its performance is similar to Snappy's, but it also has a high-compression mode.

Comment by Dwight Merriman [ 07/Nov/12 ]

So (perhaps this is consistent with Jeremy's comment) one model would be to compress not whole documents, as that potentially causes overhead if you then want to change a tiny region like an integer, but rather to compress very large fields. Imagine you have a BinData or string field that is 16KB in a doc, and that field (only) is compressed. Just an idea.

BTW, MongoDB already uses the Snappy compression library to compress the data it writes to the journal files.

Also, in v2.0 there was a lot of work done on "compacting" b-tree index keys. I say compacting because it is a little different from large-scale compression: the way the keys are represented, they remain range-comparable, just in a more efficient format. The indexes in v2.0 tend to be 25% smaller than in v1.8, which is a good improvement, especially considering that, if anything, operating on them is CPU-wise faster in 2.0.

Comment by Nick Gerner [ 30/Aug/12 ]

I believe that HBase (and certainly BigTable) uses on-disk compression to improve performance. Something like Snappy (or LZO) has excellent (small) CPU utilization. It sounds like Cassandra has compression too, and some traditional SQL databases (e.g. InnoDB in MySQL) support compression. So this seems like an issue that makes MongoDB less competitive.

And in many applications ideal for MongoDB you're not doing a ton of random writes anyway. So the extra cost of re-writing and compressing blocks of records is a non-issue.

Comment by Ben McCann [ 11/Jun/12 ]

Agreed that an option per collection makes sense. I'm more concerned about reads than writes for my particular workload, so I don't really care that I'd have to re-write more. An option that the user could leave off if they have an update heavy workload seems like a fine solution. If I can compress 50% then I can use half as many Linodes or Rackspace Cloud machines and so my hosting bills get cut in half. This is a really huge need for me because it has such a large financial impact on my company.

Comment by unbeknownst [ 16/Apr/12 ]

Compressing documents could be an option when creating the collection, similar to the way capped collections work, for example:

db.createCollection("mycoll", { compressed: true, format: "gzip" })

Comment by Eliot Horowitz (Inactive) [ 16/Apr/12 ]

One of the big problems with compressed documents is that if you modify a single field, you have to re-write a lot more. For append-only data sets it's clearly better. Where there are a lot of updates, it's trickier.

Comment by Ben McCann [ 16/Apr/12 ]

When Cassandra added Snappy compression (http://code.google.com/p/snappy/) they found that it actually saved CPU cycles because dealing with smaller data meant that they could get it onto and off of disk faster.

Comment by Hector [ 24/Jan/12 ]

+1 on LZO here.

Comment by august bering [ 09/Dec/11 ]

See this comment for some real numbers https://jira.mongodb.org/browse/SERVER-863?focusedCommentId=71900#comment-71900

Comment by Eliot Horowitz (Inactive) [ 02/Dec/11 ]

I've played with compression (and encryption) with ZFS and it worked well in some basic testing.

Comment by august bering [ 02/Dec/11 ]

Has anyone tried using file system compression? I've done some tests that show about a 4:1 compression ratio for my data.

Comment by Chris [ 10/Nov/11 ]

Sorry MongoDB team, but the priority of this ticket is way too low! You claim scalability? Well, size is one factor.

Please increase the priority of this option for people who (like me) run out of space.

Comment by Michael D. Joy [ 19/Aug/11 ]

The way Oracle handles this is transparent to the database server, at the block-engine level. They compress the blocks similarly to how SAN stores handle it, rather than at the record level. They use zlib-type compression and the overhead is less than 5 percent. Due to the reduction in I/O, both in the number of blocks touched and the amount of data transferred, the overall effect is a cumulative speed increase.

Should MongoDB do it this way? Maybe. But at the end of the day, the architecture must make Mongo more scalable, as well as increase the ability to limit the storage footprint.

Comment by Christopher Price [ 09/Aug/11 ]

Ingres VectorWise touts a snazzy on-chip compression/decompression approach to performance optimization, something that by necessity would need to be server-side. I would love to see something akin to that here. I'm currently stuck with overly verbose field names and could use all the help I can get. I would think that uncompressing data in memory (chip memory at that) would allow at least 5x the storage capacity (RAM and hard drive) and still be flying tons faster than having to go to disk for the other 4x. Not to mention the ability to keep 5x the data in RAM.

I don't really care how it gets done but would love to see this feature soon.

Comment by csbac [ 14/Jul/11 ]

Hi!
We would also be very much interested in this feature. In our case, each BSON document has about 25kB of data: short 4-char field names and double values (we'll probably replace the doubles with fixed-point ints; there are no 2-byte shorts, are there?), with a few thousand of those entries per document.
It's recorded process data.

To implement such a feature, whereabouts would I have to look in the source code?
AFAIK, the data is stored directly as a memory-mapped file. Another level of indirection would be necessary to keep the BSON in a temporary space after receiving it from the client and parsing it (for index generation and such). Then, before saving it, it would need to be compressed.

When fetching, the only difference would be that the data has to be uncompressed from the file into some temporary space, again, instead of directly accessing the BSON structure.
I expect the indices and document references only point to the beginning of the BSON document, so the way the document itself is stored would not have such a large impact on the system...?

If compressed on the client side, the server would no longer be able to understand the (non-)BSON ... indices would have to be defined on uncompressed fields, and a query into the BSON would no longer be possible. What we usually need is to enter data at a more or less fixed time interval,
but to query over a time interval with only one or two of the thousand values needed. MongoDB is great because we can do this, but after client-side compression this would no longer be possible.

Well, I'll have a look at the code,
Yours,
Sebastian

Comment by Swapnil Tailor [ 21/Jun/11 ]

We have our normal data in MongoDB and are looking to store data older than a few months on another MongoDB instance in compressed format.
This seems like a much-needed feature, which would help many people archive their existing MongoDB data.

Comment by Karoly Horvath [ 10/Jun/11 ]

I would like to see field name (dictionary key) compression:

Provide some kind of an interface where users can enumerate all the available field names (including field names in embedded docs).
With this technique all field names could be stored as integers; in my db I have fewer than 256 different field names, so a byte would suffice.

The proposed document compression could be implemented on top of this (if needed/enabled).

I have a feeling this technique would be more efficient for small document sizes.
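
A hedged, purely client-side illustration of the idea (BSON field names have to be strings, so this uses one-character string codes rather than true integers, and it only maps top-level fields; the dictionary is hypothetical and maintained by the application, not by MongoDB):

  // Hypothetical field-name dictionary, agreed on by all writers and readers.
  var FIELD_CODES = { customerName: "a", shippingAddress: "b", orderTotal: "c" };
  var FIELD_NAMES = {};
  for (var n in FIELD_CODES) FIELD_NAMES[FIELD_CODES[n]] = n;

  // Swap descriptive keys for short codes before insert...
  function encodeKeys(doc) {
    var out = {};
    for (var k in doc) out[FIELD_CODES.hasOwnProperty(k) ? FIELD_CODES[k] : k] = doc[k];
    return out;
  }

  // ...and restore them on read.
  function decodeKeys(doc) {
    var out = {};
    for (var k in doc) out[FIELD_NAMES.hasOwnProperty(k) ? FIELD_NAMES[k] : k] = doc[k];
    return out;
  }

  db.orders.insert(encodeKeys({ customerName: "Ada", orderTotal: 42 }));

Queries and indexes would of course have to be expressed in terms of the coded names, which is exactly the pain a server-side dictionary would remove.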

Comment by sampathkumar [ 26/Apr/11 ]

Hi Mongo developers, I am still waiting for your compression feature. Please let me know the status...

Comment by Andrew Armstrong [ 14/Apr/11 ]

Google recently open-sourced 'snappy', an internal compression system it uses that favors being super fast over necessarily compressing the best.

May be of interest, http://code.google.com/p/snappy/

Comment by Nathan Ehresman [ 14/Apr/11 ]

We have a few situations where this would be incredibly useful. Has there been any more discussion and thought about this? What would it take to implement this? Any estimate as to how difficult it would be?

Since MongoDB performs fantastically as long as the working data set can fit in RAM, it seems to me like this would be a very valuable feature.

Comment by Roger Binns [ 12/Oct/10 ]

See also http://github.com/antirez/smaz which is designed to work on short text strings. Effectively it has a precomputed dictionary based on normal English text.

One advantage is that the same text will always give the same compressed results whereas other algorithms would depend on what they had seen before, or have too much overhead for short strings. Consequently this is a good algorithm for key names and text fields if the Mongo implementation compresses those individually rather than entire documents as a whole.

Comment by unbeknownst [ 09/Oct/10 ]

I would suggest having a look at using Blosc for compressing the data: http://blosc.pytables.org/
It is already in use by another high performance database: http://www.pytables.org/

Comment by yjl [ 09/Oct/10 ]

I think we not only need the text content to be gzip-compressed,
but also need Zippy or BMDiff column compression like Google's Bigtable.

Comment by Eliot Horowitz (Inactive) [ 28/Aug/10 ]

There are 3 things that could happen on updates (illustrated by the shell commands after the list):

  • Modify a single field in the document without changing its size ($inc, for example): we don't parse the entire doc and re-save it, we just modify the relevant bytes. FASTEST
  • Load the doc, modify it, and save it back to the same place, so no index changes. FAST
  • Load the doc, modify it, and have to save it to a different place. SLOWER
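
Illustrative shell commands for the three cases (collection and field names are hypothetical; which path a given update takes depends on whether the document still fits in its allocated record):

  // 1. Same-size, in-place modification: only the affected bytes change.
  db.counters.update({ _id: 1 }, { $inc: { hits: 1 } });

  // 2. Document is rewritten but still fits its current record, so index
  //    entries do not need to point to a new location.
  db.counters.update({ _id: 1 }, { $set: { status: "ok" } });

  // 3. Document grows past its allocated record and has to be moved, which
  //    also means updating its index entries.
  db.counters.update({ _id: 1 }, { $push: { history: { ts: new Date() } } });
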
Comment by Thilo Planz [ 28/Aug/10 ]

"Just compressing the values would be fine and quite efficient the data set that spurred this request in the first place (compressing the full document as one block would break the inplace updates and all that)."

How does in-place update work now? I was assuming it was replacing the complete document, reusing the same storage location. If so, compressing the full document would not break this.

Does in-place update really update values inside an existing document? That seems possible only for changes to fixed-length data (like integers) and only for the atomic update modifiers (otherwise you'd not know what has changed).

So, except for having to uncompress the data to "reach into the document" for filters (increased CPU cost), does this have any other negative impact? Access by index should work just the same as it does now, and if the drivers support client-side decompression, one would also save on network i/o.

Comment by Leon Mergen [ 15/Jul/10 ]

This feature would be very nice, especially if the decompression would happen at the client level. This would make storage requirements, RAM usage and data transfer speed more efficient.

Comment by Ask Bjørn Hansen [ 04/Jun/10 ]

While I wrote "gzip compressed" originally, LZO as Jeremy mentioned would be great.

Just compressing the values would be fine and quite efficient for the data set that spurred this request in the first place (compressing the full document as one block would break the in-place updates and all that).

Comment by Jeremy Zawodny [ 17/May/10 ]

I'll add my 2 cents here too. We deal with a lot of text that's easily compressed. I suspect that with LZO we'd easily see 4:1 compression on our data, and for the dataset I'm sizing right now that'd be the difference between ~5TB and ~1-2TB to store everything. Even though we'll shard the data, it'd be nice to reduce the footprint. It'll certainly not be CPU-bound in this particular workload.

Comment by Andy [ 07/May/10 ]

This is a very important feature.

SSDs are getting more and more common for servers. They are very fast. The problems are high cost and low capacity: a 64GB X25-E costs $800.

If MongoDB could compress data to 25% of its original size, it'd be like getting 4 times bigger SSD for free.

Data compression could also help to keep database size small enough to fit in memory. Another huge performance boost.

Comment by Tsz Ming Wong [ 06/Mar/10 ]

Although data can be compressed by the client, it would be nice if MongoDB handled this automatically.

Comment by Joseph Turian [ 01/Feb/10 ]

Agreed, being able to have compressed fields would be very useful.
