[SERVER-863] Tokenize the field names Created: 02/Apr/10  Updated: 22/Jan/19  Resolved: 22/Jan/19

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Karoly Negyesi Assignee: Geert Bosch
Resolution: Won't Fix Votes: 304
Labels: None

Issue Links:
is duplicated by SERVER-3288 Symbol table for attribute names Closed
is duplicated by SERVER-9247 Small modification to Data storage sc... Closed
is duplicated by SERVER-7402 optmization on key in 2.2 Closed
is related to SERVER-3536 templates for schemas Closed
is related to SERVER-164 Option to store data compressed Closed
Sprint: Storage NYC 2019-01-14, Storage NYC 2019-01-28


Most collections, even if they don't contain documents of identical structure, contain similar ones. So it would make a lot of sense, and save a lot of space, to tokenize the field names.

When the client connects to the DB, the DB could send the token table to the client. It surely won't be big. When the client needs to serialize a field it does not have a token for, it assigns a random 32-bit value to it and sends the token along with the command.

Of course, this could be optional.
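The proposal above can be sketched roughly as follows; `FieldTokenizer` and its shape are illustrative inventions, not an existing driver API, and real code would also need to handle the (rare) collisions that random token assignment allows:

```python
import os
import struct

class FieldTokenizer:
    """Client-side tokenizer as described above (illustrative only)."""

    def __init__(self, server_table=None):
        # field name -> 32-bit token, seeded from the table the server
        # is assumed to send on connect
        self.tokens = dict(server_table or {})
        self.new_tokens = {}  # tokens minted locally, sent along with the command

    def token_for(self, field):
        if field not in self.tokens:
            # assign a random 32-bit value, as the description suggests
            tok = struct.unpack("<I", os.urandom(4))[0]
            self.tokens[field] = tok
            self.new_tokens[field] = tok
        return self.tokens[field]

    def encode(self, doc):
        """Return (doc keyed by tokens, tokens the server hasn't seen yet)."""
        out = {self.token_for(k): v for k, v in doc.items()}
        new, self.new_tokens = self.new_tokens, {}
        return out, new
```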

Comment by Romy Maxwell [ 21/Nov/10 ]

I also think this would be a good addition, as I find myself shortening key names to cryptically short lengths for precisely this reason.

Comment by Hampus Wessman [ 01/Dec/10 ]

Most collections have objects of similar structure, but not necessarily all. I think it's important to still support the use case where there is a huge number of different field names in the same collection. That's not a problem if it is optional, of course, but I'm not sure it is worth the added complexity anyway. Perhaps this could be done only when several objects are sent together, with new tokens used each time, but then it may be easier to use some other kind of compression instead.

Comment by Andrew Armstrong [ 09/Dec/10 ]

I think this would be a useful addition to save both storage space and RAM usage while things are in memory, getting more performance and storage out of a MongoDB instance.

This would also free developers from worrying about their document field name lengths (which ends up leading to harder-to-read code).

An option would be to maintain a local hashtable of Field Id <-> Field Name that is global only to that particular shard. Different shards would maintain their own local map which would possibly have a Field Id <-> Name combination different to other shards for the same field name, which is fine.

When a client performs a query, they always specify the field names every time.

When the driver is about to send the request to MongoDB, it first checks to see if it has a local map of Field Name <-> Field Id for that particular query's fields.

If a local map exists, the driver transparently changes the query to specify Field Ids instead of the Field Names. If no map exists, Field Names are sent instead, along with a flag to indicate the driver supports Field Id substitution.

The driver then receives the requested documents; having indicated to the server that it supports Field Id substitution, it also receives a partial Field Id <-> Name hashtable (the necessary fields only), and the results have their Field Names already substituted, to save on network bandwidth.

The driver remembers this hashtable for the current session, so in future requests it can substitute the field names for Ids when issuing future queries.

When storing documents in RAM/disk, let the Field Id be persisted to disk and the Field Name is maintained in the hashtable. Keep the hashtable in RAM of course to ensure fast lookups of Field Id <-> Field Name.
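A rough sketch of the shard-local map and query rewriting described above (all names are hypothetical; a real implementation would live inside the server, not in Python):

```python
class ShardFieldMap:
    """Shard-local Field Id <-> Field Name map, per the comment above.

    Ids are assigned independently per shard, so the same field name may
    map to different ids on different shards - which, as noted, is fine.
    The table would be kept in RAM for fast lookups in both directions.
    """

    def __init__(self):
        self.name_to_id = {}
        self.id_to_name = {}
        self._next_id = 0

    def id_for(self, name):
        if name not in self.name_to_id:
            self.name_to_id[name] = self._next_id
            self.id_to_name[self._next_id] = name
            self._next_id += 1
        return self.name_to_id[name]

    def rewrite_query(self, query):
        """Transparently substitute Field Ids for Field Names in a query."""
        return {self.id_for(k): v for k, v in query.items()}

    def partial_table(self, names):
        """The partial Id <-> Name table sent back to an id-aware driver."""
        return {self.id_for(n): n for n in names}
```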

The overall benefit as mentioned by other users is that you reduce the amount of storage/RAM taken up by redundant data in each document (so you can use less resources per request, hence gain more throughput and capacity), while importantly also freeing the developer from having to pick short and hard to read field names as a workaround for a technical limitation.

Best of all; such a change would allow existing deployments to start reaping the benefits as soon as possible.


Comment by Srirang G Doddihal [ 02/Feb/11 ]

Would it make sense to optionally allow creation of collection with "schemas" and save the space for keys in such collections?

This probably sounds totally against the Mongo philosophy, but I am sure there are a sufficient number of users who are actually storing the same types of documents in a collection and for whom this actually makes sense.

This wouldn't introduce any restriction and anything people are doing currently will still be possible.

Should this be a different ticket?

Comment by Tilo S [ 05/Feb/11 ]

A suggestion on how this could be implemented, and some Ruby code:


Comment by Remon van Vliet [ 18/Feb/11 ]

This would add a lot of complexity and keep in mind that if people have a fully predetermined schema they might not have landed on MongoDB as their persistence solution in the first place.

Either way, there are probably two routes here :

1) Allow for an optional schema definition that allows mongodb to tokenize all fieldnames up front and do simple fieldname conversion driver side. This would have to be defined on collection creation.
2) Let MongoDB dynamically manage the fieldname dictionary as fields are added/removed, etc.

The latter option is completely transparent and a definite space saver but I'd be very worried about insert/upsert performance. Option 1) is not that compatible with the mongodb schemaless philosophy in my opinion but it wouldn't have dramatic performance issues.

Comment by Tomasz Nurkiewicz [ 01/Apr/11 ]

I would recommend managing the field names dictionary per collection only on the server side and leaving the client-server wire protocol backward compatible (sending the field names dictionary to the client might be an idea for the future). So instead of storing a field name several times, 1 or 2 bytes would be used as an index into the dictionary. One byte should be enough for most documents, and I can't think of any document exceeding 64K different field names.

To further compact the data, 1 byte might be used for the first 128 fields and 2 bytes (UTF-8-like encoding) for fields with indexes of 128 and above (the highest bit of the first byte decides). One could even think of pushing the most frequent fields to the beginning of the field names dictionary.
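The 1-or-2-byte index encoding sketched above might look like the following (an illustrative variant, not a proposed wire format):

```python
def encode_field_index(i):
    """1 byte for indexes 0-127; otherwise 2 bytes with the high bit of
    the first byte set as the 'long form' flag."""
    if i < 0x80:
        return bytes([i])
    if i < 0x8000:
        return bytes([0x80 | (i >> 8), i & 0xFF])
    raise ValueError("field index too large for this 2-byte scheme")

def decode_field_index(buf):
    """Return (index, number of bytes consumed)."""
    if buf[0] & 0x80:
        return ((buf[0] & 0x7F) << 8) | buf[1], 2
    return buf[0], 1
```

Note that with one bit spent on the flag, the two-byte form covers 32,768 indexes rather than the full 64K mentioned above; a LEB128-style continuation scheme would extend it further.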

Keeping the changes only on the server side for now wouldn't affect existing clients/drivers while providing a great speed/capacity boost. I can easily imagine a database where descriptive field names occupy 50% or more of the database space. On the other hand, shortening field names into cryptic 2-3 letter codes is not advisable.

IMHO indexing field names and storing them only once in a per-collection dictionary is completely orthogonal to document schemas, if they are ever introduced. This is just an internal and transparent storage optimization.

Comment by Andrew Armstrong [ 02/Apr/11 ]

Yes; I'd also expect this to give you more performance (more useful data in RAM vs. duplications of field names etc.) and more 'bang for your buck' when buying more memory.

Comment by Alexandru Ioan Turc [ 30/Jun/11 ]

I find myself in the situation of eventually selecting mongodb as a storage system for an enterprise level application. I'm doing some testing and things work pretty well, except for the fact that the system is memory hungry, actually much more so than I expected. Something around the idea suggested by Andrew Armstrong / Tomasz Nurkiewicz would cut down on memory requirements quite a bit.

I'm inserting documents around 730 bytes each. I inserted 4 million of such documents, which is not a lot. I'm expecting to have maybe something like 200,000 documents of this size every day. This means that I would have 4 million such documents in about 20 days. If the system runs only during business days, this means 4 mil documents / month.

It inserted at a speed of 2800 docs / sec, which is ok given that my test server is running in a VM. But the data size got to 3 GB. If I have a replica set, this means I need 6 GB of RAM per month. Per year this easily means 72 GB.

There is a lot of redundant data stored in documents, both for field names (each document has pretty much the same field names, and in my test they had between 1 and 3 chars) and for field values as well. Actually most of the strings in my documents are repeating information - and I would not be surprised if other people were in a similar situation, since data is de-normalized compared to a RDBMS. Of course, those strings can be encoded as numbers at application level, but this would make interaction with the database very difficult and error prone, and almost impossible from a console.

So a basic dictionary system, even implemented at server level in the first stage without involving the driver (which would be another great improvement but not critical), would make the entire system much more attractive.

I actually made a test, replacing all string values with integers (except key names). The average object size went down to 491 bytes, insertion speed went up to 3300 docs / sec, and total data size went down to about 2 GB. If I have a replica set this means I need 4 GB of RAM per month. This implies that I can save about 2 GB / month, or 24 GB a year, with just a simple dictionary. Which is a 33% improvement. There will be some impact on speed, due to an indirection, but if using a dictionary or not is a per-collection feature, one can choose.

Also, if key IDs are encoded like Tomasz suggested, most of them will be actually encoded as 1 byte (since for each large collection the number of fields will be quite limited). This, in my test would mean a saving of about 400MB a month or 4.8 GB a year. This is another 6%. So memory requirements can go down by 40% with a few not so complicated optimizations.

Now if the dictionary is available at driver level, which does the mapping, one would expect a decrease in network bandwidth and eventually even higher insertion / data retrieval speed. But with today's network speeds this is probably not a critical issue.
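A quick back-of-envelope check of the figures in this comment (replica factor of 2 assumed, as above):

```python
# Numbers taken from the comment above; replica factor of 2 assumed.
DOCS_PER_MONTH = 4_000_000
REPLICAS = 2

orig_doc = 730   # bytes, original average object size
short_doc = 491  # bytes, after replacing string values with integers

orig_gb = orig_doc * DOCS_PER_MONTH * REPLICAS / 1e9
short_gb = short_doc * DOCS_PER_MONTH * REPLICAS / 1e9
print(f"original: {orig_gb:.1f} GB/month, "
      f"dictionary-encoded: {short_gb:.1f} GB/month, "
      f"saving: {1 - short_doc / orig_doc:.0%}")
# roughly 5.8 GB vs 3.9 GB per month, a saving of about 33%
```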

Comment by Paul Harvey [ 21/Jul/11 ]

This is a crippling artifact. We've also had to implement a hash dictionary for database names, due to case-insensitivity and 100-byte limits there.

This is not something an app developer should have to worry about. MongoDB should be doing this stuff for us.

A colleague at my workplace tested MongoDB against Cassandra (they were looking for better write performance). While MongoDB performed extremely well, they just couldn't stomach the almost 10x increase in disk space required: their field names were much longer than the data they were storing.

I am perplexed at the comments expressing concern over the schemaless use-case. That's exactly why we came to MongoDB, and we've expended a lot of effort making it work.

Yet, we can't sort any result set of more than a few thousand records unless there's an index on the sort key. And we have a limit of 64 indexes per collection, each one taking up more and more memory. So I'm in a situation where we have "schemaless" data in the true sense of the word, we can't predict how users want to view their information, and for any serious application you need paging of results, which means sorting... in other words, I wouldn't worry too much about the 'schemaless' use-case, because MongoDB is hardly catering to it at this point.

Comment by Flavien [ 24/Jul/11 ]

I agree with all of this, this is an easy way to dramatically reduce the size of MongoDB collections, improve performance, scalability, etc... This can be implemented so that it remains completely backwards compatible (i.e. the tokenization is only internal to the engine).

Comment by Martin Lazarov [ 03/Aug/11 ]

I agree with Tomasz Nurkiewicz.
Adding collection keys to a dictionary will dramatically reduce the size of MongoDB collections and improve performance.

Comment by Romy Maxwell [ 05/Aug/11 ]

Lots of agreement, but this issue has been open since Nov 2010. Is anything ever going to happen on this front?

Comment by Mike [ 28/Oct/11 ]

Mongo staff, please provide an update as to your thoughts on this issue and ideally when you'll assign the task to someone.

We are evaluating Mongo for several activities, possibly instead of using a RDBMS, however this issue has been hard to swallow.

  • 50% increase in database size (vs rdbms) estimates are kind. If we actually name fields with proper "descriptiveness", we see the possibility of 100% DB size increases (and even higher). This kind of inefficiency is unacceptable. Think of a document/model with 20 fields, each 15 chars long on avg. 300+ extra bytes per document for field names is hard to swallow. For example, imagine configuration parameters for some application, where you might ideally have descriptive names such as "moduleWidgetNetworkTypeSlow" (27 characters), with values between 1 to 5 characters... Field names can be many times the size of their actual values.
  • So if we proceed we must use cryptically short field names and rely on mapping within the application. Unfortunately, ad-hoc inspection via the console and other means will suffer.

Transparent dictionary tokenizing would be ideal. It should probably be handled on the server, however please start a dialogue if it needs to be discussed.

Comment by august bering [ 06/Dec/11 ]

I'm thinking of having a go at this myself, but there's always the sharding issue... I guess it might be possible to put the field dictionary in a collection which is read by each shard on startup and then held in memory. I don't know if this is possible but I'll have a look at the code to check it out.

Comment by august bering [ 06/Dec/11 ]

Meanwhile, an alternative is to put the database files on a compressed file system, which is what I'm doing right now. For my data I get a 75% decrease in diskspace using LZO compression. No significant performance penalty.

Comment by august bering [ 09/Dec/11 ]

I originally tried using the excellent fusecompress filesystem, but unfortunately the frequent changes in all those 2GB datafiles seemed to slow the system to a halt occasionally. I suspect that's due to some defragmentation of the compressed data. Anyway, I decided to build my own fuse filesystem optimized for large database files, and here it is: https://github.com/augustbering/ABFS

For a set of databases that took a good 65GB uncompressed, the same files put on ABFS is only about 8GB. MongoDB files are often very sparse (zero filled), so depending on your particular setup you might gain a lot less or more. Having a look at the mongodb stats() command output, my actual data seems to be compressed about 4 times, but as mongodb allocates disk space in big blocks (up to 2GB) you might get a compression factor of 10 or more (I've seen up to 30/1 ratio).

As for speed, there's no measurable impact for my write tests (inserts and mapreduce) when journaling is turned off. Reading is actually slightly faster on a compressed system.

Comment by Alexander Nagy [ 25/Feb/12 ]

Huge +1, this is our number 1 pain point. Kills code maintainability but the performance issue is too large to ignore. Could be transparently implemented on the server side.

Comment by Reuben Garrett [ 01/Mar/12 ]

another compelling use-case is aliasing field names to conform documents to some downstream "interface". for example, i can define a function:

Long increment(DBObject obj) {
    return ((Long) obj.get("target")) + 1;
}

... and feed it with a query that aliases "mySpecialField" to "target", decoupling the definition of increment() from the definition of the documents i want to process.

Comment by Alexander Nagy [ 28/Sep/12 ]

The description is overly prescriptive. Tokens do not necessarily need to be random, nor do they need to be 32-bit integers.

For example, using LEB128 (http://en.wikipedia.org/wiki/LEB128) would, for most objects, reduce the overhead per field to 1 byte (instead of 4 if 32-bit integers are used).

Comment by LeoN Nortje [ 28/Feb/13 ]

Just another vote for this - descriptive field names are a huge plus in terms of readability. In fact, almost non-negotiable.

I would prefer a standardised tokenization method applied on the driver side. Implementing this in the drivers saves on I/O and prevents adding a computational burden on the server side. This would fit with the Mongo philosophy of doing grunt work on the client side, similar to what is done with the _id. And of course this prevents adding complexity to Core Server code.

Drivers can maintain a single special per-collection dictionary object (__collection_dictionary or whatever) to synchronise substitution between clients.

This may feel slightly hacky, but is definitely the lesser evil compared to turning all our field names into line noise.

So -

  1. pick a tokenization format that lends itself to a lowish-overhead client side implementation (LEB128 as per the previous poster seems as good as any, i'm no expert)
  2. define the structure, naming and handling of the special collection dictionary object
  3. let the driver writers get cracking
  4. world peace?
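The shared dictionary object from step 2 could work roughly like this (a sketch only; `store` stands in for the special `__collection_dictionary` document proposed above, and real drivers would need atomic upserts so that two clients never assign different tokens to the same field):

```python
class DriverDictionary:
    """Per-collection dictionary shared between clients (illustrative).

    `store` stands in for the special dictionary document; a real driver
    would read and upsert that document on the server so substitution
    stays synchronised between clients.
    """

    def __init__(self, store):
        self.store = store        # the shared {field name: token} document
        self.local = dict(store)  # this client's cached copy

    def token(self, field):
        if field not in self.local:
            # refresh first: another client may have tokenized it already
            self.local.update(self.store)
        if field not in self.local:
            tok = len(self.store)    # next free token
            self.store[field] = tok  # would be an upsert server-side
            self.local[field] = tok
        return self.local[field]

    def encode(self, doc):
        return {self.token(k): v for k, v in doc.items()}
```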
Comment by Francis West [ 10/Mar/13 ]

Bump - LeoN Nortje hits the nail on the head. A server-side, collection-specific dictionary seems appropriate. Transparent to us clients. What would be the downside to that?

Comment by Ben McCann [ 10/Mar/13 ]

I agree. This is hugely important as it would have real financial impacts on everyone using MongoDB. Fixing this would mean you could run the same MongoDB cluster with less disk and less RAM.

Comment by august bering [ 16/Mar/13 ]

I thought about trying to implement this feature as a proxy-server between the client and server, thus no rewrite of driver code. What do you think about that?

Comment by Ben McCann [ 16/Mar/13 ]

August, I don't think that's a great solution. Some of the drivers such as Java's Morphia already have support for this. A proxy server introduces extra latency by adding a network hop. Also, it wouldn't fix the fact that the shell needs support for this, which is the real problem today.

Comment by august bering [ 16/Mar/13 ]

I don't quite follow, doesn't the shell access the server through the standard socket API? And what kind of support for this does Morphia have? I don't find anything in the docs.

Comment by Ben McCann [ 16/Mar/13 ]

You could point the shell at the proxy server, but it doesn't negate the fact that running an entirely new process is a big hack compared to just directly fixing the mongod or shell. With Morphia you can annotate a field to use a shorter name when serializing. E.g. @Property("a") would serialize the field with "a" as the key.

Comment by Christopher Price [ 16/Mar/13 ]

As this ticket nears its 3rd birthday, I would like to point out that it is the 8th most voted on server ticket in the 10gen queue. It has sat in "planned but not scheduled" purgatory for 18 months. Is there really any chance that this feature will ever be implemented? Or should we all just plan on biting the bullet and migrating all of our data (and code) to less verbose and more storage-friendly representations?

I would suggest that any ticket this highly voted on (or higher) should be taken more seriously and be given a more explicit schedule for implementation than "planned but not scheduled". Or stop teasing us by leaving it open.

Comment by Eliot Horowitz [ 17/Mar/13 ]

The age is not a reflection of its importance.

In 2.6, we're working on an indirection layer between BSON and the storage layer that would let us do a number of interesting things, including this and compression.

Once we have that in place, then we have a couple of ideas on how to do something a bit tighter than just tokenization to get in cache and disk representations as small as possible.

Comment by Francis West [ 17/Mar/13 ]

The age and activity on this thread are both reflections of the importance to the community.
Great re: 2.6.

Comment by Christopher Price [ 18/Mar/13 ]

I understand that tokenization is one potential implementation for a feature that is really "support long field names in a storage friendly manner". Could you link the related stories from 2.6 and briefly comment on the additional work required and possible timeline of when we would see something tangible?

Comment by Ben McCann [ 19/Mar/13 ]

+1 for linking related tickets. Also, curious if compression / field name minimization is something that would make it in 2.6 or if it'll just be the indirection layer for this next release and not anything depending on it.

Comment by Stuart Johnson [ 28/Mar/13 ]

Big +1 for this. Single most important feature for us. Field names are such a waste of space. For our largest collections we use single character names, which is not ideal.

Comment by august bering [ 03/Apr/13 ]

Thinking some more about this I get increasingly convinced that implementing this as a proxy server would actually be the best solution. It would lower bandwidth (if the proxy is located at the client machine), not add any complexity to server or driver code base and be completely optional. Granted, there would be some overhead when adding an extra socket hop, but the extra work would be performed by a different core (and possibly on a different machine) than the mongodb server process.

Breaking out this function to a separate module just seems the cleanest way to go. And not that hard either, I'm thinking of modifying the nice https://github.com/recht/mongodb-proxy to try it out.

Comment by Stuart Johnson [ 03/Apr/13 ]

Nah, we don't want another point of failure with a proxy. There are two parts to this: server-side tokenizing and client-side tokenizing. Server side benefits everyone and will have an immediate impact on data storage size. Client side reduces bandwidth and can be optionally switched on, but requires the drivers to be updated.

Comment by Reuben Garrett [ 03/Apr/13 ]

Stuart Johnson - not sure if this is already part of how field tokenization would work, but it would be quite elegant if the client-side tokenizer could detect server-side support and bootstrap itself using the server's mappings.

Comment by Stuart Johnson [ 03/Apr/13 ]

That's how I imagined it would work. Client connects. Client optionally instructs the server that it is "Tokenizer aware". Server sends down its mappings.

Server side, the mapping table is used to save storage space, regardless of whether it is used by the client or not. Quite a significant improvement to storage space, I would imagine, and no more worrying over how long your field names are.

Comment by Eliot Horowitz [ 04/Apr/13 ]

One problem with simple tokenization is that the possible set of field names is massive.
So the solution we're working on takes that into consideration but requires it to be server only.
While reducing network bandwidth would be nice, the key goal is reducing memory/disk size of long field names & documents.

Comment by Dave [ 06/Apr/13 ]

Yes, the tokenization should be transparent to the client - server side only.

This is a nice js class I found that implements the tokenization.


IMO, the tokens should be at the collection level.

Comment by august bering [ 07/May/13 ]

I made an experimental java driver (fork of the official) with support for field name translation. Feel free to try it out: https://github.com/augustbering/mongo-java-driver

No support for map/reduce right now, but that might be pretty easy to add if people find it useful. Have a look at the wiki for details.

Comment by Christopher Price [ 01/Nov/13 ]

Based on the comment from Eliot Horowitz on March 17th 2013, I am looking forward to seeing the related tickets coming out in 2.6 that will help support this feature.

Comment by Khalid Salomão [ 19/Nov/13 ]

This is simpler and (in some ways) better than compression. Do you guys have any update on this feature?

Comment by august bering [ 20/Nov/13 ]

You should check out the java driver I mentioned above if you use java. The java driver team are also planning something like this for their next official release.

Comment by Ben McCann [ 20/Nov/13 ]

There should be some standard so that the shell and other drivers can implement the same thing.

Comment by Khalid Salomão [ 20/Nov/13 ]

august bering Thanks, but I was inquiring about the MongoDB server support for this feature.
This kind of feature should be implemented on the server for better performance, stability and support for the wide range of technologies that use MongoDB.

Eliot Horowitz about the possible huge number of distinct field names, I would suggest also some kind of threshold (like 500), after which the server stops the tokenization for the referenced collection. This could lead to a collection with a mix of tokenized and non-tokenized fields that would have to be dealt with... but it would avoid the problem of a massive number of distinct field names and keep the translation tables small enough to be kept in memory...

And this would be nice as a collection option that could be enabled by default...
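The threshold idea could be sketched as follows (the class name and API are hypothetical):

```python
class CappedTokenizer:
    """Tokenize field names only until the dictionary reaches `limit`
    entries (500 in the suggestion above); later field names are stored
    verbatim, giving the mixed collection the comment describes."""

    def __init__(self, limit=500):
        self.limit = limit
        self.tokens = {}

    def encode_field(self, name):
        if name in self.tokens:
            return self.tokens[name]
        if len(self.tokens) < self.limit:
            self.tokens[name] = len(self.tokens)
            return self.tokens[name]
        return name  # dictionary full: fall back to the raw field name
```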

Aside from my babblings above, do you have any update on this feature?

Comment by Christopher Price [ 20/Nov/13 ]

What is the ticket number for the java driver feature you mention? I would definitely like to follow that ticket too.

Comment by Francis West [ 20/Nov/13 ]

There's still no update: planned but not scheduled?

  • I think we can agree many people are actively asking for this.
  • It affects memory & storage (and therefore performance) - things app developers really worry about.
  • A lot of devs (I'm guessing) are being forced to re-write their document keys into unreadable shorthand as a result. We'll all pay for this in the future.
  • This shouldn't be that huge of a problem. A collection specific name lookup, transparent to client.
  • Can you at least re-review this, give us a real definitive timeline, or let us move on?
Comment by Christopher Price [ 11/Apr/14 ]

About a year ago you said:
"In 2.6, we're working on an indirection layer between BSON and the storage layer that would let us do a number of interesting things, including this and compression. Once we have that in place, then we have a couple of ideas on how to do something a bit tighter than just tokenization to get in cache and disk representations as small as possible."

Considering that 2.6 was just released, were you able to incorporate the indirection you described and if so, are the original ideas on how to do something like this a little more concrete?

Comment by Eliot Horowitz [ 11/Apr/14 ]

Yes, we have the indirection layer in place now, which is the proper place to do something like this.

Still not 100% sure which solution for this we would do first.

One question for those interested is what size/shape documents do you have?
One option is to have an option for compressing full documents.
That would be good for mid-size documents (2k-64k) but bad for small documents.
So it would be helpful to know if that would be a good option for some people.

Comment by Mitar [ 11/Apr/14 ]

I think compression should be between documents themselves, not just per document. For example, we store many timeseries values as documents so field names are then duplicated many many times.

Comment by Jianbin Wei [ 11/Apr/14 ]

We faced the same issue as @Mitar. We worked around the problem by adding another translation layer to map between the fields used by our applications and those in mongodb. The storage size saving is significant. In one case, the old size is 1.35GB and the tokenized one is 860MB; in another it is 143GB vs 94GB.

Comment by Christopher Price [ 11/Apr/14 ]

The documents I care about average just under 3kb in size.

Comment by Mitar [ 11/Apr/14 ]

We tested compression using tokumx and they got 6 GB of time-series data to ~500 MB.

Comment by Ben McCann [ 11/Apr/14 ]

Compression on TokuMX got our storage down to about 20% of what it was with MongoDB. We benchmarked renaming our field names to single characters with MongoDB and that got the space down to about 70% of what it was with human-readable field names, which is a nice savings, but not as good as TokuMX does.

Comment by Thanos Angelatos [ 11/Apr/14 ]

Hi there. In our case, our documents average between 650b-950b but number in the billions (SMS marketing). In this case we would benefit enormously from a field tokenisation/lookup scheme; 50% of our storage is field names. We're trying to find creative ways to shorten field names while still keeping a relative level of coherence; our goal is to fit as many documents as possible in the 4k page size. We currently fit 4-5 depending on size, and our perf. bottleneck is mongodb.

Comment by Vincent [ 11/Apr/14 ]

AFAIK, full document compression saves IO (a lot), not memory, with a "high" CPU cost.
Tokenizing field names would save both IO (medium) and memory (medium), without any real downside (except maybe collections with very high heterogeneity among documents), and a very low CPU cost.

Ideal solution would be to have both available, with full documents compression "on demand".

Also I +1 @Mitar, it's very common to have the same kind of document again and again in one collection (timeseries, etc.). I think even MMS does this.

Comment by Jianbin Wei [ 11/Apr/14 ]

Average document size in our case is about 800 bytes and we have about 30 fields.

Comment by Richard Smith [ 14/Apr/14 ]

For us average document size is around 260 bytes, field names are mostly short but still around 25% of document size.

Comment by Curt Mayers [ 21/Apr/14 ]

For my particular application, we are storing externally-defined XML objects, with verbose naming: the fieldnames are, perhaps, 70% of the total record size. So fieldname tokenization could be a giant win, both in terms of I/O and memory efficiency.

This could be accomplished entirely server-side, so it could be absolutely transparent to existing applications and drivers (if each database were able to tokenize up to 200 fieldnames, the name substitution in incoming queries and outgoing results could be done very efficiently).

Late last year, I attended a local MongoDB user group, and the MongoDB representative who was making the presentation said that he believed that field tokenization would be a part of the next major MongoDB release. Is that true, and is that still the case?

Comment by Pieter Willem Jordaan [ 10/Jul/14 ]

I personally think implementing it on server side will have more overhead for all scenarios. I think it should be handled client side per case. mongoengine for python has this ability and I know many others. I think this should rather be pushed to the driver level.

Comment by Karoly Negyesi [ 10/Jul/14 ]

I suggested storing the tokens server side (very, very long ago) so that the shell can show you the right data and not some inscrutable numbers. If you want client side only, well, your app surely has a CRUD layer and then it can do the tokenizing itself, for itself. That of course shuts off interoperability immediately.

Comment by Pieter Willem Jordaan [ 10/Jul/14 ]

It being the shell, I believe it should still be client side implemented. Shell aliases.

Comment by Jianbin Wei [ 10/Jul/14 ]

@Pieter Are you saying pymongo has this capability already? If yes, is there any documentation?

Comment by Curt Mayers [ 10/Jul/14 ]

I have to strenuously disagree about tokenizing fieldnames at the driver level.

One of the primary functions of databases is to share data between users and applications. Tokenizing fieldnames at the driver level would require absolute and precise synchronization of driver implementations (potentially between different languages and client platforms) for all clients of a database.

This can be easily and efficiently handled at the server level by using a dictionary for each table for which compressed fieldnames are being used. The client drivers need not change at all (or even be aware of the transformations). The database would be faster and more space and memory efficient (because each record would be smaller, and there could be more records stored per page). And because there need be only a single implementation of the fieldname mapping, in the storage engine itself, it would be a transparent and useful extension to MongoDB.

Comment by Pieter Willem Jordaan [ 10/Jul/14 ]

@Jianbin No I don't believe it is in pymongo. It is in mongoengine.

@Curt I can see that server side will be beneficial. No work needs to be done client side, so portability and interop will be better. However, I assume it will incur about the same penalty that indexes do: each write will have to update the token table, and each query will have to consult it (unless it is kept in RAM, but even then it may be a lot of tokens).

Besides, unless you have some special use case where unmapped tokens are used client side, each new 'field' requires a client-side code update anyway. So keeping the client in sync doesn't pose a problem; it is needed in any case.

Comment by Francis West [ 10/Jul/14 ]

@Curt - Have to agree with you; they'd be creating a minefield if this were done at the driver level. As @Pieter states, there's a slight performance hit if the token table is not in RAM, but regardless, that's partially offset by fewer pages and lower memory use in general.

Also, where not needed, surely people can switch the entire feature off.

Comment by Vincent [ 10/Jul/14 ]

To save even more space, a 16-bit token would probably be enough, I think (65,536 possible fields). In my case, such a feature would have dramatically positive effects, and I can't understand why this wasn't done from the very beginning of MongoDB. When you store numbers (like MMS probes), the field names always take more space than the data itself!
Also, it can't be done at the driver level (it would require saving the hashmap somewhere in MongoDB for the other drivers, or the tokens would have to be a hash of the field names themselves, in which case far fewer than 2^16 fields would be possible because of collisions).

Edit: the lookup could be made on the client side, with some support on the server side (returning, and probably updating, the hashmap, for example).
But anyway, the overhead would fall exclusively on the CPU side, and would be very low (because it saves you memory, remember? A full hashmap of 65,000 fields * (10 bytes + 2 bytes for the token) is only about 762 KB).

Comment by Pieter Willem Jordaan [ 10/Jul/14 ]

By driver level, I don't mean that the driver keeps track of the mappings. I envision something similar to mongoengine (an ODM). But the driver could make this wrapping easier by allowing the returned documents to be mapped through a user-supplied dictionary. Something like:

documents = collection.find({query_params}, {fields}, {mapping_optional})
# later on...
collection.insert(documents, {mapping_optional})

Personally for me this would be sufficient.

What's lacking in this is that inspecting the database with some other tool will show completely obfuscated names. In the shell one may easily write some form of wrapper with a script.

Application-wise, it will be sufficient. As I mentioned earlier, application-level code is needed for 'schema' changes anyway, and if all drivers supported this sort of translation it would be easily remedied.
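The {mapping_optional} idea above can be mocked up without any driver support at all. A minimal sketch, assuming an application-supplied mapping (long name -> short stored name); FakeCollection is a stand-in for a real driver collection, and every name here is hypothetical:

```python
def rename_keys(doc, mapping):
    # Rename only the keys present in the mapping; pass others through.
    return {mapping.get(k, k): v for k, v in doc.items()}

class FakeCollection:
    """Minimal in-memory stand-in for a real driver collection."""
    def __init__(self):
        self.docs = []

    def insert(self, doc):
        self.docs.append(doc)

    def find(self, query):
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in query.items())]

class MappedCollection:
    """Wraps a collection, translating field names on the way in and out."""
    def __init__(self, collection, mapping):
        self.collection = collection
        self.to_db = mapping                               # long -> short
        self.to_app = {v: k for k, v in mapping.items()}   # short -> long

    def insert(self, doc):
        self.collection.insert(rename_keys(doc, self.to_db))

    def find(self, query):
        # Queries are translated too, so the application never sees tokens.
        return [rename_keys(d, self.to_app)
                for d in self.collection.find(rename_keys(query, self.to_db))]
```

The weakness is exactly the one raised in the surrounding comments: every client of the database must carry an identical copy of the mapping.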

Comment by Vincent [ 10/Jul/14 ]

@Pieter Willem Jordaan => It won't work for people accessing the DB from multiple drivers, because it's impossible to keep the object mappings perfectly in sync across them all the time...

Comment by Francis West [ 10/Jul/14 ]

@Pieter Willem Jordaan This really can't have anything to do with drivers, it should be entirely transparent to clients. (For all the above mentioned reasons)

Comment by Pieter Willem Jordaan [ 11/Jul/14 ]

I've reconsidered everything and I completely agree with you now. It will be best to be transparent to the drivers and implemented server side. In the meantime one may use a wrapper as I've mentioned.

Comment by Vincent [ 22/Jul/14 ]

I just realized that what everybody wants is a serialization of data more or less like Apache Avro: http://avro.apache.org/docs/current/#schemas
So it already exists and works very well; maybe MongoDB can use some of their implementation tips to release something awesome.

Comment by Chad Kreimendahl [ 23/Jan/15 ]

I'm wondering if this is significantly less important now that the WiredTiger storage engine is out. We're seeing substantial compression because more than half of our "data" is the property names themselves. We had done a form of tokenization (the serialization options in the C# driver are amazingly well done), and we found that, with or without our own shortening, we basically ended up with the same on-disk and in-memory usage using snappy compression in WiredTiger.

Comment by Pieter Willem Jordaan [ 23/Jan/15 ]

I was wondering about this myself. Ultimately the best compression still starts from the smallest data set. But if the difference is negligible, then this ticket becomes less useful.

Comment by Kevin "Schmidty" Smith [ 24/Jan/15 ]

If you happen to be working in Node, you can try mingydb.

Comment by Francis West [ 24/Jan/15 ]

Great news - if the storage engine's compression takes care of the problem (do we have some stats, with/without tokenised names?), then I'll be happy to rely on that solution.

Comment by Dan Dascalescu [ 15/Aug/15 ]

Should this be closed now that compression has been implemented?

Comment by jonathanv [ 14/Oct/15 ]

The storage engine's compression only potentially takes care of this problem. If there are a lot of small documents in a large block size, the compression on fieldnames will be very noticeable. If there is one large document spanning multiple blocks, the compression on fieldnames will not be noticeable.
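The block-size effect is easy to demonstrate with a general-purpose compressor. A rough sketch using zlib as a stand-in for the storage engine's block compressor (the field names and document shapes here are made up):

```python
import json
import zlib

# Many small documents that share the same two long field names.
docs = [{"temperature_celsius": i, "humidity_percent": i % 100}
        for i in range(1000)]

raw = len(json.dumps(docs).encode())

# One big block: the repeated field names compress away.
together = len(zlib.compress(json.dumps(docs).encode()))

# One tiny "block" per document: each copy of the names is paid for again.
separate = sum(len(zlib.compress(json.dumps(d).encode())) for d in docs)
```

With the documents co-located in one block, `together` comes out far smaller than `raw`, while `separate` barely shrinks at all, which is jonathanv's point: the win depends on how many documents share a compression block.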

Comment by Johnny Shields [ 31/Dec/15 ]

This can be done in Ruby's Mongoid ODM by declaring field :foo, as: :field_full_name (where "foo" is the short token stored in the DB and :field_full_name is the alias used in application code).

Comment by Francis West [ 19/Dec/18 ]

A solution like this could rely on hashing the field name rather than a random identifier. That way we can know what a field name's id would be without a lookup, and a rainbow (reverse hash) table can be kept to ensure the other direction is resolvable when needed.
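A hedged sketch of that scheme: truncate a stable cryptographic hash to a few bytes so any party can derive the token independently, and keep the reverse table for display. Collision handling is the open problem, and every name below is hypothetical:

```python
import hashlib

reverse_table = {}  # token -> field name (the "rainbow"/reverse table)

def field_token(name, nbytes=4):
    """Derive a short, deterministic token from a field name."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    tok = int.from_bytes(digest[:nbytes], "big")
    existing = reverse_table.get(tok)
    if existing is not None and existing != name:
        # Two names truncated to the same token: the scheme must fall
        # back to something (a longer token, an escape record, an error...).
        raise ValueError(f"collision: {name!r} vs {existing!r}")
    reverse_table[tok] = name
    return tok
```

Determinism is the attraction: a writer and a reader that have never exchanged state still agree on the token for "customer_name". The cost is that tokens are 4 bytes here instead of the 1-2 bytes a dense counter would allow.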

Comment by Johnny Shields [ 15/Jan/19 ]

A hash in the traditional sense would defeat the purpose of small size here, because hashes are usually at least 16 bytes long.

What we would want is to store a reference table in the collection metadata. Each key gets a 1 or 2-byte key which is mapped by the reference table. All results and queries process keys using the reference table such that the internal representation is completely opaque to the user.
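The "1 or 2-byte key" can be made concrete with a varint-style encoding: the first 128 tokens cost one byte, and the high bit of the first byte signals a two-byte form for the rest. A sketch under that assumption (supporting up to 2^15 fields):

```python
def encode_key(token):
    """Encode a field token in 1 byte (0-127) or 2 bytes (128-32767)."""
    if token < 0x80:
        return bytes([token])
    if token >= 0x8000:
        raise ValueError("token table full")
    # High bit set marks the two-byte form.
    return bytes([0x80 | (token >> 8), token & 0xFF])

def decode_key(data):
    """Return (token, bytes consumed) from the front of `data`."""
    if data[0] < 0x80:
        return data[0], 1
    return ((data[0] & 0x7F) << 8) | data[1], 2
```

Since most collections have well under 128 distinct fields, nearly every key in a record would cost a single byte, versus the full name plus a NUL terminator that BSON stores today.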

Comment by Brian Lane [ 22/Jan/19 ]

We have decided to close this issue as won't fix.

Many of the original reasons for wanting tokenized fields were related to storage size. Currently, data compression is always performed, and there is no evidence that there would be much improvement with tokenization. With our 4.2 release, we are also introducing additional compressor options at the storage layer.

While we could change in-cache data representation to allow for less memory use and/or faster search, there is a significant benefit in using the same BSON format and avoiding an extra copy when doing collection scans (including for initial sync or other cloning operations).

Feel free to comment on this issue to provide additional feedback to us.


Comment by Pieter Jordaan [ 22/Jan/19 ]

I've been following this closely, but ever since compression was built in, I've been less interested. I'm happy with the new advances and don't really need this anymore. Besides, it might be better to do it application side anyway.

Thanks, Brian

Generated at Tue Feb 19 23:04:15 UTC 2019 using Jira 7.12.1#712002-sha1:609a50578ba6bc73dbf8b05dddd7c04a04b6807c.