[SERVER-3288] Symbol table for attribute names Created: 17/Jun/11 Updated: 06/Dec/22 Resolved: 18/Dec/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | Kenn Ejima | Assignee: | Backlog - Storage Execution Team |
| Resolution: | Duplicate | Votes: | 8 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Assigned Teams: |
Storage Execution
|
||||||||
| Participants: | |||||||||
| Description |
|
We've seen lots of people making abbreviated attribute names like "rcv_uid" for "receiving_user_id" or "pw" for "password" in a document - to improve space efficiency where they could repeatedly appear in a large collection. This is kind of ironic because, one of the greatest aspects of document-oriented database is to have flexible and "intuitive" document structure. While we all like the idea of schema-free design, in reality, we actually NEED to have schema for better performance. Documents should be structured in a certain way, and indexed attributes are critical. Here's a big question: What if we had a global symbol table for any attribute names in the database? Possible values for attributes are unlimited, but possible "keys" are practically limited. If we map the keys using 32bit symbol table as follows: 0x0001 => receiving_user_id (17 bytes -> 4 bytes) and the persisted presentation of document could take up less space. The median of key length in my past projects is like 10-12 bytes (e.g. "achievement_id", "leaderboard_id", "max_version_id", "icon_content_type"), so it's a big deal. In some pathological cases where 1-3 byte keys are used it means slight increase in size of course, but practically it will be almost always a win. But the best part of this feature is change in mentality- we could stop worrying about keys taking too much space and start to use clear, descriptive key names for the document schema. As Phil Karlton said, "There are only two hard things in Computer Science: cache invalidation and naming things." Let's keep naming things non-restrictive. |
| Comments |
| Comment by James Gray [ 21/Apr/12 ] |
|
It is the same basic issue, but the proposition given is a little absurd. No matter the implementation, it should be completely transparent to the user AND the driver. |
| Comment by Ben McCann [ 20/Apr/12 ] |
|
Totally agree this is needed. Or some type of compression that would notice the same keys being used and be able to efficiently compress the docs. Anyway, seems to be a duplicate of this other issue that you should probably follow and vote for instead: https://jira.mongodb.org/browse/SERVER-863 |
| Comment by Glenn Maynard [ 13/Apr/12 ] |
|
Since the large majority of documents have fewer than 128 unique keys, a variable-length encoding could be used to store the symbol offset for strings. The encoding algorithm from UTF-8 is a good one. That way, the key names in most documents, which have a reasonably small number (up to 128) of static keys, would be stored as a single byte. Up to 2048 would be encoded in two bytes. (This would require that each collection have its own table, though.) The byte 0xFF, which is never the first byte in a UTF-8 sequence, could be used to mean "not in the symbol table", with the symbol immediately following it. This would add one byte of overhead per document key in the worst case. |
| Comment by James Gray [ 12/Apr/12 ] |
|
This would be such a great improvement. At the moment I am having to move most of my data to another database. Frankly, it is because I did make the assumption that keys were pooled in some way. Now that I know that is not the case, it's enough for me to switch back to a more traditional DB. When you are dealing with millions of documents, it is a bit of a deal breaker. I hope MongoDB makes space-efficiency a higher priority in the near feature, so that I can use it for more of my projects. It really is a joy to use. |
| Comment by Glenn Maynard [ 04/Apr/12 ] |
|
It's bad to get people in the habit of obfuscating their keys--storing data efficiently is the database's job. It's pretty surprising that string pooling wasn't done from day one. Don't assume that keys are finite, though; you can definitely have dynamic strings (eg. UUIDs) as keys. The string table needs to fit in memory, so it needs some limitations on what's stored there (eg. maximum string length). Also, please note https://groups.google.com/forum/?fromgroups#!topic/mongodb-user/EO6RDk-ATfU. I'm manually padding objects within arrays to a fixed BSON length, to ensure that later updates can happen in-place. If this feature is added, the stored BSON size will no longer be the same as the size we see on the client-end. A prerequisite for this feature should be an array-size hint in BSON, which would allow clients to request that array items, when stored on disk, be padded to a specific size. That avoids this problem. |