- Type: New Feature
- Resolution: Duplicate
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Storage
- Storage Execution
We've seen lots of people abbreviating attribute names, like "rcv_uid" for "receiving_user_id" or "pw" for "password", to save space when the same keys are repeated in every document of a large collection. This is kind of ironic, because one of the greatest aspects of a document-oriented database is its flexible and "intuitive" document structure.
While we all like the idea of schema-free design, in reality we actually NEED a schema for better performance. Documents should be structured in a certain way, and indexed attributes are critical.
Here's a big question: what if we had a global symbol table for all attribute names in the database?
Possible values for attributes are unlimited, but the set of possible "keys" is practically limited.
If we map the keys using a 32-bit symbol table as follows:
0x00000001 => receiving_user_id (17 bytes -> 4 bytes)
0x00000002 => password (8 bytes -> 4 bytes)
then the persisted representation of a document could take up less space. The median key length in my past projects is around 10-12 bytes (e.g. "achievement_id", "leaderboard_id", "max_version_id", "icon_content_type"), so it's a big deal. In pathological cases where 1-3 byte keys are used, it means a slight increase in size, of course, but in practice it will almost always be a win.
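To make the idea concrete, here is a minimal Python sketch of such a symbol table. It is only an illustration of the proposal, not how the server stores documents: the names `SymbolTable`, `tokenize`, `detokenize`, and `key_bytes_saved` are hypothetical, documents are assumed to be flat, and the real on-disk format (type bytes, terminators, and so on) is ignored.

```python
import struct


class SymbolTable:
    """Hypothetical global table mapping field names to 32-bit ids."""

    def __init__(self):
        self._ids = {}    # field name -> id
        self._names = []  # position (id - 1) -> field name

    def intern(self, name):
        """Return the id for a field name, assigning a new one on first use."""
        if name not in self._ids:
            self._names.append(name)
            self._ids[name] = len(self._names)  # ids start at 1
        return self._ids[name]

    def name_of(self, symbol_id):
        """Look up the original field name for an id."""
        return self._names[symbol_id - 1]


def tokenize(table, doc):
    """Replace every key of a flat document with its 32-bit id."""
    return {table.intern(key): value for key, value in doc.items()}


def detokenize(table, tokenized):
    """Restore the descriptive key names when reading a document back."""
    return {table.name_of(sid): value for sid, value in tokenized.items()}


def key_bytes_saved(doc):
    """Bytes saved on key storage alone, at 4 bytes per id."""
    return sum(len(key.encode("utf-8")) - 4 for key in doc)


table = SymbolTable()
doc = {"receiving_user_id": 42, "password": "hunter2"}

tokenized = tokenize(table, doc)
# Each key is persisted as a fixed-width little-endian 32-bit integer.
packed_keys = b"".join(struct.pack("<I", sid) for sid in tokenized)

print(tokenized)             # keys replaced by ids
print(len(packed_keys))      # bytes used by the two keys
print(key_bytes_saved(doc))  # (17 - 4) + (8 - 4)
```

In this sketch the table only grows, so an id never changes meaning once assigned. Very short keys come out slightly worse: `key_bytes_saved({"pw": "x"})` is negative, which is the pathological case mentioned above.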
But the best part of this feature is the change in mentality: we could stop worrying about keys taking too much space and start using clear, descriptive key names in the document schema. As Phil Karlton said, "There are only two hard things in Computer Science: cache invalidation and naming things." Let's keep naming things non-restrictive.
- duplicates SERVER-863 Tokenize the field names (Closed)