  1. Core Server
  2. SERVER-28078

Database entry internal query redirecting by new type of referencing

    • Type: New Feature
    • Resolution: Won't Fix
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Querying
    • Labels:
      None
    • Fully Compatible

      Hi!

      I had an idea of how referencing in MongoDB could be extended, making querying more versatile and removing the need for joins when querying.

      I am working on a Java-based framework for automatically storing data in MongoDB, so the following example might in parts relate to the Java driver, but the idea itself should not be platform-specific.

      The Problem:

      When encoding an object to create a new entry inside a collection, I came across the problem of multiple references inside that object pointing to the same sub-object. Here is an example (the "@number" represents an object id, so the same @id means this is in fact the same instance):

      TypeA (@000) : {
         fieldA1: TypeB (@001) : {
            fieldB1: ImageType (@002) : [*hugeByteData*]
            fieldB2: String (@003) : "foo"
         }
         fieldA2: TypeC (@004): {
            fieldC1: TypeB (@001) : {
               fieldB1: ImageType (@002) : [*hugeByteData*]
               fieldB2: String (@003) : "foo"
            }
         }
         fieldA3: String(@005) : "bar"
      }
      

      As you can see, the instance TypeB(@001) is referenced twice, so its string field can be reached in a query via 2 paths:

      • At the path "fieldA1.fieldB2"
      • At the path "fieldA2.fieldC1.fieldB2"

      When encoding the instance TypeA(@000) above, I have the following problems:

      • I would like to avoid encoding the instance TypeB(@001) twice, because its image field is expensive to store.
      • Since a developer using my framework might query TypeB's string field to find the image he wants, he might use either of the 2 valid paths; so I have to encode TypeB a second time anyway to ensure his query does not silently miss the expected result.
      • Encoding TypeB(@001) twice is not only expensive, it is also irrelevant for later decoding: I have to restore the original TypeA(@000) instance, in which the 2 paths above point to the exact same instance. So when decoding, I will decode TypeB(@001) the first time it occurs and skip any later occurrence. Persisting it twice therefore only serves the vague possibility that the user queries a path inside the instance.
      • You could even come across a scenario where an instance is not only referenced twice, but the references form a cycle, as in a doubly linked list, where you can continue the path parent.child.parent.child.* forever. In that case you cannot persist an instance once per path, because you would persist it an infinite number of times, which not only wastes database space but causes a stack overflow. And if you write each instance only once, any query whose path passes through the parent will not work, so you also cannot query something like "where field='foo' and parent.field='bar'".
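The two failure modes in the list above can be reproduced with a minimal sketch. It assumes plain nested Maps as a stand-in for both the object graph and the encoded entry; the class and method names (NaiveEncoderSketch, encode, buildExample, overflowsOnEncode) are made up for illustration and are not part of any MongoDB API.

```java
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: nested Maps stand in for the objects and for the encoded entry.
class NaiveEncoderSketch {

    // Copies a nested Map depth-first, the way a naive encoder would,
    // and records per instance (by identity) how often it was written.
    @SuppressWarnings("unchecked")
    static Map<String, Object> encode(Map<String, Object> source, Map<Object, Integer> writes) {
        writes.merge(source, 1, Integer::sum);
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : source.entrySet()) {
            Object v = e.getValue();
            out.put(e.getKey(), v instanceof Map ? encode((Map<String, Object>) v, writes) : v);
        }
        return out;
    }

    // Builds the TypeA example: the same TypeB instance is reachable via two paths.
    static Map<String, Object> buildExample(Map<String, Object> typeB) {
        Map<String, Object> typeC = new LinkedHashMap<>();
        typeC.put("fieldC1", typeB);   // second path: fieldA2.fieldC1
        Map<String, Object> typeA = new LinkedHashMap<>();
        typeA.put("fieldA1", typeB);   // first path: fieldA1
        typeA.put("fieldA2", typeC);
        typeA.put("fieldA3", "bar");
        return typeA;
    }

    // Returns true if naive encoding of the given graph never terminates
    // (detected here by the resulting StackOverflowError).
    static boolean overflowsOnEncode(Map<String, Object> root) {
        try {
            encode(root, new IdentityHashMap<>());
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, Object> typeB = new LinkedHashMap<>();
        typeB.put("fieldB1", new byte[] {1, 2, 3}); // stands in for the huge image data
        typeB.put("fieldB2", "foo");
        Map<Object, Integer> writes = new IdentityHashMap<>();
        encode(buildExample(typeB), writes);
        System.out.println("TypeB written " + writes.get(typeB) + " times");

        // parent.child.parent.child... : a cycle the naive encoder cannot finish
        Map<String, Object> parent = new LinkedHashMap<>();
        Map<String, Object> child = new LinkedHashMap<>();
        parent.put("child", child);
        child.put("parent", parent);
        System.out.println("cycle overflows: " + overflowsOnEncode(parent));
    }
}
```

In this sketch the shared TypeB map is written once per path that reaches it, and the parent/child cycle never finishes encoding.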

      Of course, the obvious solution for this example would be to persist instances of TypeB in a separate collection. But that is not possible, because:

      • I do not know what kind of objects the user of my framework is going to persist, so I also will not know which types might have to be stored in a separate collection.
      • If I decide on the fly while encoding that a type has to be extracted to a different collection, the user will not know about it, so his later queries will not find any results. Since joins are discouraged in MongoDB, I also cannot simply join automatically created collections together when the user is querying.
      • If I leave the extraction of sub-instance types like TypeB to the user (for example by tagging TypeB with an annotation my framework looks for), he will have to keep that in mind when writing queries, which makes writing them quite cumbersome.

      My Suggestion:

      My suggestion is to create a new reference type containing a path that is valid only inside the db entry in which the reference itself is located. When a query touches such a reference, it continues in that same entry at the path the reference points to.

      For my example above, this would look like this:

      TypeA (@000) : {
         fieldA1: TypeB (@001) : {
            fieldB1: ImageType (@002) : [*byteData*]
            fieldB2: String (@003) : "foo"
         }
         fieldA2: TypeC (@004): {
            fieldC1: REFERENCE("fieldA1")
         }
         fieldA3: String(@005) : "bar"
      }
      

      So when firing a query like...

      "fieldA2.fieldC1.fieldB2" *eq* "foo"
      

      ..., the driver will recognize the reference at "fieldA2.fieldC1" and redirect the query: the part of the path it has already traversed is replaced with the path stored in the reference. So then, the query path (just for the current collection entry!) would be interpreted as:

      "fieldA1.fieldB2" *eq* "foo"
      

      That would mean no duplicate data, no data written only for the sake of queries, and a clean data structure, while still having full query functionality.
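The redirect described above can be sketched outside of MongoDB as a small path resolver. This is only an illustration under assumptions: nested Maps stand in for a collection entry, and InternalReference, resolve, exampleEntry and MAX_HOPS are hypothetical names, not existing server or driver features.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the proposed redirect; nothing here is an existing MongoDB API.
class ReferenceQuerySketch {

    // Stand-in for the proposed reference type: a path valid only inside the same entry.
    static final class InternalReference {
        final String path;
        InternalReference(String path) { this.path = path; }
    }

    // Guard against references that point at each other (an assumption of this sketch).
    static final int MAX_HOPS = 16;

    // Resolves a dotted path against one entry, following internal references.
    static Object resolve(Map<String, Object> entry, String path) {
        return resolve(entry, path, 0);
    }

    private static Object resolve(Map<String, Object> entry, String path, int hops) {
        if (hops > MAX_HOPS) {
            throw new IllegalStateException("reference loop while resolving " + path);
        }
        String[] segments = path.split("\\.");
        Object current = entry;
        for (int i = 0; i < segments.length; i++) {
            if (!(current instanceof Map)) {
                return null; // the path leads past a leaf value
            }
            current = ((Map<?, ?>) current).get(segments[i]);
            if (current instanceof InternalReference) {
                // Replace the prefix walked so far with the referenced path, keep the rest.
                String rest = String.join(".", Arrays.copyOfRange(segments, i + 1, segments.length));
                String target = ((InternalReference) current).path;
                String redirected = rest.isEmpty() ? target : target + "." + rest;
                return resolve(entry, redirected, hops + 1);
            }
        }
        return current;
    }

    // The entry from the suggestion: fieldA2.fieldC1 is REFERENCE("fieldA1").
    static Map<String, Object> exampleEntry() {
        Map<String, Object> typeB = new LinkedHashMap<>();
        typeB.put("fieldB1", new byte[] {1, 2, 3}); // stands in for the image bytes
        typeB.put("fieldB2", "foo");
        Map<String, Object> typeC = new LinkedHashMap<>();
        typeC.put("fieldC1", new InternalReference("fieldA1"));
        Map<String, Object> typeA = new LinkedHashMap<>();
        typeA.put("fieldA1", typeB);
        typeA.put("fieldA2", typeC);
        typeA.put("fieldA3", "bar");
        return typeA;
    }

    public static void main(String[] args) {
        Map<String, Object> entry = exampleEntry();
        // Both paths end at the same stored value.
        System.out.println(resolve(entry, "fieldA1.fieldB2"));
        System.out.println(resolve(entry, "fieldA2.fieldC1.fieldB2"));
    }
}
```

Both calls in main yield "foo", although the TypeB data is stored only once. The hop limit is a design choice of the sketch; a real implementation would need some rule for references that point at other references.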

      In the Java driver, when writing the encode() method of a codec, this means the interface org.bson.BsonWriter would need a new method like:

      org.bson.BsonWriter.writeInternalReference(String path)
      

      When writing my codec's decode() method, I would have to react to the reference by simply referencing the instance I have already decoded at the path of the reference.
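How a codec might use such a reference on both sides can be sketched as follows. The sketch deliberately avoids org.bson, since writeInternalReference does not exist there: nested Maps stand in for the writer and reader, and InternalReference, encode and decode are hypothetical names. It also assumes that fields are decoded in the same order in which they were encoded.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; Maps stand in for BsonWriter/BsonReader output and input.
class ReferenceCodecSketch {

    // Stand-in for what a writeInternalReference(path) call would emit.
    // An empty path denotes the root of the entry (an assumption of this sketch).
    static final class InternalReference {
        final String path;
        InternalReference(String path) { this.path = path; }
    }

    // Encodes a nested Map; an instance met again is written as a reference to its first path.
    static Map<String, Object> encode(Map<String, Object> root) {
        return encode(root, "", new IdentityHashMap<>());
    }

    @SuppressWarnings("unchecked")
    private static Map<String, Object> encode(Map<String, Object> source, String prefix,
                                              Map<Object, String> firstPath) {
        firstPath.put(source, prefix); // remember where this instance was first written
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : source.entrySet()) {
            String path = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            Object v = e.getValue();
            if (!(v instanceof Map)) {
                out.put(e.getKey(), v);
            } else if (firstPath.containsKey(v)) {
                out.put(e.getKey(), new InternalReference(firstPath.get(v)));
            } else {
                out.put(e.getKey(), encode((Map<String, Object>) v, path, firstPath));
            }
        }
        return out;
    }

    // Decodes an entry; each reference is replaced by the instance already decoded at its path.
    static Map<String, Object> decode(Map<String, Object> encoded) {
        return decode(encoded, "", new HashMap<>());
    }

    @SuppressWarnings("unchecked")
    private static Map<String, Object> decode(Map<String, Object> encoded, String prefix,
                                              Map<String, Map<String, Object>> decodedAt) {
        Map<String, Object> out = new LinkedHashMap<>();
        decodedAt.put(prefix, out); // register before descending so cycles can resolve
        for (Map.Entry<String, Object> e : encoded.entrySet()) {
            String path = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            Object v = e.getValue();
            if (v instanceof InternalReference) {
                out.put(e.getKey(), decodedAt.get(((InternalReference) v).path));
            } else if (v instanceof Map) {
                out.put(e.getKey(), decode((Map<String, Object>) v, path, decodedAt));
            } else {
                out.put(e.getKey(), v);
            }
        }
        return out;
    }
}
```

With this approach a shared instance is written once, a parent/child cycle terminates, and decoding restores the original identity: both paths end at the very same decoded object.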

      For me, this solution looks quite simple while being very versatile and flexible. Of course, I do not know the internals of the server and its drivers, so I cannot estimate how complex such a feature would be to implement. Also, I hope the idea does not violate any concept of MongoDB/NoSQL.

      I would love to hear what you think!

            Assignee:
            asya.kamsky@mongodb.com Asya Kamsky
            Reporter:
            SenorDumpfbacke Tobias Weber [X]
            Votes:
            1
            Watchers:
            7

              Created:
              Updated:
              Resolved: