[SERVER-16561] create graph database abilities on top of existing data

| Created: | 16/Dec/14 | Updated: | 06/Dec/22 |
| Status: | Open |
| Project: | Core Server |
| Component/s: | Index Maintenance, Performance |
| Affects Version/s: | None |
| Fix Version/s: | Needs Further Definition |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | Chad Kreimendahl | Assignee: | Backlog - Query Execution |
| Resolution: | Unresolved | Votes: | 2 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Query Execution |
| Participants: | |
| Description |
|
A great way to add some relational capability to the database, and to get more power out of data already in the system, would be a method for defining the nodes and edges of a graph through some definition or configuration over an existing collection or set of collections, much the way you would define an index. Many of us currently run a separate graph database (commercial or custom-built) alongside MongoDB to work around its well-known relational shortfalls, and the burden of maintaining the same data in two systems could easily be removed; doing so would improve the lives of many.

In the same way that the database has been improved by adding text indexing, collection/document-level locking, etc., it would be great to see this layer added on top for those of us with wildly complex relationships to model, but for whom a document database is still substantially better than a traditional relational SQL database. Others in the document-database space are doing this, but your document and API implementation are vastly superior. It would be great to see a way to define n-tiered relationships in a graph style while keeping the query language as simple as possible. |
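For context, a minimal sketch (with hypothetical collection and field names such as `things` and `relatedTo`) of the pattern this request builds on: relationships stored as an array of referenced _ids, backed by a multikey index, with multi-hop traversal currently done hop by hop from the client.

```javascript
// Hypothetical collection: each document lists the _ids it relates to.
db.things.insertOne({ _id: "A", age: 40, relatedTo: ["B", "C"] })
db.things.insertOne({ _id: "B", age: 70, relatedTo: ["D"] })

// A multikey index makes reverse lookups ("who points at B?") cheap ...
db.things.createIndex({ relatedTo: 1 })
db.things.find({ relatedTo: "B" })

// ... but multi-hop traversal still costs one round trip per hop on the client.
var frontier = ["A"];
for (var hop = 0; hop < 3; hop++) {
    var docs = db.things.find({ _id: { $in: frontier } }).toArray();
    frontier = docs.reduce(function (acc, d) { return acc.concat(d.relatedTo || []); }, []);
}
```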
| Comments |
| Comment by Asya Kamsky [ 09/Oct/16 ] |
|
3.4 will add a new stage, $graphLookup, to the aggregation framework. We need to determine how much of the desired graph functionality remains to be addressed in this ticket. |
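For reference, a minimal sketch of the 3.4 $graphLookup stage applied to the kind of reference-array model described in this ticket (collection and field names are hypothetical):

```javascript
// Walk "relatedTo" references up to three hops out from document "A".
db.things.aggregate([
  { $match: { _id: "A" } },
  { $graphLookup: {
      from: "things",            // collection to traverse (cannot be sharded in 3.4)
      startWith: "$relatedTo",   // seed value(s) for the walk
      connectFromField: "relatedTo",
      connectToField: "_id",
      as: "related",
      maxDepth: 2,               // 0-based, so 2 allows up to three hops
      depthField: "hops"         // records how far out each match was found
  }}
])
```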
| Comment by John Crenshaw [ 01/Sep/15 ] |
|
In response to "what an ideal implementation would do...what [people] are currently doing...samples of data models, semantics...": one thing to consider carefully would be the TinkerPop Blueprints API (see https://github.com/tinkerpop/blueprints/wiki). There is a third-party, MongoDB-backed implementation in Java (https://github.com/datablend/blueprints-mongodb-graph), but the original developer of that project has moved in a different direction (see https://groups.google.com/forum/#!topic/gremlin-users/nF-OUsQ4KVw), so it is unmaintained and likely to stay that way. That implementation also sits outside the database server, so it probably had substantially lower performance than a more direct implementation could provide (and the performance cost of the extra abstraction layer was a key factor in the author abandoning the project).

A core-level Blueprints implementation (similar to what ArangoDB and/or OrientDB have) would be an ideal first step, since it would let the community use the substantial and well-maintained toolset built around these APIs. It would provide a familiar, functional, comfortable interface for anyone who knows graph databases, even if no other support were added. As far as core language support goes, you would presumably need graph-definition options on existing methods, some structural requirements for graph documents, and a couple of graph-traversal operators for queries and the aggregation pipeline. I would suggest:
I think it is extremely important that any graph implementation fully support sharding. In particular, it should be possible to define the graph collection with a shard key that is used to shard all vertices and edges. Every vertex and edge in the collection would share the same shard key, regardless of type, which allows a high level of control over cardinality and query isolation during graph traversal. Sharding a vertex is easy and would work as in any other collection. Sharding edges would be slightly more complicated: where _in and _out resolve to the same shard key value, the edge shards just like a vertex (except that _in and _out are evaluated for the key) and is stored in the same partition ranges, so that a vertex always lives on the same shard as its connected edges. Where _in and _out resolve to different values (a so-called "split edge"), the edge would need to be stored on both shards, again so that every edge for a given vertex can be found on that vertex's shard. I know that gets into territory you have been avoiding (a single record split across multiple servers as a result of a split shard key); I don't know whether or not that is a deal breaker here. Other possibilities to consider:
|
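Purely as an illustration of the shape proposed above, and not an existing MongoDB feature: a hypothetical graph collection in which vertices and edges share one shard key (here a made-up `tenant` field) and edges carry _in/_out references.

```javascript
// Hypothetical graph collection: vertices and edges stored side by side,
// all carrying the same shard key so a vertex co-locates with its edges.
db.graph.insertOne({ _id: "A", _type: "vertex", tenant: 17, name: "A" })
db.graph.insertOne({ _id: "B", _type: "vertex", tenant: 17, name: "B" })
db.graph.insertOne({
  _id: "A-knows-B",
  _type: "edge",
  label: "knows",
  tenant: 17,        // shard key value derived from _in/_out as described above
  _out: "A",         // tail: the edge leaves A ...
  _in: "B"           // head: ... and points at B
})

// One hop out of A can then be answered from A's shard alone.
db.graph.find({ _type: "edge", _out: "A" })
```

A "split edge" (different shard key values behind _in and _out) would be the case, noted above, where this single edge document has to exist on both shards.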
| Comment by Chad Kreimendahl [ 16/Dec/14 ] |
|
Sure. Some of that I'll have to provide privately. The short and sweet of it is: we have data that mostly fits a schema. However, we deliberately avoid being rigid about that schema, because it lets us make internal schema changes without ever modifying the database. Within that schema are references from one document to another, typically defined in a property that is an array of _ids within each document. That lets us know that the content in document A relates to documents B, C, and D, for example.

We also provide functionality that allows our users to create formulas that apply simple or even very complex math or functions to data n tiers away. So in our system we allow you to say, for example, that you want to sum the "Age" properties of all of document A's great-grandparent documents. What we've got is a graph of all our data relationships, which allows us to find, say, the grandparents of A through a specific relationship (a property holding an array of _ids) in order to bring back the right aggregation. Instead of querying A, then all of A's relationships (Ar), then all of Ar's relationships (Ar2), and then all of Ar2's relationships (the great-grandparents), we simply query the graph and get the list back as one set of _ids. Alternatively, in the formula scenario, we may change one of the great-grandparent documents and then need to know all of the great-grandchildren whose data must be updated based on that change. And that's the simple explanation; we actually do things wildly more complicated than that in some cases, in order to deliver some really cool features.
|
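Under the same hypothetical field names used earlier, the later $graphLookup stage can express this kind of great-grandparent aggregation in a single server-side pipeline rather than one query per hop; a sketch, not a statement of how the reporter's system works:

```javascript
// Sum the "age" of documents exactly three relationship hops away from "A".
db.things.aggregate([
  { $match: { _id: "A" } },
  { $graphLookup: {
      from: "things",
      startWith: "$relatedTo",
      connectFromField: "relatedTo",
      connectToField: "_id",
      as: "lineage",
      maxDepth: 2,            // depth 0 = parents, 1 = grandparents, 2 = great-grandparents
      depthField: "hops"
  }},
  { $unwind: "$lineage" },
  { $match: { "lineage.hops": 2 } },   // keep only the great-grandparents
  { $group: { _id: "$_id", greatGrandparentAgeSum: { $sum: "$lineage.age" } } }
])
```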
| Comment by Ramon Fernandez Marina [ 16/Dec/14 ] |
|
sallgeud, can you please provide more information on what an ideal implementation would do? Details on what you're currently doing (or on any similar existing system) and where MongoDB falls short, samples of the data model, semantics... anything that paints a clearer picture and helps define the scope of this feature request. Thanks, |