[SERVER-18210] Add query for document structure Created: 25/Apr/15  Updated: 06/Dec/22  Resolved: 18/Dec/17

Status: Closed
Project: Core Server
Component/s: Aggregation Framework, Querying
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Yair Lenga Assignee: Backlog - Query Team (Inactive)
Resolution: Won't Fix Votes: 0
Labels: expression, stage
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-18207 Allow Queries for limit String sizes Closed
is duplicated by SERVER-18208 Allow Queries to find ARRAY sizes Closed
is duplicated by SERVER-18209 Add QUERY option to limit ARRAY size Closed
Related
related to SERVER-13447 provide $projection operator to get t... Closed
Assigned Teams:
Query
Backwards Compatibility: Fully Compatible
Participants:

 Description   

Please add a query that will return the "structure" of a document instead of the data. The structure will allow an application to understand the data that is embedded in the document, and then construct an efficient query.

In concept, this is similar to the JDBC metadata query.

Given the schema-less nature of MongoDB, I'm not sure that there is a perfect solution. The basic idea is to summarize the data. In theory, constructing a JSON schema from the data would work, but that is an unrealistic approach.

Some ideas (a hypothetical illustration follows the list):

  • Replace STRING values with "STRING(N)" (where N is the size).
  • Replace ARRAY with ARRAY of TYPE. If all entries have the same type (e.g., number, string), it becomes ARRAY[N] of type; otherwise ARRAY[N] of object.
  • Replace BLOB with BLOB(N).
  • Replace an array of OBJECTS with ARRAY[N] of ([ list of available attributes ]).
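
As a purely hypothetical illustration (the field names and values here are invented), a document such as

    {
      "name":    "ACME Corp",
      "tags":    [ "alpha", "beta", "gamma" ],
      "payload": BinData(0, "SGVsbG8="),
      "points":  [ { "x": 1, "y": 2 }, { "x": 3, "y": 4 } ]
    }

might be summarized as

    {
      "name":    "STRING(9)",
      "tags":    "ARRAY[3] of string",
      "payload": "BLOB(5)",
      "points":  "ARRAY[2] of ([x, y])"
    }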

Motivation:
When processing documents from existing repositories whose full structure is unknown, applications are forced to load complete documents just to find out what data is available.

The ability to get (some) metadata would reduce the amount of data loaded by a factor of 100x for our application.



 Comments   
Comment by Asya Kamsky [ 19/Jul/17 ]

As we now have the $objectToArray and $arrayToObject expressions, along with $type, $size, $strLenCP, etc., I think this ticket can be closed.
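
For reference, a minimal sketch of how those expressions (all available as of the 3.4 series) can be combined to summarize the top-level structure of a single document; the collection name "events" is illustrative:

    db.events.aggregate([
      { $limit: 1 },  // summarize one document at a time
      { $replaceRoot: { newRoot: { $arrayToObject: { $map: {
          input: { $objectToArray: "$$ROOT" },  // object -> array of { k, v } pairs
          as: "kv",
          in: {
            k: "$$kv.k",
            v: { $switch: {
              branches: [
                { case: { $eq: [ { $type: "$$kv.v" }, "string" ] },
                  then: { t: "string", len: { $strLenCP: "$$kv.v" } } },
                { case: { $isArray: "$$kv.v" },
                  then: { t: "array", len: { $size: "$$kv.v" } } }
              ],
              default: { t: { $type: "$$kv.v" } }  // everything else: just the BSON type
            } }
          }
      } } } } }
    ])

This only summarizes top-level fields; nested documents would need the same expression applied per nesting level, but it demonstrates that the ticket's use case is now expressible server-side.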

Comment by Charlie Swanson [ 08/Mar/16 ]

I think a better way to achieve this desired outcome would be to provide a way to get the keys out of an object, and possibly to reconstruct an object. If you can manipulate the field names and the corresponding values of an object, then the rest of the summarization could be done using $size, $strLen (code points or bytes, see SERVER-14670), or $type (SERVER-13447). For example, I think this could be accomplished by something like these expressions:

  • $unwindObject (work for a similar expression is being tracked under SERVER-11392) - Takes an object and returns a vector of tuples (key, value) for the object.
  • $constructObject - Takes a vector of tuples (key, value) and constructs an object.

With those expressions, one could unwind an object, then do a $map over the (key, value) pairs, replacing the value with some summary of the value, then reconstruct the object with the new values.
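
A rough sketch of that pattern, using the hypothetical expression names proposed above (neither exists in the server; $objectToArray and $arrayToObject are the equivalents that eventually shipped):

    { $project: {
        structure: { $constructObject: { $map: {
          input: { $unwindObject: "$$ROOT" },     // hypothetical: object -> vector of (key, value) tuples
          as: "kv",
          in: { key: "$$kv.key",                  // keep the key as-is
                value: { $type: "$$kv.value" } }  // replace the value with a summary, here just its type
        } } }
    } }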

Obviously I haven't fully thought through what those would look like, but I think those would be more generally useful than an expression to summarize the data.

asya does this sound reasonable to you?
yair.lenga@gmail.com can you confirm this would satisfy your use case?

Comment by Ramon Fernandez Marina [ 05/Mar/16 ]

For those watching the ticket without knowledge of JIRA and our use of it, this is to let you know that this feature request has been sent to the Query team for consideration in their next round of planning. Any updates to its status will be posted on this ticket.

Thanks,
Ramón.

Comment by Neville Dipale [ 25/Apr/15 ]

I agree with you on the [implement X in the server so we don't implement it ourselves in the client] point, and I think there are a number of other features which we as users would love to see. One approach to this is for the server-side scripting to be improved/overhauled into something that lets us create procedures/functions in the server to achieve what we want. Otherwise, three years down the line we'll have lots of arbitrary functions that shouldn't be in core, or people trying to achieve what you're after won't get the features they request and will move elsewhere.

I think the [server-side scripting] bit would reduce the load on the Mongo team in the long run because they wouldn't need to maintain a lot of extra functions.

I would love to move a number of my scripts from the app into the server so that I remove the network time cost I incur every time I run certain queries.

Comment by Yair Lenga [ 25/Apr/15 ]

I believe that the major benefit of this feature is the reduction in the amount of data that is TRANSFERRED from the server to the client. By moving the functionality into the MongoDB server, the same amount of work is performed, but the size of the transferred data is reduced.

I agree that detecting the structure of the whole collection is a big task. I hope to have the ability to summarize the structure of a small subset, one document at a time. For my case (large time series embedded in ~4MB documents), the metadata describing the 4MB set was less than 4K (with int[5000] in the metadata, representing ~32K of int data). I will be happy with this saving, leaving the much harder problem (metadata for the collection as a whole) for the future.

For a server-based application, where the bulk of the processing is done on the web server (Java, in my case), performing the processing in MongoDB will reduce the time spent on encoding, network transfer, and decoding, as well as the memory requirements for the document. In our case, we noticed that this transfer/parse time is where most of the time is spent; reading the data in Mongo seems to be a small fraction of it.

For a browser-based application, where the data is transported over the internet into a JavaScript application, I believe the saving is going to be significantly higher, as internet transfer rates are an order of magnitude slower than a server-to-MongoDB connection, in addition to the encryption/decryption cost.

As far as caching goes, I can only comment on my planned usage - creating large time-based data sets: the calls are performed in response to interactive user requests against ranges that the user needs. The critical thing is the time to deliver that data. It is unlikely that different users will ask for the structure of the same documents, so I'm not sure caching will help my specific application.

Comment by Neville Dipale [ 25/Apr/15 ]

Won't this force Mongo to also load complete documents (an entire disk read?) in order to return the structure? What happens when there are, say, 1 million documents with enough differences that 10 or more distinct schemas exist?

If it is possible to implement the feature efficiently, the structure could perhaps be cached so that an entire collection scan is not performed each time the query is run.
