[SERVER-40759] New Agg Metadata Source to Generate A Single Empty Document
Created: 22/Apr/19  Updated: 14/Dec/22

Status: Backlog
Project: Core Server
Component/s: Querying
Affects Version/s: 4.0.0
Fix Version/s: None

Type: Improvement
Priority: Minor - P4
Reporter: Patrick Meredith
Assignee: Backlog - Query Execution
Resolution: Unresolved
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-39108 Make $limit accept 0 for all document... Closed
Assigned Teams:
Query Execution
Participants:

 Description   

There are quite a few situations in which {$limit: 0} would be useful. For instance, in the BI-Connector we currently use $collStats to inject a single document when we need to push down a subquery that needs exactly one projected result out.

However, $collStats has some issues, and we think $facet might be a better choice. But in this case we would prefer {$limit: 0} to {$limit: 1}, since we literally don't care about the result, so for example:

select a from foo inner join (select "hello")

We would like to push down the subquery select "hello" as:

{$facet: {out: [{$limit: 0}]}}
{$project: {'hello': 'hello', '_id': 0}}

At some future point this could even be optimized to not look at any documents (currently, on a collection of 100K docs, {$limit: 1} will look at 853 of them in a $facet on server 4.0), but for now, simply changing the error condition from non-positive to negative would be an improvement.
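
As a rough illustration of the change requested above, the two checks can be modelled in plain JavaScript. This is only a sketch: it is not the server's parsing code, and the function names and error messages are hypothetical.

```javascript
// Hypothetical model of the $limit argument check; not taken from the server source.

// Behaviour described in this ticket: any non-positive limit is an error.
function validateLimitCurrent(n) {
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error("the limit must be positive");
  }
  return n;
}

// Requested behaviour: only negative limits are an error, so 0 is accepted.
function validateLimitRequested(n) {
  if (!Number.isInteger(n) || n < 0) {
    throw new Error("the limit must be non-negative");
  }
  return n;
}
```

Under the requested check, {$limit: 0} inside the $facet sub-pipeline would be accepted rather than rejected at parse time.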



 Comments   
Comment by Kevin Pulo [ 13/May/19 ]

You're saying you want a single document synthesized from a stage regardless of how many documents are passed into it?

Yes. Like $collStats and friends, this would only really make sense at the start of a pipeline, and it would be fine if that was enforced (to prevent pipeline coding errors from causing the results of earlier stages to be accidentally lost), though this might affect usage in some situations (eg. views).

The direct motivation is that currently you can get a single empty document from somewhere else, but they are all hacks. The possibilities are:

  • An actual document from the collection. This could be inefficient, if the doc is large or disk is involved (because the entire contents of the doc are immediately thrown away). It fails if the collection is empty. It may require talking to a shard, refreshing shard metadata, etc.
  • The result of $collStats. This might require locks/mutexes? In any case, it certainly does needless work, because all the fields are thrown away. It still needs a trailing $limit: 1 in case the collection is sharded. It's also inefficient in that case because it will contact all the relevant shards and wait for their results. It's not allowed inside a transaction. The user might not have the necessary privs to run this stage.
  • The result of $currentOp. Again, this does needless work. It needs to run on the admin db. If sharded then inprog priv is required. Not allowed in a transaction.

The indirect motivation is sub-queries.

In the case of the BIC, this is a direct need (I believe — I'm no expert on SQL subqueries). The $lookup stage already serves the purpose just fine, as long as a suitable input document can be crafted for it.

In my case, I want to issue 1000 queries to the server, each of the form:

// as a query:
{ foobar: { $gte: "baz" }, $limit: 1 }
 
// as a pipeline:
[ { $match: { foobar: { $gte: "baz" } } },
  { $limit: 1 } ]

without having to do 1000 round-trips to the server (which is my only other option). While a "bulk query" feature could be implemented to support this, again the situation is such that $lookup does what I need, as long as I can craft the input to it. Here's a mongo shell implementation of this idea.

Basically, it's possible, but clumsy, to do this:

[ { $collStats: {} },
  { $limit: 1 },
  { $project: { foo: [ "bar", "baz" ] } },
  { $unwind: "$foo" },
  { $lookup: ... } ]

and so the ask here is to be able to instead do:

[ { $emptyDocument: {} },
  { $project: { foo: [ "bar", "baz" ] } },
  { $unwind: "$foo" },
  { $lookup: ... } ]

which is clearly better in a variety of ways.
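
To make the current workaround concrete, here is a minimal sketch of a helper that builds the clumsy pipeline for a batch of values. The helper name, the `targetColl` parameter, the `match` output field and the body of the `$lookup` are assumptions for illustration only; the ticket leaves the `$lookup` unspecified.

```javascript
// Build the $collStats-based workaround pipeline for a batch of values.
// "buildBulkQueryPipeline", "targetColl" and "match" are hypothetical names.
function buildBulkQueryPipeline(values, targetColl) {
  return [
    // Hack: $collStats yields a document to start from; its fields are discarded.
    { $collStats: {} },
    // Needed in case the collection is sharded (one result per shard).
    { $limit: 1 },
    // Drop the stats fields and attach the array of values to query for.
    { $project: { _id: 0, foo: { $literal: values } } },
    // One document per value.
    { $unwind: "$foo" },
    // Assumed per-value sub-query: first document with foobar >= value.
    {
      $lookup: {
        from: targetColl,
        let: { v: "$foo" },
        pipeline: [
          { $match: { $expr: { $gte: ["$foobar", "$$v"] } } },
          { $limit: 1 },
        ],
        as: "match",
      },
    },
  ];
}
```

In the mongo shell this might be run as `db.someColl.aggregate(buildBulkQueryPipeline(["bar", "baz"], "someColl"))`, where `someColl` is a placeholder. With the proposed stage, the first two stages would collapse into `{ $emptyDocument: {} }`.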

I'm deliberately not asking for more substantial/advanced features (eg. bulk-query or "actual" sub-query) because those would be a lot of work, whereas once the synthetic document has been obtained, the existing tools available ($lookup, $project, $addFields, $replaceRoot, etc) are sufficient to achieve the desired goal without too much hoop-jumping.

For reference, I expect the proposed $emptyDocument stage would have a trivial core implementation pretty close to this:

DocumentSource::GetNextResult DocumentSourceEmptyDocument::getNext() {
    pExpCtx->checkForInterrupt();
 
    if (_finished) {
        return GetNextResult::makeEOF();
    }
 
    _finished = true;
    return {Document()};
}

which is why I'm trying to minimise the amount of supporting code (eg. to parse and use a given literal document), because the idea is just some simple and easy sugar to replace the existing clumsy technique.
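
The behaviour of that core loop can be modelled in a few lines of plain JavaScript. This is only a sketch of the semantics (one empty document, then end-of-stream); the class, method and field names are hypothetical and it is not server code.

```javascript
// Plain-JavaScript model of the proposed stage; all names are hypothetical.
class EmptyDocumentSource {
  constructor() {
    this.finished = false;
  }

  // Returns { done: false, value: {} } once, then { done: true } on every later call.
  getNext() {
    if (this.finished) {
      return { done: true };
    }
    this.finished = true;
    return { done: false, value: {} };
  }
}

// Drain a source into an array, the way a downstream consumer would.
function drain(source) {
  const out = [];
  for (let r = source.getNext(); !r.done; r = source.getNext()) {
    out.push(r.value);
  }
  return out;
}
```

Draining a fresh `EmptyDocumentSource` yields exactly one empty object, regardless of any collection contents.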

Other possible solutions could be things like the below (with hopefully obvious semantics), all of which I would expect to be more work for not much benefit.

[ { $literalDocument: { foo: [ "bar", "baz" ] } },
  { $unwind: "$foo" },
  { $lookup: ... } ]

[ { $literalDocuments: [
    { foo: "bar" },
    { foo: "baz"}
  ] },
  { $lookup: ... } ]

[ { $injectDocument: {
    where: "start/end/n",
    passThrough: false,
    document: { foo: [ "bar", "baz" ] }
  } },
  { $unwind: "$foo" },
  { $lookup: ... } ]

[ { $injectDocuments: {
    where: "start/end/n",
    passThrough: false,
    documents: [
      { foo: "bar" },
      { foo: "baz" }
    ]
  } },
  { $lookup: ... } ]

[ { $generateDocuments: {} },
  { $limit: 2 },
  { $project: {
    foo: { $switch: { branches: [
      { case: { $eq: [ "$_id", "0" ] }, then: "bar" },
      { case: { $eq: [ "$_id", "1" ] }, then: "baz" }
    ] } }
  } },
  { $lookup: ... } ]

Comment by Asya Kamsky [ 02/May/19 ]

You're saying you want a single document synthesized from a stage regardless of how many documents are passed into it? If that's the case why can't you use $collStats (you say "issues" but you don't say what they are).

Rather than requesting a specific implementation, can you please describe your use case (completely) so we can figure out the best way to address it long term?

Comment by Kevin Pulo [ 23/Apr/19 ]

I've also had a similar need in the past. In that case, I used a collection that I knew would have at least one document, and then started my pipeline with

    { $limit: 1 },
    { $project: { _id: 0, foo: "" } },
    { $project: { foo: 0 } },

to get a single empty document, that I then populated as necessary for the sub-query.

Rather than mess around like this, or with similar $collStats or $facet hacks, would it be better to have a simple "$emptyDocument" DocumentSource stage that just outputs a single empty document? You could then easily $project in whatever fields you like, as normal. And if you want multiple created documents, then you could $project an array and $unwind it.
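
The "$project an array and $unwind it" idea can be sketched as a small helper that returns the stages to place after whatever produces the single starting document. The helper name and the temporary field name `docs` are hypothetical; `$literal` is used so that string values are not interpreted as field paths.

```javascript
// Stages that turn one input document into one output document per literal.
// "expandLiteralDocuments" and the temporary field "docs" are hypothetical names.
function expandLiteralDocuments(literals) {
  return [
    // Discard the input's fields and attach the literals as an array.
    { $project: { _id: 0, docs: { $literal: literals } } },
    // One document per array element.
    { $unwind: "$docs" },
    // Promote each element to be the whole document.
    { $replaceRoot: { newRoot: "$docs" } },
  ];
}
```

For example, `[{ $limit: 1 }, ...expandLiteralDocuments([{ foo: "bar" }, { foo: "baz" }])]` would be the collection-based form of this today, with `{ $limit: 1 }` standing in for the proposed "$emptyDocument" stage.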

Comment by Kelsey Schubert [ 22/Apr/19 ]

Opposite of SERVER-39108.
