[SERVER-15099] Improve dependency generation for $unwind Created: 29/Aug/14  Updated: 09/Dec/22

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 2.6.1
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Maxime Beaudry Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 1
Labels: optimization, patrick
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 74 percent of the time spent in unwind.PNG     Zip Archive Mongo Profiling Result 2.zip     Zip Archive Mongo profiling result.zip     PNG File Performance summary.PNG    
Assigned Teams:
Query Optimization
Participants:

 Description   

Currently the DocumentSourceUnwind::getDependencies() method assumes that we need the whole array referenced by the $unwind, including all subfields. This is actually necessary to handle specific edge cases, such as when an array looks like [{subfield: 1, unneeded: "something large and expensive"}, "Not an object"]. In the current encoding scheme used by our dependency analyzer, if we were to depend on just "array.subfield", the $unwind would emit only 1 document from that array rather than the currently specified 2. While that wouldn't affect the outcome of all $groups, it would for groups with a {count: {$sum: 1}} clause in them. We could handle this correctly with a better encoding of dependencies in our analysis pass, one that notes we depend on the exact length of the array but only a subset of its contents.

If we add some way to encode that we only need the length of the array, $size should also use this.
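
For illustration, here is a minimal sketch of the edge case described above. The collection name demo is made up, and the explicit $project merely simulates what a naively narrowed dependency set would do:

db.demo.insert({_id: 1, array: [{subfield: 1, unneeded: "something large and expensive"}, "Not an object"]})

// Unwinding the whole array emits 2 documents, so the count is 2:
db.demo.aggregate([
 {$unwind: "$array"},
 {$group: {_id: 0, count: {$sum: 1}}}
])
// -> { "_id" : 0, "count" : 2 }

// If the dependency analyzer narrowed the fields to just "array.subfield", the
// non-object element would be stripped before the $unwind and the same count
// would come back as 1:
db.demo.aggregate([
 {$project: {"array.subfield": 1}},
 {$unwind: "$array"},
 {$group: {_id: 0, count: {$sum: 1}}}
])
// -> { "_id" : 0, "count" : 1 }

// A length-only dependency encoding would also let {$project: {n: {$size: "$array"}}}
// avoid fetching the array's contents.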

Original description:

A thread for this issue was already started in SERVER-13703. Since it is no longer related to the original issue of SERVER-13703, I am opening this new issue.

I have a collection that contains 10000 documents. Each document looks like the following:

{
 "_id" : NumberLong(1),
  "NonIndexedField" : NumberLong(1),
  "DateTime" : ISODate("2014-06-02T19:59:09.836Z"),
  "Field1" : "72eda069-1737-4db6-9367-8399ebdc0f8e",
  "Field2" : "2e72ccdc-0e40-49b8-9d13-2b74855fd938",
  "Field3" : "29e1caad-355e-4351-9d4b-f160225f41b3",
  "Field4" : "ef6083c5-b919-411b-b942-d984b050d91f",
  "Field5" : "39437c80-7423-4c50-b06f-59018b0437f5",
  "Field6" : "35d2301f-cb96-4392-9084-ff057142f35b",
  "ChildrenLevel1" : [a fairly big array]
}

Each document is about 512 kB on average, so ChildrenLevel1 is quite a big array. It contains only 10 elements, but each of these elements contains many other elements, sub-arrays, etc.

If I try to find the max value of a top level field using the aggregation framework, it takes virtually 0ms. Here is the query that I use:

db.Demo.aggregate([
	 {$group: {
	           _id : 0, 
	           max : {$max : "$DateTime"}
	          }
	 }
	])

If I try to do the same thing on a date that is inside the ChildrenLevel1 array, it now takes at least 36 seconds. The first time I ran the query it took even longer since the data was not in RAM; on subsequent calls it took 36s. That averages out to 3.6 milliseconds per document, which IMHO is quite long. Here is the query:

db.Demo.aggregate([
	 {$unwind: "$ChildrenLevel1"},
	 {$group: {
	           _id : 0, 
	           max : {$max : "$ChildrenLevel1.DateTime"}
	          }
	 }
	])

I created a backup of my test database. I can provide this backup if you want, but it is somewhat big: about 1 GB compressed. Instead of transferring such a big backup, I can provide a small Windows command-line executable that can be used to create the database.

What was interesting while creating the backup was seeing how fast the operation was:

d:\MongoDb\bin>mongodump.exe --db MongoProjectPerformanceProblem --out d:\MongoDb\PerfProblem.direct.backup --dbpath d:\MongoDb\Data
2014-08-27T06:23:07.916-0400 DATABASE: MongoProjectPerformanceProblem    to     d:\MongoDb\PerfProblem.direct.backup\MongoProjectPerformanceProblem
2014-08-27T06:23:08.181-0400 [tools] query MongoProjectPerformanceProblem.system.indexes planSummary: COLLSCAN ntoreturn:0 ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 numYields:0 locks(micros)
 W:50769 r:199372 nreturned:1 reslen:110 108ms
2014-08-27T06:23:08.181-0400    MongoProjectPerformanceProblem.system.indexes to d:\MongoDb\PerfProblem.direct.backup\MongoProjectPerformanceProblem\system.indexes.bson
2014-08-27T06:23:08.181-0400             1 documents
2014-08-27T06:23:08.181-0400    MongoProjectPerformanceProblem.Demo to d:\MongoDb\PerfProblem.direct.backup\MongoProjectPerformanceProblem\Demo.bson
2014-08-27T06:23:09.834-0400 [tools] warning Listener::getElapsedTimeMillis returning 0ms
2014-08-27T06:23:11.004-0400 [tools]            Collection File Writing Progress: 3900/10000    39% (documents)
2014-08-27T06:23:11.114-0400 [tools] warning Listener::getElapsedTimeMillis returning 0ms
2014-08-27T06:23:12.486-0400 [tools] warning Listener::getElapsedTimeMillis returning 0ms
2014-08-27T06:23:13.875-0400 [tools] warning Listener::getElapsedTimeMillis returning 0ms
2014-08-27T06:23:14.000-0400 [tools]            Collection File Writing Progress: 8300/10000    83% (documents)
2014-08-27T06:23:15.029-0400             10000 documents
2014-08-27T06:23:15.513-0400    Metadata for MongoProjectPerformanceProblem.Demo to d:\MongoDb\PerfProblem.direct.backup\MongoProjectPerformanceProblem\Demo.metadata.json
2014-08-27T06:23:15.528-0400 [tools] dbexit:
2014-08-27T06:23:15.528-0400 [tools] shutdown: going to close listening sockets...
2014-08-27T06:23:15.528-0400 [tools] shutdown: going to flush diaglog...
2014-08-27T06:23:15.528-0400 [tools] shutdown: going to close sockets...
2014-08-27T06:23:15.528-0400 [tools] shutdown: waiting for fs preallocator...
2014-08-27T06:23:15.528-0400 [tools] shutdown: closing all files...
2014-08-27T06:23:15.981-0400 [tools] closeAllFiles() finished
2014-08-27T06:23:15.981-0400 [tools] shutdown: removing fs lock...
2014-08-27T06:23:15.981-0400 [tools] dbexit: really exiting now

As you can see, it took only about 8 seconds. So reading the whole database from disk and then writing it back to disk was 4.5 times faster than processing the same amount of data with the aggregation framework... and that data was already loaded in RAM. Can you please confirm that you agree with me that the aggregation framework query should be much faster than the backup operation?

Note that all these tests were made on a machine that has lots of available memory. After my tests, there was still about 30 GB of unused RAM.

I have the feeling (I may be TOTALLY wrong here) that the aggregation framework copies the BSON documents contained in Mongo's memory-mapped file into a set of new objects, and that this is the cause of the performance degradation. Copying the objects may not be as efficient as it could be. Can you confirm whether I am wrong or right? I did a quick profiling of the source code and found that a lot of the time was spent around the following code:

case Array: {
   // Builds a brand-new Value for every element of the BSON array;
   // Value(sub) recursively converts (and copies) nested objects and arrays.
   intrusive_ptr<RCVector> vec (new RCVector);
   BSONForEach(sub, elem.embeddedObject()) {
       vec->vec.push_back(Value(sub));
   }
   _storage.putVector(vec.get());
   break;
}

Note that I would really like to contribute to resolving this issue. I am open to discussion and/or to contributing to Mongo's source code. I really need this issue to be fixed because it is blocking us from moving forward with dropping the costly SQL Server Enterprise + Microsoft OLAP Cube setup. I really believe that Mongo's aggregation framework coupled with the memory-mapped files can compete with (and probably beat) our current DataWarehouse and Cube.



 Comments   
Comment by Asya Kamsky [ 08/Mar/16 ]

While this issue is still important and will be considered, as of version 3.2 its impact can be minimized by using the new array operators/accumulators/expressions, which reduce the need for a full $unwind in some cases.

In particular, the specific example in the description can now apply $max directly to the array during the $group, without doing an $unwind first:

db.Demo.aggregate([
	 {$group: {
	           _id : 0, 
	           max : {$max : {$max: "$ChildrenLevel1.DateTime"}}
	          }
	 }
])

This ticket will still track improving dependency generation, as there are many cases more complex than this example where $unwind is necessary, but I wanted to highlight it in case it helps someone.

Comment by Asya Kamsky [ 04/Sep/14 ]

mabead it might make sense to take a look at your hardware (that discussion should move out of the SERVER project, though) - unless all your queries are quite slow, this may indicate very insufficient RAM.

I would suggest posting to the MongoDB-user Google Group to discuss others' performance on comparable hardware.

Comment by Maxime Beaudry [ 04/Sep/14 ]

Your interpretation of the csv is correct. So if I take my second Mongo query, which took 145 seconds, here is the breakdown:

  1. 54s (37%) for extracting the raw content from the db... when data is already loaded in RAM.
  2. 48s (33%) doing useless $project
  3. 43s (30%) doing the rest of my aggregation

Keeping in mind that SQL Server was able to achieve the same result in 29s, and that I have an OLAP Cube that runs even faster, I don't see how Mongo can compete. Even if the $project were 100% optimized away, it would still take 54+43=97s. That's still 3.3 times slower than SQL Server.

All this investigation also left me disappointed on two other aspects:

  1. 54s was spent extracting the raw content from the db even though that content was already in RAM. Since the data was already loaded in RAM, I was hoping for this to be virtually free.
  2. Loading the data into RAM takes terribly long. As I saw, the first execution of the query took 18 minutes. Here I don't know what is to blame: my hardware or Mongo.

For these two reasons I will have to look at other alternatives. I still think that Mongo is an excellent database for operational transactions, but it is not ready for really fast aggregations. I am now thinking about the Hadoop-Mongo integration. Do you think that this could yield significantly better results?

Thanks for your excellent support. I will certainly continue to consider Mongo for other use cases.

Comment by Mathias Stearn [ 03/Sep/14 ]

$project is known to be slower than it should be and in need of some optimization work. If I'm reading the csv correctly, it looks like 37% of the time is under DocumentSourceCursor::getNext (which is what extracts the raw data from the db) and 70% under the $project (which calls into the DocumentSourceCursor). This means that about 33% of the time is spent actually doing the projection. That is especially unfortunate in this case, where the $project isn't really needed; it is there only to inform the cursor which fields we need.

The only "workaround" I can really offer at this point is to pre-project just the info you need into smaller documents, possibly with shorter field names.
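
As a sketch of that pre-projection workaround (the target collection name DemoTrimmed is made up; $out has been available since 2.6), the trimming could be done once with an aggregation, and later pipelines run against the smaller copy:

db.Demo.aggregate([
 {$project: {"ChildrenLevel1.DateTime": 1}},  // keep only the fields the later pipeline needs
 {$out: "DemoTrimmed"}                        // $out must be the last stage
])

// Subsequent aggregations then unwind much smaller documents:
db.DemoTrimmed.aggregate([
 {$unwind: "$ChildrenLevel1"},
 {$group: {_id: 0, max: {$max: "$ChildrenLevel1.DateTime"}}}
])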

Comment by Maxime Beaudry [ 03/Sep/14 ]

I just ran another Visual Studio 2013 profiling session and I attached the result ("74 percent of the time spent in unwind.PNG" and "Mongo Profiling Result 2.zip"). As you can see in the PNG, 74% of the time is spent in the $unwind and 70% in the "workaround" $project.

Comment by Maxime Beaudry [ 03/Sep/14 ]

I tried with a real dataset. The collection that I query contains 1.5 million documents that consume 106 GB. That's about 74 kB per document. Here is my query:

> db.OperationRuns.count( { 'orProps.endTime': { $gte: ISODate("2010-09-20T00:00:00Z") } } )
206972
 
> db.OperationRuns.aggregate(
...  [
...  {$match : {
...              'orProps.endTime' : { $gte: ISODate("2010-09-20T00:00:00Z") }
...            }
...  },
...  {$project: {
...              "orProps.Operation" : 1,
...              "seqRuns.stepRuns.props.status" : 1
...             }
...  },
...  {$unwind: "$seqRuns"},
...  {$unwind: "$seqRuns.stepRuns"},
...  {$group: {
...            _id : {operation : "$orProps.Operation", status : "$seqRuns.stepRuns.props.status"},
...            sum : {$sum : 1}
...           }
...  }
...  ]).pretty()
{ "_id" : { "operation" : "BSCAN", "status" : "PASS" }, "sum" : 10021 }
{ "_id" : { "operation" : "ISPL", "status" : "FAIL" }, "sum" : 40 }
{ "_id" : { "operation" : "SIAT", "status" : "FAIL" }, "sum" : 475 }
{ "_id" : { "operation" : "AD", "status" : "FAIL" }, "sum" : 133 }
{ "_id" : { "operation" : "AD", "status" : "PASS" }, "sum" : 17516 }
{ "_id" : { "operation" : "ISPX", "status" : "PASS" }, "sum" : 7944 }
{ "_id" : { "operation" : "ICT", "status" : "FAIL" }, "sum" : 266031 }
{ "_id" : { "operation" : "FT2", "status" : "FAIL" }, "sum" : 336 }
{ "_id" : { "operation" : "BSCAN", "status" : "FAIL" }, "sum" : 129 }
{ "_id" : { "operation" : "ISPL", "status" : "PASS" }, "sum" : 3710 }
{ "_id" : { "operation" : "HP", "status" : "FAIL" }, "sum" : 84 }
{ "_id" : { "operation" : "BI", "status" : "PASS" }, "sum" : 844402 }
{ "_id" : { "operation" : "HP", "status" : "PASS" }, "sum" : 102405 }
{ "_id" : { "operation" : "FIST", "status" : "PASS" }, "sum" : 6525386 }
{ "_id" : { "operation" : "FT", "status" : "FAIL" }, "sum" : 171562 }
{ "_id" : { "operation" : "FT", "status" : "PASS" }, "sum" : 22586408 }
{ "_id" : { "operation" : "BP", "status" : "PASS" }, "sum" : 1317643 }
{ "_id" : { "operation" : "FT2", "status" : "PASS" }, "sum" : 163469 }
{ "_id" : { "operation" : "BI", "status" : "FAIL" }, "sum" : 2157 }
{ "_id" : { "operation" : "FIST", "status" : "FAIL" }, "sum" : 75468 }
{ "_id" : { "operation" : "ISPX", "status" : "FAIL" }, "sum" : 106 }
{ "_id" : { "operation" : "BP", "status" : "FAIL" }, "sum" : 3158 }
{ "_id" : { "operation" : "SIAT", "status" : "PASS" }, "sum" : 392002 }
{ "_id" : { "operation" : "ICT", "status" : "PASS" }, "sum" : 684249 }
>

As you can see, the aggregation query targets 206972 documents and, once all the arrays are unwound, the group operation processes about 33 million documents. That's about 160 leaf documents per top-level document. The first time I ran the query it took 1096 seconds (18 minutes). On the second execution (once the data was in RAM), it took 145 seconds (2.4 minutes):

2014-09-03T06:58:31.992-0400 [conn2] command WIP.$cmd command: aggregate { aggregate: "OperationRuns", pipeline: [ { $match: { orProps.endTime: { $gte: new Date(1284940800000) } } }, { $project: { orProps.Operation: 1.0, seqRuns.stepRuns.props.status: 1.0 } }, { $unwind: "$seqRuns" }, { $unwind: "$seqRuns.stepRuns" }, { $group: { _id: { operation: "$orProps.Operation", status: "$seqRuns.stepRuns.props.status" }, sum: { $sum: 1.0 } } } ], cursor: {} } keyUpdates:0 numYields:138597 locks(micros) r:1014754266 reslen:1729 1096127ms
2014-09-03T07:01:58.078-0400 [conn2] command WIP.$cmd command: aggregate { aggregate: "OperationRuns", pipeline: [ { $match: { orProps.endTime: { $gte: new Date(1284940800000) } } }, { $project: { orProps.Operation: 1.0, seqRuns.stepRuns.props.status: 1.0 } }, { $unwind: "$seqRuns" }, { $unwind: "$seqRuns.stepRuns" }, { $group: { _id: { operation: "$orProps.Operation", status: "$seqRuns.stepRuns.props.status" }, sum: { $sum: 1.0 } } } ], cursor: {} } keyUpdates:0 numYields:2443 locks(micros) r:55146951 reslen:1729 145895ms

By the way, here is the output of 'explain':

{
        "stages" : [
                {
                        "$cursor" : {
                                "query" : {
                                        "orProps.endTime" : {
                                                "$gte" : ISODate("2010-09-20T00:00:00Z")
                                        }
                                },
                                "fields" : {
                                        "orProps.Operation" : 1,
                                        "seqRuns.stepRuns.props.status" : 1,
                                        "_id" : 1
                                },
                                "plan" : {
                                        "cursor" : "BtreeCursor ",
                                        "isMultiKey" : false,
                                        "scanAndOrder" : false,
                                        "indexBounds" : {
                                                "orProps.endTime" : [
                                                        [
                                                                ISODate("2010-09-20T00:00:00Z"),
                                                                ISODate("0NaN-NaN-NaNTNaN:NaN:NaNZ")
                                                        ]
                                                ]
                                        },
                                        "allPlans" : [
                                                {
                                                        "cursor" : "BtreeCursor ",
                                                        "isMultiKey" : false,
                                                        "scanAndOrder" : false,
                                                        "indexBounds" : {
                                                                "orProps.endTime" : [
                                                                        [
                                                                                ISODate("2010-09-20T00:00:00Z"),
                                                                                ISODate("0NaN-NaN-NaNTNaN:NaN:NaNZ")
                                                                        ]
                                                                ]
                                                        }
                                                }
                                        ]
                                }
                        }
                },
                {
                        "$project" : {
                                "orProps" : {
                                        "Operation" : true
                                },
                                "seqRuns" : {
                                        "stepRuns" : {
                                                "props" : {
                                                        "status" : true
                                                }
                                        }
                                }
                        }
                },
                {
                        "$unwind" : "$seqRuns"
                },
                {
                        "$unwind" : "$seqRuns.stepRuns"
                },
                {
                        "$group" : {
                                "_id" : {
                                        "operation" : "$orProps.Operation",
                                        "status" : "$seqRuns.stepRuns.props.status"
                                },
                                "sum" : {
                                        "$sum" : {
                                                "$const" : 1
                                        }
                                }
                        }
                }
        ],
        "ok" : 1

At first sight, this looks slow. I therefore executed an equivalent query on a SQL Server Enterprise that had about the same dataset:

	SELECT operation.F_OPERATION_NAME, dimStatus.F_RESULT_STATUS, COUNT(*) As cnt
	FROM T_FACT_STEP_RUN stepRun
	JOIN T_DIM_SEQUENCE_RUN seqRun ON seqRun.F_SEQUENCE_RUN_KEY = stepRun.F_SEQUENCE_RUN_KEY AND seqRun.F_PARTITION_DATE = stepRun.F_PARTITION_DATE
	JOIN T_FACT_OPERATION_RUN opRun ON opRun.F_OPERATION_RUN_KEY = seqRun.F_OPERATION_RUN_KEY
	JOIN T_DIM_RESULT_STATUS dimStatus ON stepRun.F_STEP_RUN_RESULT_STATUS_KEY = dimStatus.F_RESULT_STATUS_KEY
	JOIN T_DIM_OPERATION operation ON opRun.F_OPERATION_KEY = OPERATION.F_OPERATION_KEY
	WHERE stepRun.F_STEP_END_TIME >= '2010-09-20'
	GROUP BY operation.F_OPERATION_NAME, dimStatus.F_RESULT_STATUS

This took 29 seconds (on the first try!). That's 5 times faster than Mongo's second attempt (once the data was loaded in RAM) and 38 times faster than Mongo's first attempt (when the data was not loaded in RAM).

In fact, what I am attempting to do is replace a SQL Server DataWarehouse / Microsoft OLAP Cube architecture with Mongo. In the real use case, we don't typically run this aggregation against the DataWarehouse; we run the query against the cube. So in real life our application can do this aggregation in less than 10 seconds. That's about 14.5 times faster than Mongo's best performance here.

I was hoping to get equivalent performance from Mongo, but with much lower complexity, a lower price, and better scalability. From these metrics, I am not sure that is possible. Increasing the performance by a factor of 14x seems like a long shot.

Could I please have your perspective on this? Do you think that the poor performance I see is caused by the $unwind? If I had a model where I didn't have to unwind, do you think I would see a drastic improvement?

Comment by Maxime Beaudry [ 02/Sep/14 ]

Still with my test database, I made the query a little more complex by unwinding two other sub-arrays (which also contain 10 elements each):

db.Demo.aggregate([
 {$match : {_id : {$lte : 10000}}},
 {$project: {"ChildrenLevel1.ChildrenLevel2.ChildrenLevel3.Counter": 1}},
 {$unwind : "$ChildrenLevel1"},
 {$unwind : "$ChildrenLevel1.ChildrenLevel2"},
 {$unwind : "$ChildrenLevel1.ChildrenLevel2.ChildrenLevel3"},
 {$group: {
           _id : 0, 
           max : {$max : "$ChildrenLevel1.ChildrenLevel2.ChildrenLevel3.Counter"}
          }
 }
])

This query took about 25s to complete. It scanned 10000 documents that each had 10 elements in ChildrenLevel1, 10 elements in ChildrenLevel2 and 10 elements in ChildrenLevel3. So in the end the query processed 10 million values of ChildrenLevel1.ChildrenLevel2.ChildrenLevel3.Counter. That's about 2.5 ms per top-level document and 0.0025 ms per ChildrenLevel1.ChildrenLevel2.ChildrenLevel3.Counter element. That's pretty good!

I will try (probably tomorrow) on a more real life database.

Comment by Maxime Beaudry [ 02/Sep/14 ]

I tried with the extra $project and it did improve performance significantly. The query took 36s without the $project and 500ms with it. That's about 70 times faster! Thanks a lot! I will now try your "tip" with our real database; the one that I used in the current issue was just something I could share easily. I will let you know how it goes.

Note that this morning I ran a profiling session on mongod.exe with Visual Studio 2013. I attached the result in case it is of any help. This was done without the extra $project.

Thanks again!

Comment by Mathias Stearn [ 02/Sep/14 ]

Updated the title and description to reflect the underlying issue.

Comment by Mathias Stearn [ 02/Sep/14 ]

mabead78, perhaps that comment should be phrased a bit more clearly, but a key phrase is the initial "If needed". The case in your aggregation is actually highly optimized not to copy everything for each unwound element, instead doing an in-place replacement. It can do that when it detects that no one else is holding on to a copy of the document returned by $unwind.

However, I think you are running into a different issue. Currently the getDependencies() method assumes that we need the whole array referenced by the $unwind, while in your example we only need a specific subfield. This is actually necessary to handle specific edge cases, such as when your ChildrenLevel1 array looks like [{DateTime: 1234}, "Not an object"]. In the current encoding scheme used by our dependency analyzer, if we were to depend on just "ChildrenLevel1.DateTime", the $unwind would emit only 1 document from that array rather than the currently specified 2. While that wouldn't affect the outcome of your $group, it would if you had a {count: {$sum: 1}} clause in there. We could handle this correctly with a better encoding of dependencies in our analysis pass, one that notes we depend on the exact length of the array but only a subset of its contents, but that has not been designed yet.

While I usually discourage adding an explicit $project where it isn't needed, I expect it will actually help as a workaround in this case. Try adding {$project: {"ChildrenLevel1.DateTime": 1}} to the front of your pipeline and see if it improves performance. Again, only do this in cases where the built-in dependency analyzer misses something, and only after confirming that it is measurably faster. You can see what the built-in dependency analyzer decided was necessary by looking at the "fields" field of the $cursor stage in an explain.
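
For concreteness, applied to the pipeline from the description, the workaround is simply the original query with the extra $project stage prepended (the reporter measured the query dropping from 36s to about 500ms with it):

db.Demo.aggregate([
 {$project: {"ChildrenLevel1.DateTime": 1}},  // tells the cursor which field we actually need
 {$unwind: "$ChildrenLevel1"},
 {$group: {
           _id : 0,
           max : {$max : "$ChildrenLevel1.DateTime"}
          }
 }
])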

Comment by Maxime Beaudry [ 30/Aug/14 ]

I spent some time looking at the source code and found this interesting comment in DocumentSourceUnwind::Unwinder::getNext():

    boost::optional<Document> DocumentSourceUnwind::Unwinder::getNext() {
        if (_inputArray.missing() || _index == _inputArray.getArrayLength())
            return boost::none;
 
        // If needed, this will automatically clone all the documents along the
        // field path so that the end values are not shared across documents
        // that have come out of this pipeline operator.  This is a partial deep
        // clone. Because the value at the end will be replaced, everything
        // along the path leading to that will be replaced in order not to share
        // that change with any other clones (or the original).
 
        _output.setNestedField(_unwindPathFieldIndexes, _inputArray[_index]);
        _index++;
        return _output.peek();
    }

It therefore looks like data is being copied...
