[SERVER-8954] Index Key Extraction Much Slower for Some Data Schemas Than Others Created: 12/Mar/13 Updated: 08/Dec/15 Resolved: 08/Dec/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Index Maintenance |
| Affects Version/s: | 2.2.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Michael Henson | Assignee: | Unassigned |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Ubuntu 12.04 LTS |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Steps To Reproduce: | To run the test case using objects in the array: ./test-indexes.py --object
To run the test case using integers in the array: ./test-indexes.py --integer |
| Participants: | |
| Description |
|
I have a collection whose documents essentially consist of an _id and a list of objects, indexed with a compound multikey index: db.test.ensureIndex({_id: 1, 'items.id': 1}). With a small number of items in the list, insert and update times are reasonable, but they degrade badly as the list grows. A sample line from the test output: Inserted document with 1000 items in 0.048251 seconds. I've attached a test program that creates the output described above. I've also run the same test with integers as the list items instead of objects, and the indexing cost is dramatically lower. |
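The attached test program is not reproduced in this ticket; the following is a minimal sketch of the same style of experiment using pymongo. The db.test collection, the {_id: 1, 'items.id': 1} index, and the 1000-item count come from the report; the database name, the shape of each list item, and the index used for the integer variant are assumptions for illustration.

```python
# Minimal sketch of the reported experiment, assuming a local mongod and pymongo.
import time
from pymongo import MongoClient, ASCENDING

db = MongoClient("localhost", 27017).test_db

def timed_insert(doc, index_field):
    db.test.drop()
    # Compound multikey index: one index entry per array element.
    db.test.create_index([("_id", ASCENDING), (index_field, ASCENDING)])
    start = time.time()
    db.test.insert_one(doc)
    print("Inserted document with %d items in %f seconds"
          % (len(doc["items"]), time.time() - start))

# Case 1: array of sub-objects -- the 'id' field of each element becomes an index key.
timed_insert({"_id": 1, "items": [{"id": i} for i in range(1000)]}, "items.id")

# Case 2: array of plain integers -- each element itself becomes an index key.
timed_insert({"_id": 2, "items": list(range(1000))}, "items")
```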
| Comments |
| Comment by David Storch [ 08/Dec/15 ] |
|
This looks like a duplicate of |
| Comment by Michael Henson [ 14/Mar/13 ] |
|
Thanks for the follow-up Andy. We were able to work around this by removing the index on that field. We'll be more mindful of that limitation in the future. |
| Comment by Andy Schwerin [ 14/Mar/13 ] |
|
michael@songza.com, this is a straight up performance issue having to do with how the indexing system extracts key data from documents during updates, inserts, etc. Improving the performance while maintaining existing behavior will be tricky, and because it involves durable data, will require extensive testing. I'm moving this issue into triage, but in the short term I would recommend altering your schema when arrays might be very large. |
| Comment by Andy Schwerin [ 13/Mar/13 ] |
|
I approximately repeated michael@songza.com's experiment, using a build of mongod with CPU profiler support enabled. Attached are SVG renderings of the hot call paths. I tried to tune the two cases so that both took similar amounts of wall-clock time, so the integer case has many more iterations in its profile than the object case does. Notice that in the "insert object" case, 90% of the time is spent in BSONObj::getField and its callees. My judgment is that a lot of the time goes into string comparisons and into reparsing data from the raw BSON representation. |
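A rough conceptual sketch of the asymmetry described above (this is not mongod's actual code, and extract_index_keys is a hypothetical name): for an array of sub-objects, the indexed field must be located inside each element, whereas an array of integers yields each element directly as a key.

```python
# Conceptual sketch only -- not mongod's implementation. It illustrates the extra
# per-element work in the object case: a field lookup per sub-object, which against
# raw BSON means reparsing the element and comparing field-name strings.
def extract_index_keys(items, field=None):
    keys = []
    for element in items:
        if field is None:
            # Integer case: the array element itself is the index key.
            keys.append(element)
        else:
            # Object case: the indexed field must be found inside each element.
            keys.append(element.get(field))
    return keys

object_keys = extract_index_keys([{"id": i} for i in range(1000)], field="id")
integer_keys = extract_index_keys(list(range(1000)))
```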
| Comment by Michael Henson [ 13/Mar/13 ] |
|
I can understand both the cost of a document move and the cost of the initial indexing. The surprising thing here, though (and I apologize for probably not being as clear about this as I could have been), is that the cost of indexing a list of objects is so much greater than the cost of doing the same with a simple list of integers. Note the numbers below:

> ./test-indexes.py --object
> ./test-indexes.py --integer

Those numbers suggest that it is a little over 30 times faster to index a list of simple types than it is to index a list of complex types. While I probably would not be willing to accept a 3.6-second lock in the database in a production setup, a 0.1-second lock on this particular dataset might be acceptable in our use case. Can you think of any reason for the large discrepancy between these two types of list items? |
| Comment by Bryan Reinero [ 12/Mar/13 ] |
|
Hi Michael,

There are a couple of interesting mechanisms in play here.

The composite key you declared is a multikey index built on an array. Multikey indexes contain an entry in the index b-tree for every element of the array, so if you index arrays that have many elements, the resulting index will be large. Since MongoDB keeps indexes consistent with the data they cover, any insertion, and any subsequent update to the array, means the index must be updated too. The impact of updates is proportional to the number of elements per document.

The effect is compounded when the indexed array is subject to growth. Documents inserted into MongoDB are stored adjacent to one another to make efficient use of disk space. If an update increases the size of a document, however, that document may no longer fit in its original space and will need to move on disk. Because MongoDB's index nodes contain the location of the document they reference, any move of a document on disk means the index must be updated as well. In the course of the "$push" test, the original document has 1000 elements with corresponding positions in the index b-tree, so the $push of the additional 1000 elements causes the original document to move on disk, necessitating an update of the existing index nodes as well as the insertion of new nodes to cover the 1000 new elements.

Multikey indexes are, of course, a useful feature, but they must be used with consideration for the size and maintenance costs they incur. Indexing strategies that include arrays subject to unbounded growth can produce very large indexes for any collection of significant size and can increase latency. This is why schema design is such an important aspect of using MongoDB efficiently. Large, unbounded indexed arrays can be avoided by designing your schema with a more normalized strategy. |
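As a hedged sketch of the trade-off described above, assuming pymongo and a local mongod (the test_items collection and its field names are illustrative assumptions, not from the ticket): growing an embedded, indexed array via $push enlarges the parent document, whereas a more normalized layout appends small fixed-size documents and indexes only scalar fields.

```python
# Sketch of the two layouts discussed above, assuming pymongo and a local mongod.
from pymongo import MongoClient, ASCENDING

db = MongoClient("localhost", 27017).test_db

# Embedded-array layout from the ticket: pushing 1000 more sub-objects grows the
# parent document, which can force it to move on disk and its existing index
# entries to be rewritten, in addition to inserting 1000 new index entries.
db.test.update_one(
    {"_id": 1},
    {"$push": {"items": {"$each": [{"id": i} for i in range(1000, 2000)]}}},
)

# Normalized layout: one small document per item. Adding items inserts new
# fixed-size documents instead of growing an existing one, and the index covers
# scalar fields rather than a large embedded array.
db.test_items.create_index([("parent_id", ASCENDING), ("item_id", ASCENDING)])
db.test_items.insert_many([{"parent_id": 1, "item_id": i} for i in range(1000, 2000)])
```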