[SERVER-8954] Index Key Extraction Much Slower for Some Data Schemas Than Others Created: 12/Mar/13  Updated: 08/Dec/15  Resolved: 08/Dec/15

Status: Closed
Project: Core Server
Component/s: Index Maintenance
Affects Version/s: 2.2.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Michael Henson Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 12.04 LTS


Attachments: File integer.svg     File object.svg     File test-indexes.py    
Issue Links:
Duplicate
duplicates SERVER-8192 Optimize btree key generation Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

To run the test case using objects in the array:
python test-indexes.py --object

To run the test case using integers in the array:
python test-indexes.py --integer

Participants:

 Description   

I have a collection that is essentially an _id and a list of objects of a
complex type. There is a multikey index on this collection from the _id
to the id of each of the items in the list.

i.e.:

db.test.ensureIndex({_id: 1, 'items.id': 1})
db.test.insert({
    _id: ObjectId("...."),
    items: [
        {id: 1},
        {id: 2},
        ....
        {id: 1000},
    ]
})

With a small number of items in the list, insert and update times for an
individual item are reasonable, but once the number of items in the list
exceeds 1,000, the time to insert or update just one item starts to slow
down dramatically:

Inserted document with 1000 items in 0.048251 seconds
Updated document ($set) with 1000 items in 0.104173 seconds
Updated document ($push) with 1000 items in 0.318420 seconds
Inserted document with 2000 items in 0.199266 seconds
Updated document ($set) with 2000 items in 0.483723 seconds
Updated document ($push) with 2000 items in 1.026530 seconds
Inserted document with 3000 items in 0.593618 seconds
Updated document ($set) with 3000 items in 1.053177 seconds
Updated document ($push) with 3000 items in 2.245902 seconds
Inserted document with 4000 items in 0.991389 seconds
Updated document ($set) with 4000 items in 1.898991 seconds
Updated document ($push) with 4000 items in 4.001129 seconds
Inserted document with 5000 items in 1.490980 seconds
Updated document ($set) with 5000 items in 3.080210 seconds
Updated document ($push) with 5000 items in 6.076108 seconds
Inserted document with 6000 items in 2.144194 seconds
Updated document ($set) with 6000 items in 4.325883 seconds

I've attached a test program that produces the output described above. It
will insert a test document with an ever-increasing number of items. It
will then $set the list on the newly inserted document to itself. After
that it will attempt to $push one new item onto the list.
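
Roughly, each iteration of the script does something like the following (this is only a sketch, written with current pymongo method names and illustrative database/collection names; the attached test-indexes.py is the authoritative version):

# Sketch of one iteration of the test; not the attached script itself.
import time
from pymongo import MongoClient, ASCENDING
from bson import ObjectId

client = MongoClient()   # assumes a mongod running on localhost:27017
coll = client.test.test
coll.create_index([("_id", ASCENDING), ("items.id", ASCENDING)])

def run_iteration(n, use_objects=True):
    items = [{"id": i} for i in range(n)] if use_objects else list(range(n))
    doc_id = ObjectId()

    start = time.time()
    coll.insert_one({"_id": doc_id, "items": items})
    print("Inserted document with %d items in %f seconds" % (n, time.time() - start))

    start = time.time()
    coll.update_one({"_id": doc_id}, {"$set": {"items": items}})
    print("Updated document ($set) with %d items in %f seconds" % (n, time.time() - start))

    start = time.time()
    new_item = {"id": n} if use_objects else n
    coll.update_one({"_id": doc_id}, {"$push": {"items": new_item}})
    print("Updated document ($push) with %d items in %f seconds" % (n, time.time() - start))

for n in range(1000, 7000, 1000):
    run_iteration(n, use_objects=True)   # use_objects=False approximates the --integer case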

I've run the same test as above with integers as the list items instead
of objects. As the number of items increases, the insert/update speed
slows down, but the performance doesn't degrade nearly as severely as it
does when using objects.



 Comments   
Comment by David Storch [ 08/Dec/15 ]

This looks like a duplicate of SERVER-8192, which was fixed in development version 3.1.1 and is first generally available in production version 3.2.0. Closing as a Duplicate.

Comment by Michael Henson [ 14/Mar/13 ]

Thanks for the follow-up, Andy. We were able to work around this by removing the index on that field. We'll be more mindful of that limitation in the future.

Comment by Andy Schwerin [ 14/Mar/13 ]

michael@songza.com, this is a straight up performance issue having to do with how the indexing system extracts key data from documents during updates, inserts, etc. Improving the performance while maintaining existing behavior will be tricky, and because it involves durable data, will require extensive testing. I'm moving this issue into triage, but in the short term I would recommend altering your schema when arrays might be very large.

Comment by Andy Schwerin [ 13/Mar/13 ]

I approximately repeated michael@songza.com's experiment, using a build of mongod with CPU profiler support enabled. Attached are SVG renderings of the hot call paths. I tried to tune the two cases so that both took similar amounts of wall clock time, so the integer case has many more iterations in its profile than the object case.

Notice that in the "insert object" case 90% of the time is spent in BSONObj::getField and its callees. My judgement is that a lot of time is going into string comparisons and reparsing data from the raw bson representation.
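
To make that concrete, here is a purely conceptual sketch (in Python, not mongod's actual code) of the extra per-element work the object case implies when extracting 'items.id' keys, compared with taking scalar array elements directly:

# Conceptual only -- illustrates why key extraction from an array of
# subdocuments costs a field lookup per element, while scalars do not.
def extract_index_keys(items):
    keys = []
    for element in items:
        if isinstance(element, dict):
            # Subdocument: resolve the 'id' field by name for every element
            # (roughly where BSONObj::getField shows up in the attached profile).
            keys.append(element.get("id"))
        else:
            # Scalar: the element itself is the key; no per-element field lookup.
            keys.append(element)
    return keys

print(extract_index_keys([{"id": 1}, {"id": 2}]))  # [1, 2]
print(extract_index_keys([1, 2]))                  # [1, 2]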

Comment by Michael Henson [ 13/Mar/13 ]

I can understand both the cost of a document move as well as the cost of the initial indexing. The surprising thing here though, and I apologize for probably not being as clear as I could have about this, is that the cost of indexing a list of objects is so much greater than the cost of doing the same with a simple list of integers. Note the numbers below:

> ./test-indexes.py --object
Inserted 5000 in 1.005717 seconds
Updated ($set) 5000 in 1.799109 seconds
Updated/addone ($set) 5000 in 3.646754 seconds
Updated ($push) 5000 in 3.605636 seconds

> ./test-indexes.py --integer
Inserted 5000 in 0.059405 seconds
Updated ($set) 5000 in 0.059590 seconds
Updated/addone ($set) 5000 in 0.111611 seconds
Updated ($push) 5000 in 0.108521 seconds

Those numbers suggest that it's a little over 30 times faster to index a list of simple types than it is to index a list of complex types. While I probably would not be willing to accept a 3.6-second lock in the database in a production setup, a 0.1-second lock on this particular dataset might be acceptable for our use case. Can you think of any reason for the large discrepancy between these two types of list items?

Comment by Bryan Reinero [ 12/Mar/13 ]

Hi Michael,

There are a couple of interesting mechanisms in play here:

The composite key declared is a multi-key index built on an array. Multi-key indexes contain an entry in the index b-tree for every element in the array, so if you are indexing arrays that have many elements, the resulting index will be large. Now, since MongoDB keeps indexes consistent with the data they cover, any insertion of, or subsequent update to, the array means that the index will need to be updated too. The impact of updates will be proportional to the number of elements per document.
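
For example (conceptual only, not the on-disk key format), the index from the description produces one (_id, items.id) key per array element:

# A 1000-element array contributes 1000 index keys for this one document,
# all of which must be maintained whenever the array is written.
doc = {"_id": "X", "items": [{"id": i} for i in range(1, 1001)]}
index_keys = [(doc["_id"], item["id"]) for item in doc["items"]]
print(len(index_keys))    # 1000
print(index_keys[:3])     # [('X', 1), ('X', 2), ('X', 3)]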

The effect is compounded when the indexed array is subject to growth. Documents inserted into MongoDB are stored adjacent to one another in an effort to make efficient use of disk space. However, if an update increases the size of the updated document, that document won't fit in its original space and will need to move on disk. Since MongoDB's index nodes contain the location of the document they reference, any move of a document on disk means that the index must be updated. In the course of the "$push" test, the original document has 1000 elements with corresponding positions in the index b-tree, so the $push of additional elements means that the original document will move on disk, necessitating an update of the existing index nodes as well as the insertion of new nodes to cover the new elements.

Of course, multi-key indexes are a useful feature, but they must be used with consideration for the size and maintenance costs they'll incur. Indexing strategies that include arrays subject to unbounded growth can produce very large indexes for any collection of significant size, and they increase latency. This is why schema design is such an important aspect of using MongoDB efficiently. Large, unbounded indexed arrays can be avoided by designing your schema with a more normalized strategy.
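
A minimal sketch of what such a normalized layout might look like (collection and field names here are purely illustrative):

from pymongo import MongoClient, ASCENDING
from bson import ObjectId

client = MongoClient()                 # assumes a local mongod
items = client.test.test_items         # one document per former array element
items.create_index([("parent_id", ASCENDING), ("id", ASCENDING)])

parent_id = ObjectId()

# The equivalent of $push-ing one element becomes a small, fixed-cost insert;
# no large array needs to be re-indexed or relocated on disk.
items.insert_one({"parent_id": parent_id, "id": 1001})

# Reading all the elements for a parent becomes an indexed ranged query.
child_docs = list(items.find({"parent_id": parent_id}))

With this layout the per-operation indexing cost stays bounded by the size of the individual element rather than by the size of the whole array.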
