[SERVER-24631] TTL Monitor performance degradation on MongoDB 3.0 Created: 17/Jun/16  Updated: 30/Jan/17  Resolved: 19/Jul/16

Status: Closed
Project: Core Server
Component/s: MMAPv1
Affects Version/s: 3.0.12
Fix Version/s: 3.3.11

Type: Bug
Priority: Major - P3
Reporter: Gregory Banks
Assignee: Kevin Albertson
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

AWS t2.small
$> uname -a
Linux ip-10-0-0-121 3.13.0-74-generic #118-Ubuntu SMP Thu Dec 17 22:52:10 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
$> cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.4 LTS"


Attachments: File dump.tgz     PNG File ttl_read_avg.png     PNG File ttl_read_sum.png     Text File vmstat.log    
Issue Links:
Duplicate
is duplicated by SERVER-27830 TTL Monitor creates performance degra... Closed
Related
is related to SERVER-27830 TTL Monitor creates performance degra... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

To reproduce this behavior, spin up a t2.small instance on AWS, attach a 33 GB gp2 volume, deploy MongoDB 3.0.12, ensure that dbpath points to a directory on the attached volume, and, finally, restore the attached dump. With zero user activity, you should see read behavior on that volume that resembles the attached graphs and vmstat dump.

Sprint: Integration 17 (07/15/16), Integration 18 (08/05/16)
Participants:

 Description   

Hi,

Since upgrading to 3.0, I’ve noticed a severe spike in resource utilization on AWS that degrades performance. The cause appears to be the way the TTLMonitor queries for indexes when using MMAPv1.

In 2.6, TTL indexes were collected by querying the system.indexes collection like so:

                auto_ptr<DBClientCursor> cursor =
                                db.query( dbName + ".system.indexes" ,
                                          BSON( secondsExpireField << BSON( "$exists" << true ) ) ,
                                          0 , /* default nToReturn */
                                          0 , /* default nToSkip */
                                          0 , /* default fieldsToReturn */
                                          QueryOption_SlaveOk ); /* perform on secondaries too */

In 3.0, system.indexes was deprecated and a new abstraction layer between the database and the storage engine was introduced. As a result, namespace file operations are now much less efficient. The issue I am seeing appears to come from the following bit of code:

    void iterAll(IteratorCallback callback) {
        for (int i = 0; i < n; i++) {
            if (_nodes(i).inUse()) {
                callback(_nodes(i).key, _nodes(i).value);
            }
        }
    }

which gets executed by the TTLMonitor via this code path:

mongo/db/storage/mmap_v1/catalog/hashtab.h - NamespaceHashTable::iterAll
mongo/db/storage/mmap_v1/catalog/namespace_index.cpp - NamespaceIndex::getCollectionNamespaces
mongo/db/storage/mmap_v1/mmap_v1_database_catalog_entry.cpp - MMAPV1DatabaseCatalogEntry::getCollectionNamespaces
mongo/db/ttl.cpp - TTLMonitor::getTTLIndexesForDB

As a result, the entire namespace file for every database gets pulled into memory every time the TTLMonitor executes (every 60 seconds by default). The default namespace file size is only 16 MB, so this isn’t an issue in the most common case. However, if you set up a development environment for a number of users on a single host, you will find yourself scratching your head as to why performance is so bad. Note that performance will be bad regardless of user activity, database size, or the presence of TTL indexes (all of which only exacerbate the situation).
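To illustrate why the cost depends on the size of the namespace file rather than on how many namespaces actually exist, here is a minimal sketch of a scan shaped like iterAll. The Slot, ScanStats, and scanAll names are hypothetical stand-ins for illustration, not MongoDB types:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one slot of the namespace hash table.
struct Slot {
    bool inUse = false;
};

struct ScanStats {
    std::size_t slotsTouched = 0;  // slots read, whether occupied or not
    std::size_t callbacks = 0;     // occupied slots actually reported
};

// Same shape as iterAll: every slot is inspected to find the in-use ones,
// so the amount of memory touched is proportional to the table size.
inline ScanStats scanAll(const std::vector<Slot>& table) {
    ScanStats stats;
    for (const Slot& slot : table) {
        ++stats.slotsTouched;
        if (slot.inUse) {
            ++stats.callbacks;
        }
    }
    assert(stats.callbacks <= stats.slotsTouched);
    return stats;
}
```

In this sketch, a table of 1000 slots with only two in use still reports 1000 slots touched, which is the behavior that pulls the whole memory-mapped file in from disk.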

In addition to the development use case, it is possible to run into similar issues with a single database that has many collections/indexes and requires a namespace file larger than the default (up to 2048 MB). In this case, both the TTLMonitor and any command that involves a namespace file scan (e.g., listCollections) will cause issues.
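For capacity planning, a rough upper bound on the read load can be sketched as below. This is a back-of-the-envelope estimate that assumes every TTLMonitor pass re-reads every namespace file from disk, i.e. nothing stays in the page cache between passes. The function name and parameters are hypothetical:

```cpp
#include <cassert>
#include <cstddef>

// Worst-case read rate in MB/s, ASSUMING each pass re-reads every
// namespace file from disk (no page-cache hits between passes).
inline double worstCaseReadMBPerSec(std::size_t databaseCount,
                                    double nsFileSizeMB,
                                    double ttlIntervalSeconds) {
    assert(ttlIntervalSeconds > 0.0);
    return static_cast<double>(databaseCount) * nsFileSizeMB / ttlIntervalSeconds;
}
```

Under that assumption, a hypothetical host with 100 databases at the default 16 MB namespace file size and the default 60-second interval works out to roughly 26.7 MB/s, and a single database with a 2048 MB namespace file to roughly 34.1 MB/s.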

To reproduce this behavior, spin up a t2.small instance on AWS, attach a 33 GB gp2 volume, deploy MongoDB 3.0.12, ensure that dbpath points to a directory on the attached volume, and, finally, restore the attached dump. With zero user activity, you should see read behavior on that volume that resembles the attached graphs and vmstat dump.

At the very least, I think this should be explicitly documented in order to save time/confusion on the part of developers/operations and to help with capacity planning and architecture decisions going forward.

Greg



 Comments   
Comment by Githook User [ 19/Jul/16 ]

Author:

Kevin Albertson (kevinAlbs) <kevin.albertson@10gen.com>

Message: SERVER-24631: Add TTL collection namespace cache
Branch: master
https://github.com/mongodb/mongo/commit/d059552b998bd9f3ff0275016dff2df89a137b02

Comment by Gregory Banks [ 23/Jun/16 ]

Hey Thomas,

Awesome, and no problem. Thanks!

Cheers,
Greg

Comment by Kelsey Schubert [ 21/Jun/16 ]

Hi gregbanks,

I've confirmed that this issue affects MongoDB 3.2.7 as well, and I'm marking this ticket to be scheduled. Please continue to watch for updates.

Thank you again for the detailed steps to reproduce!
Thomas

Comment by Gregory Banks [ 17/Jun/16 ]

No problem. Please let me know if you need anything else!

Comment by Ramon Fernandez Marina [ 17/Jun/16 ]

Thanks for the detailed bug report gregbanks, we're investigating.
