[SERVER-21928] Mongos uses too much memory Created: 17/Dec/15  Updated: 13/Jul/16  Resolved: 13/Jul/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.0.4, 3.0.7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Steffen Assignee: Unassigned
Resolution: Done Votes: 4
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:
Ubuntu 14.04.2 LTS
Release: 14.04
Codename: trusty

Operating System: Linux
Participants:

Description

On one of our hosts, the mongos process uses more and more memory until the OOM killer kills it.
This host runs cron-based PHP scripts that talk to this mongos router, so we suspect the problem is related to our PHP scripts (PHP driver?), since our other mongos instances don't have this issue.
Memory usage increases over time, but it's hard to say exactly when.

Our cluster runs MongoDB version 3.0.4 and has 25 shards.
Here are some system details:

$ uname -a
Linux hostname.domain.net 3.13.0-49-generic #83-Ubuntu SMP Fri Apr 10 20:11:33 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
 
$ free -m
             total       used       free     shared    buffers     cached
Mem:         64321      44405      19915         55        701      10353
-/+ buffers/cache:      33350      30970
Swap:          975        975          0
 
$ mongostat --port 27017
insert query update delete getmore command flushes mapped vsize   res faults qr|qw ar|aw netIn netOut conn set repl     time
    *0    *0     *0     *0       0     2|0       0        24.5G 21.1G      0   0|0   0|0  133b     6k    5      RTR 12:52:52
    *0    *0     *0     *0       0     1|0       0        24.5G 21.1G      0   0|0   0|0   79b     6k    5      RTR 12:52:53
    *0    *0     *0     *0       0     1|0       0        24.5G 21.1G      0   0|0   0|0   79b     6k    5      RTR 12:52:54
    *0    *0     *0     *0       0     2|0       0        24.5G 21.1G      0   0|0   0|0  137b     7k    5      RTR 12:52:55
    *0    *0     *0     *0       0     2|0       0        24.5G 21.1G      0   0|0   0|0  133b     6k    5      RTR 12:52:56
 
$ mongos --version
MongoS version 3.0.4 starting: pid=100831 port=27017 64-bit host=hostname.domain.net (--help for usage)
git version: 0481c958daeb2969800511e7475dc66986fa9ed5
OpenSSL version: OpenSSL 1.0.1f 6 Jan 2014
build sys info: Linux ip-10-45-73-23 3.13.0-24-generic #46-Ubuntu SMP Thu Apr 10 19:11:08 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
 
$ ps auxw |grep -v grep |grep mongos
mongodb  133123 12.4 33.5 25685928 22088344 ?   Sl   Dec11 1089:22 /usr/bin/mongos --config /etc/mongodb/mongodb-router-27017.conf --logpath /var/log/mongodb/mongodb-router-27017.log
 
$ php -v
PHP 5.6.16-2+deb.sury.org~trusty+1 (cli)
Copyright (c) 1997-2015 The PHP Group
Zend Engine v2.6.0, Copyright (c) 1998-2015 Zend Technologies
    with Zend OPcache v7.0.6-dev, Copyright (c) 1999-2015, by Zend Technologies
 
$ pecl list
Installed packages, channel pecl.php.net:
=========================================
Package  Version State
apcu     4.0.8   stable
imagick  3.1.2   stable
mogilefs 0.9.2   beta
mongo    1.6.9   stable
msgpack  0.5.6   beta
stats    1.0.3   stable
uuid     1.0.3   stable
xhprof   0.9.4   beta

At this point, after running for 6 days, it uses 22 GB of memory with 5 active connections.

We also tested mongos version 3.0.7 on this host, with the same result.

We think this is a bug in mongos: it doesn't release memory.



Comments
Comment by Kelsey Schubert [ 13/Jul/16 ]

Hi steffen,

Thanks for the clarification. From our investigation, we haven't seen enough information here to indicate a bug in either the MongoDB server or the PHP driver. There are a number of possible explanations for the memory consumption you are observing that would originate at the application layer (e.g., too many open cursors).

For MongoDB-related support discussion, please post on the mongodb-users group or on Stack Overflow with the mongodb tag; a question like this, which needs more back-and-forth discussion, is best suited to the mongodb-users group. Please also see our Technical Support page for additional support resources.

Kind regards,
Thomas

Comment by Steffen [ 28/Apr/16 ]

Hi,

Our router hasn't restarted since Tue Feb 9 09:11:50 2016.
It's currently using 11,750,344 KB (~11.2 GB) of memory.
We are using these versions:
php mongo driver 1.6.13
mongos 3.0.9
We also switched most of our mongod servers to WiredTiger.

This happens only on the mongos instances that our PHP scripts run against.

The OOM killer may no longer be triggered because the memory footprint is lower, or because the leak is no longer unbounded.
I still think that 11.2 GB is not a reasonable memory footprint for a mongos with 5 connections.

Comment by Kelsey Schubert [ 25/Apr/16 ]

Hi rschwarzberg and steffen,

We still need the information Ramon requested to diagnose the problem. If this is still an issue for you, can you please upload the logs and clarify whether all of your mongos instances are affected?

Thank you,
Thomas

Comment by Ramon Fernandez Marina [ 28/Mar/16 ]

rschwarzberg, can you please upload logs for an affected mongos from startup until it gets killed? Also, are all of your mongos instances affected, or only some of them?

steffen, are you still seeing mongos processes killed by the OOM killer?

Comment by Robert Schwarzberg [ 10/Feb/16 ]

I added a MongoClient::killCursor() call after each processed batch (currently 1000 items per query) and then created a new cursor for the next batch. The problem still appears to exist, though.
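
For illustration, here is a minimal sketch of that per-batch cleanup, assuming the legacy pecl mongo driver (1.5+, where MongoClient::killCursor() is available); the URI, database, collection, and batch size are placeholders, not the actual production code:

    $client     = new \MongoClient($uri);    // placeholder connection string
    $collection = $client->selectDB('mydb')->selectCollection('queued_events');
    $lastSeenId = 0;                         // assumes a numeric _id, for illustration

    do {
        // One batch per query, resuming after the last _id seen.
        $cursor = $collection->find(['_id' => ['$gt' => $lastSeenId]])
            ->sort(['_id' => 1])
            ->limit(1000);

        $done = true;
        foreach ($cursor as $item) {
            $lastSeenId = $item['_id'];
            $done = false;
            // ... process $item ...
        }

        // Explicitly kill the server-side cursor before creating the next one;
        // MongoCursor::info() only reports 'server' and 'id' once iteration
        // has begun, and 'id' is 0 if the cursor is already exhausted.
        $info = $cursor->info();
        if (!empty($info['id'])) {
            \MongoClient::killCursor($info['server'], $info['id']);
        }
    } while (!$done);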

Comment by Steffen [ 08/Feb/16 ]

We will reevaluate this.

Comment by Ramon Fernandez Marina [ 06/Feb/16 ]

Have you tried Jeremy's suggestion above of using MongoClient::killCursor()?

Since your other mongos instances are not affected, this all points to too many cursors accumulating as a result of these scripts.

Comment by Steffen [ 03/Feb/16 ]

Hello,

We upgraded to MongoDB 3.0.9 and we still have the memory issues.
What should we do?

Comment by Steffen [ 22/Dec/15 ]

We are running another test and haven't hit any exceptions so far. The result on the mongos router is still the same.
Code change:

            } catch (\MongoCursorException $e) {
                $this->logger->error("Caught cursor exception: {$e->getMessage()}");
                $this->documentManager->getConnection()->close();
                $done = false;
            }

Before start:
    *0    *0     *0     *0       0     2|0       0        133.0M 16.0M      0   0|0   0|0  133b     6k    3      RTR 09:07:34
Shortly after start:
    *0   138     *0     *0       0    10|0       0        156.0M 37.0M     13   0|0   0|0   33k     6m    4      RTR 09:28:37
    *0   271     *0     *0       0     2|0       0        203.0M 83.0M      0   0|0   0|0   64k    11m    4      RTR 09:28:38
    *0   276     *0     *0       0     1|0       0        249.0M 128.0M      0   0|0   0|0   65k    11m    4      RTR 09:28:39
    *0   274     *0     *0       0     2|0       0        293.0M 173.0M      0   0|0   0|0   64k    11m    4      RTR 09:28:40
insert query update delete getmore command flushes mapped  vsize    res faults qr|qw ar|aw netIn netOut conn set repl     time
    *0   274     *0     *0       0     2|0       0        338.0M 218.0M      0   0|0   0|0   64k    11m    4      RTR 09:28:41
    *0   257     *0     *0       0     2|0       0        380.0M 260.0M      0   0|0   0|0   60k    10m    4      RTR 09:28:42
    *0   254     *0     *0       0     2|0       0        422.0M 302.0M      0   0|0   0|0   60k    10m    4      RTR 09:28:43
    *0   266     *0     *0       0     1|0       0        465.0M 346.0M      0   0|0   0|0   62k    11m    4      RTR 09:28:44
    *0   260     *0     *0       0     8|0       0        509.0M 388.0M      0   0|0   0|0   61k    11m    4      RTR 09:28:45
    *0   263     *0     *0       0     2|0       0        552.0M 434.0M      0   0|0   0|0   62k    11m    4      RTR 09:28:46
    *0   263     *0     *0       0     1|0       0        595.0M 477.0M      0   0|0   0|0   62k    11m    4      RTR 09:28:47
After running for longer, it climbs to around 40.1G resident:
    *0    13     *0     13       0    15|0       0         40.4G  40.1G      0   0|0   0|0    4k   200k    5      RTR 10:04:19

Again, cursor timeout messages appear in the mongos log.

The first one appeared 10 minutes after start:
2015-12-22T09:38:44.016+0000 I QUERY    [cursorTimeout] killing old cursor 314751932572947731 idle for: 601903ms
...
2015-12-22T10:06:02.476+0000 I QUERY    [cursorTimeout] killing old cursor 321733509092139587 idle for: 600644ms

I just heard that we are using a fork of the Doctrine MongoDB ODM: https://github.com/researchgate/mongodb-odm
It's possible that the bug is in Doctrine and/or the mongos router.

The script has finished and we hit no exceptions. Memory usage on the router is still high:

    *0    *0     *0     *0       0     2|0       0         40.4G  40.1G      0   0|0   0|0  133b     6k    4      RTR 11:10:36

Comment by Jeremy Mikola [ 21/Dec/15 ]

Do you have any insight into whether this catch block is being reached, and what the exceptions might be?

} catch (\MongoCursorException $e) {
    $done = false;
}

If the remove operation were to throw, that would abandon the original cursor you were iterating, and it's possible that there were still batches waiting to be fetched on the server side (i.e. the cursor would be left open). This may be a case where you'd be justified in using MongoClient::killCursor() to clean up the old cursor manually before you start a new query to pick up where you left off. Since you're using a foreach, it wouldn't be feasible to simply restart iteration on the same $cursor, since that would execute a new query.
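
For illustration, a hedged sketch of that manual cleanup with the legacy pecl mongo driver (MongoClient::killCursor() exists as of 1.5.0); $cursor and $lastSeenId stand in for the loop in question, and the structure is illustrative rather than the actual application code:

    try {
        foreach ($cursor as $item) {
            $lastSeenId = $item['_id'];
            // ... the per-item remove() can throw here, abandoning $cursor ...
        }
    } catch (\MongoCursorException $e) {
        // Kill the abandoned server-side cursor before issuing the follow-up
        // query that resumes after $lastSeenId. MongoCursor::info() exposes
        // the 'server' hash and cursor 'id' once iteration has begun.
        $info = $cursor->info();
        if (!empty($info['id'])) {
            \MongoClient::killCursor($info['server'], $info['id']);
        }
    }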

Comment by Steffen [ 21/Dec/15 ]

We tested with another router on a different host. The script was still started and run on the same host as before. I restarted this router before we began the test, so its memory was clean:

    *0    *0     *0     *0       0     7|0       0        127.0M 14.0M      0   0|0   0|0  450b    58k    3      RTR 11:31:41

We started the script and memory usage began to rise:

    *0    23     *0     24       0    26|0       0         45.9G  45.7G      0   0|0   0|0    7k   484k    4      RTR 11:55:16

In the router log I saw these interesting messages:

2015-12-21T11:54:49.253+0000 I QUERY    [cursorTimeout] killing old cursor 3552240116488939 idle for: 601318ms
2015-12-21T11:54:49.253+0000 I QUERY    [cursorTimeout] killing old cursor 3552412520747275 idle for: 601278ms
2015-12-21T11:54:49.253+0000 I QUERY    [cursorTimeout] killing old cursor 3552583107550821 idle for: 601238ms
2015-12-21T11:54:49.253+0000 I QUERY    [cursorTimeout] killing old cursor 3552799049369721 idle for: 601188ms
2015-12-21T11:54:49.253+0000 I QUERY    [cursorTimeout] killing old cursor 3552969851985593 idle for: 601148ms
2015-12-21T11:54:49.253+0000 I QUERY    [cursorTimeout] killing old cursor 3553140936241862 idle for: 601108ms
2015-12-21T11:54:49.253+0000 I QUERY    [cursorTimeout] killing old cursor 3553314120899711 idle for: 601068ms
2015-12-21T11:54:49.253+0000 I QUERY    [cursorTimeout] killing old cursor 3553485397727544 idle for: 601028ms
2015-12-21T11:54:49.253+0000 I QUERY    [cursorTimeout] killing old cursor 3557523255107532 idle for: 600088ms

After we stopped the script, memory usage stayed at a high level.

Cron started the script another time and the router was killed. Last log lines before the restart, followed by the kernel OOM messages:

2015-12-21T12:55:14.091+0000 I QUERY    [cursorTimeout] killing old cursor 18993008554380031 idle for: 600005ms
2015-12-21T12:56:34.229+0000 I CONTROL  ***** SERVER RESTARTED *****

[3620790.136020] Out of memory: Kill process 105596 (mongos) score 795 or sacrifice child
[3620790.136267] Killed process 105596 (mongos) total-vm:53190792kB, anon-rss:53060356kB, file-rss:624kB

Here is the code that the script runs:

    public function cleanupOldEvents(\DateInterval $dateInterval) {
        $date = new \DateTime();
        $date->sub($dateInterval);
        $removeCount = 0;
        $lastSeenId = 0;
        $itemsChecked = 0;
        $queryBuilder = $this->documentManager->createQueryBuilder(QueuedEvent::class);
 
        // Process in batches of BATCH_SIZE, resuming after the last _id seen.
        do {
            try {
                if (!$lastSeenId) {
                    $query = $queryBuilder
                        ->find()
                        ->sort('_id')
                        ->limit(self::BATCH_SIZE)
                        ->getQuery();
                } else {
                    $query = $queryBuilder
                        ->find()
                        ->limit(self::BATCH_SIZE)
                        ->field('_id')
                        ->sort('_id')
                        ->gt($lastSeenId)
                        ->getQuery();
                }
                /** @var \rg\ODM\MongoDB\Query\Query $query */
                $query->setHydrate(false);
                $cursor = $query->execute();
 
                $done = true;
                /** @var QueuedEvent $item */
                foreach ($cursor as $item) {
                    $itemsChecked++;
                    $lastSeenId = $item['_id'];
                    $done = false;
 
                    $result = $queryBuilder
                        ->remove()
                        ->field('_id')->equals($item['_id'])
                        ->getQuery()
                        ->execute();
                    if ($result['ok']) {
                        $removeCount++;
                    }
                }
 
            } catch (\MongoCursorException $e) {
                // Retry with a fresh query from $lastSeenId; note that the
                // abandoned cursor is not explicitly killed here.
                $done = false;
            }
        } while (!$done);
        return $removeCount;
    }

Comment by Steffen [ 21/Dec/15 ]

Regarding our PHP MongoDB connections:

We just use a simple

new \MongoClient($uri)

We also follow this suggestion: http://stackoverflow.com/questions/17839962/persistent-connection-or-connection-pooling-in-php54-nginx-phpfpm-mongodb/17840527#17840527

When a PHP script finishes, its connections are closed. Most of our scripts run for a short amount of time; some run for hours.
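
For context, a minimal sketch of that connection pattern (the URI is a placeholder):

    // The legacy driver keeps persistent connections per worker process and
    // closes them when the process exits; close() releases the current
    // write connection earlier if needed.
    $client = new \MongoClient('mongodb://mongos.example.net:27017');
    // ... queries via $client ...
    $client->close();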

Test results (with another router) will follow today.

Comment by Steffen [ 18/Dec/15 ]

FYI: We will run some tests on Monday.

Comment by Jeremy Mikola [ 17/Dec/15 ]

If possible, can you share how the driver is being used from the PHP script? With only five open connections on mongos, I'd assume the script isn't abandoning open connections (as it might with a client-side socket timeout).

Is this only an issue when mongos is running on the same machine as the PHP cron jobs? Would it be possible to test if the same PHP cron jobs connecting to a remote mongos trigger the memory increases (obviously, this would come at the cost of an extra network hop)?
