PHP Legacy Driver / PHP-1164

If a connection times out on the client side it is left open in state CLOSE_WAIT on server side

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Won't Fix
    • Affects Version/s: 1.5.5
    • Fix Version/s: None
    • Component/s: pecl-mongo
    • Labels:
      None
    • Environment:
      Clients: Ubuntu 14.04 / Mac OSX, PHP-FPM 5.4 / 5.5
      Server: Ubuntu 14.04, Mongo 2.6.4
    • # Replies:
      9
    • Last comment by Customer:
      false

      Description

      We upgraded servers from 2.4.x to 2.6.4 and this issue came up.

      We have MongoCursor::$timeout = 20000 in our scripts. Unfortunately, some queries do sometimes take that long and are then killed. This worked fine until the upgrade. After the upgrade, the server does not close these connections and they are left in state CLOSE_WAIT. The server gradually becomes unresponsive and has to be restarted. Also, if the client stops sending requests, the connections are closed (I noticed this when trying to reproduce the situation in testing). Here is a simple script I used to reproduce this error:

      <?php
      MongoCursor::$timeout = 100;
      $mc = new MongoClient("mongodb://test-server");
      $coll = $db->test->collection;
      $reg = new MongoRegex("/^Text/i");
      $it = $coll->find(array("textfield"=> $reg));
      var_dump(iterator_to_array($it));
      ?>

      I then used siege (or ab) to stress this script with 10 concurrent connections. mongostat showed open connections on the server side going up from 18 to 1140. Performing other queries while this was going on (without the timeout) was also impossible.

      We also tried setting $timeout to -1 in production, but we then hit PHP's max_execution_time, which eventually caused the same effect.

        Activity

        Jeremy Mikola added a comment -

        Edmunds Kalnins: Is this the exact script you're using?

        <?php
        MongoCursor::$timeout = 100;
        $mc = new MongoClient("mongodb://test-server");
        $coll = $db->test->collection;
        $reg = new MongoRegex("/^Text/i");
        $it = $coll->find(array("textfield"=> $reg));
        var_dump(iterator_to_array($it));
        ?>

        I ask because $db isn't defined. It's clear that it should be $mc, but I'd like to confirm that there are no other differences between this paste and the script you're actually using to reproduce the issue.

        Edmunds Kalnins added a comment -

        Sorry, I made some untested changes to the script to make it shorter for pasting. Yes, change $db to $mc. To be 100% sure, here is the actual script:

        <?php
         
        if (empty($_GET['d'])) {
        	MongoCursor::$timeout = 100;
        }
         
        $mc = new MongoClient("mongodb://test-server");
         
        $db = $mc->testdb;
        $coll = $db->testcoll;
         
        if (empty($_GET['d'])) {
        	$reg = new MongoRegex("/^Edmunds/i");
        	$it = $coll->find(array("profile.name"=> $reg));
        }
         
        else {
        	$it = $coll->find(array("profile.name"=> 'Edmunds'));
        }
        var_dump(iterator_to_array($it));
         
        ?>

        By passing ?d=1 to the request, I simulated "normal" requests that would not hit $timeout.

        Jeremy Mikola added a comment - edited

        Test plan

        For some fixture data, I populated a collection with 100,000 documents containing arrays of paragraph text. The import script (and composer.json file) are included in this gist. I then created a script to benchmark through nginx and php-fpm, which would connect and issue a query that would surely time out. socketTimeoutMS is the preferred way to specify client-side timeouts in the driver (MongoCursor::$timeout has been deprecated for some time). The query script follows:

        <?php
         
        $mc = new MongoClient(null, ['socketTimeoutMS' => 100]);
        $coll = $mc->test->php1164;
        $cursor = $coll->find()->sort(['text' => 1]);
        foreach ($cursor as $_) {}

        I initially looked into using $where with a JavaScript sleep() statement, which we do frequently in the driver test suite for simulating timeouts, but that didn't work well with the concurrency testing.
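        For reference, a minimal sketch of that $where/sleep() approach (a hedged illustration only – the collection name is mine, and sleep() here is the server-side JavaScript helper, not PHP's):

        <?php
         
        // Hedged sketch: force every document evaluation to sleep server-side, so the
        // query easily outlasts a 100ms client-side socket timeout. This is not the
        // benchmark script used above, just an illustration of the technique mentioned.
        $mc = new MongoClient(null, ['socketTimeoutMS' => 100]);
        $coll = $mc->test->php1164;
        $cursor = $coll->find(['$where' => 'sleep(100) || true']);
         
        try {
            iterator_to_array($cursor);
        } catch (MongoCursorTimeoutException $e) {
            // the client gives up while the server keeps evaluating $where
        }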

        php-fpm pool configuration

        I gave FPM a dynamic pool, allowing it to spawn up to 1000 workers. After a fresh restart, the pool starts with 5 servers (also the min spares we keep around). We'll see the pool size settle at 50 (the max spares) after the benchmark completes and the pool winds down.

        $ cat /etc/php5/fpm/pool.d/www.conf
        [www]
        listen = /var/run/php5-fpm.sock
        listen.owner = jmikola
        listen.group = www-data
        listen.mode = 0666
        user = jmikola
        group = www-data
         
        pm = dynamic
        pm.max_children = 1000
        pm.min_spare_servers = 5
        pm.max_spare_servers = 50
        pm.start_servers = 5

        ab output

        I executed 100,000 requests (concurrency 500) through ab between 12:32:14 and 12:32:24 (approximately 10 seconds). The "length failures" reported are not significant (see: here).

        So that we can correlate this with the mongostat output below, I've wrapped the benchmark run with date output.

        $ date; ab -n 100000 -c 500 http://php1164.local/; date
        Tue Aug 12 12:32:14 EDT 2014
        This is ApacheBench, Version 2.3 <$Revision: 1430300 $>
        Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
        Licensed to The Apache Software Foundation, http://www.apache.org/
         
        Benchmarking php1164.local (be patient)
        Completed 10000 requests
        Completed 20000 requests
        Completed 30000 requests
        Completed 40000 requests
        Completed 50000 requests
        Completed 60000 requests
        Completed 70000 requests
        Completed 80000 requests
        Completed 90000 requests
        Completed 100000 requests
        Finished 100000 requests
         
         
        Server Software:        nginx/1.4.1
        Server Hostname:        php1164.local
        Server Port:            80
         
        Document Path:          /
        Document Length:        181 bytes
         
        Concurrency Level:      500
        Time taken for tests:   9.543 seconds
        Complete requests:      100000
        Failed requests:        2682
           (Connect: 0, Receive: 0, Length: 2682, Exceptions: 0)
        Write errors:           0
        Non-2xx responses:      97318
        Total transferred:      34955051 bytes
        HTML transferred:       18938959 bytes
        Requests per second:    10478.66 [#/sec] (mean)
        Time per request:       47.716 [ms] (mean)
        Time per request:       0.095 [ms] (mean, across all concurrent requests)
        Transfer rate:          3576.97 [Kbytes/sec] received
         
        Connection Times (ms)
                      min  mean[+/-sd] median   max
        Connect:        0   12 168.4      0    7016
        Processing:     1   30 135.0     11    4317
        Waiting:        0   29 135.0     10    4317
        Total:          2   42 224.8     11    7917
         
        Percentage of the requests served within a certain time (ms)
          50%     11
          66%     12
          75%     13
          80%     14
          90%     18
          95%     23
          98%    452
          99%   1221
         100%   7917 (longest request)
        Tue Aug 12 12:32:24 EDT 2014

        mongostat output

        $ mongostat
        connected to: 127.0.0.1
        insert  query update delete getmore command flushes mapped  vsize    res faults  locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn       time 
            *0     *0     *0     *0       0     2|0       0   352m  11.9g  5.83g      0  test:0.0%          0       0|0     0|0   124b     7k     2   12:32:10 
            *0     *0     *0     *0       0     2|0       0   352m  11.9g  5.83g      0  test:0.0%          0       0|0     0|0   124b     7k     2   12:32:11 
            *0     *0     *0     *0       0     2|0       0   352m  11.9g  5.83g      0  test:0.0%          0       0|0     0|0   124b     7k     2   12:32:12 
            *0     *0     *0     *0       0     2|0       0   352m  11.9g  5.83g      0  test:0.0%          0       0|0     0|0   124b     7k     2   12:32:13 
            *0     *0     *0     *0       0     2|0       0   352m  11.9g  5.83g      0  test:0.0%          0       0|0     0|0   124b     7k     2   12:32:14 
            *0    100     *0     *0       0   323|0       0   352m  12.1g  5.83g      0  test:0.0%          0      21|0    59|0    19k   107k   196   12:32:15 
            *0     59     *0     *0       0   642|0       0   352m  12.1g  5.83g      0  test:0.0%          0     105|0   132|0    39k   150k   208   12:32:16 
            *0      3     *0     *0       0   223|0       0   352m  12.1g  5.83g      0  test:0.0%          0     125|1   160|0    13k    50k   191   12:32:18 
            *0     69     *0     *0       0   646|0       0   352m  12.2g  5.83g      0  test:0.0%          0      86|0   135|0    39k   164k   266   12:32:19 
            *0     *0     *0     *0       0   253|0       0   352m  12.2g  5.83g      0  test:0.0%          0      97|0   167|0    15k    55k   278   12:32:20 
        insert  query update delete getmore command flushes mapped  vsize    res faults  locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn       time 
            *0     30     *0     *0       0   267|0       0   352m  12.2g  5.83g      0  test:0.0%          0     176|0   236|0    16k    69k   321   12:32:21 
            *0      7     *0     *0       0   289|0       0   352m  12.2g  5.83g      0  test:0.0%          0     266|1   269|0    17k    63k   452   12:32:23 
            *0     53     *0     *0       0   716|0       0   352m  12.2g  5.83g      0  test:0.0%          0     186|0   214|0    44k   166k   323   12:32:24 
            *0     *0     *0     *0       0     1|0       0   352m  12.2g  5.83g      0  test:0.1%          0     159|0   240|0   124b     7k   323   12:32:25 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.2%          0     184|0   258|0   124b     7k   323   12:32:26 
            *0     *0     *0     *0       0     3|0       0   352m  12.2g  5.83g      0  test:0.0%          0     210|0   291|0   124b     7k   323   12:32:27 
            *0     *0     *0     *0       0     1|0       0   352m  12.2g  5.83g      0  test:0.1%          0     273|0   305|0   124b     7k   323   12:32:28 
            *0     *0     *0     *0       0     3|0       0   352m  12.2g  5.83g      0  test:0.0%          0     244|1   317|0   186b    10k   323   12:32:29 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.0%          0      69|0   250|0   124b     7k   323   12:32:31 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.2%          0     119|0   238|0   124b     7k   323   12:32:32 
        insert  query update delete getmore command flushes mapped  vsize    res faults  locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn       time 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.0%          0     150|0   236|0   124b     7k   323   12:32:33 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.0%          0     135|0   223|0   124b     7k   323   12:32:34 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.1%          0     178|0   248|0   124b     7k   323   12:32:35 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.0%          0     243|0   263|0   124b     7k   323   12:32:36 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.1%          0     215|0   278|0   124b     7k   323   12:32:37 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.2%          0     251|0   297|0   124b     7k   323   12:32:38 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.0%          0     260|0   300|0   124b     7k   323   12:32:39 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.2%          0     289|0   310|0   124b     7k   323   12:32:40 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.0%          0       0|0   321|0   124b     7k   323   12:32:41 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.1%          0       0|0   318|0   124b     7k   323   12:32:42 
        insert  query update delete getmore command flushes mapped  vsize    res faults  locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn       time 
            *0     *0     *0     *0       0     2|0       1   352m  12.2g  5.83g      0  test:0.0%          0       0|0   321|0   124b     7k   323   12:32:43 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.0%          0       0|0   321|0   124b     7k   323   12:32:44 
            *0      5     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.1%          0       0|0   314|0   124b     7k   323   12:32:45 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.0%          0       0|0   306|0   124b     7k   323   12:32:46 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.0%          0       0|0   321|0   124b     7k   323   12:32:47 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.0%          0       0|0   321|0   124b     7k   323   12:32:48 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.0%          0       0|0   317|0   124b     7k   323   12:32:49 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.0%          0       0|0   321|0   124b     7k   323   12:32:50 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.2%          0       0|0   313|0   212b     7k   322   12:32:51 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.0%          0       0|0   319|0   212b     7k   321   12:32:52 
        insert  query update delete getmore command flushes mapped  vsize    res faults  locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn       time 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.83g      0  test:0.0%          0      30|0   118|0   300b     7k   319   12:32:53 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.84g      0  test:0.0%          0       0|0   287|0   388b     7k   316   12:32:54 
            *0     *0     *0     *0       0     2|0       0   352m  12.2g  5.84g      0  test:0.0%          0       0|0   299|0     1k     9k   304   12:32:55 
            *0     *0     *0     *0       0     3|0       0   352m  12.2g  5.84g      0  test:0.0%          0       0|0   284|0     1k     9k   286   12:32:56 
            *0     *0     *0     *0       0     1|0       0   352m  12.2g  5.84g      0  test:0.0%          0       0|0   263|0     1k    10k   265   12:32:57 
            *0     *0     *0     *0       0     2|0       0   352m  12.1g  5.84g      0  test:0.0%          0       0|0   232|0     2k    12k   235   12:32:58 
            *0     *0     *0     *0       0     2|0       0   352m  12.1g  5.84g      0  test:0.0%          0       0|0   184|0     4k    15k   186   12:32:59 
            *0     *0     *0     *0       0     2|0       0   352m    12g  5.84g      0  test:0.0%          0       0|0   124|0     5k    17k   126   12:33:00 
            *0     *0     *0     *0       0     2|0       0   352m    12g  5.85g      0  test:0.0%          0       0|0    43|0     7k    20k    46   12:33:01 
            *0     *0     *0     *0       0     2|0       0   352m  11.9g  5.85g      0  test:0.0%          0       0|0     0|0     3k    14k     2   12:33:02 
        insert  query update delete getmore command flushes mapped  vsize    res faults  locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn       time 
            *0     *0     *0     *0       0     2|0       0   352m  11.9g  5.85g      0  test:0.0%          0       0|0     0|0   124b     7k     2   12:33:03 
            *0     *0     *0     *0       0     2|0       0   352m  11.9g  5.85g      0  test:0.0%          0       0|0     0|0   124b     7k     2   12:33:04 
            *0     *0     *0     *0       0     2|0       0   352m  11.9g  5.85g      0  test:0.0%          0       0|0     0|0   124b     7k     2   12:33:05 
            *0     *0     *0     *0       0     2|0       0   352m  11.9g  5.85g      0  test:0.0%          0       0|0     0|0   124b     7k     2   12:33:06

        netstat output

        I used netstat to report on the number of connections lingering in a CLOSE_WAIT state. Again, each command includes a date so we can correlate these numbers with the output from ab and mongostat.

        $ date; netstat -tonp 2>&1 | grep CLOSE | grep 127.0.0.1:27017 | wc -l
        Tue Aug 12 12:32:12 EDT 2014
        0
        [honeydew: ~] $ date; netstat -tonp 2>&1 | grep CLOSE | grep 127.0.0.1:27017 | wc -l
        Tue Aug 12 12:32:28 EDT 2014
        321
        [honeydew: ~] $ date; netstat -tonp 2>&1 | grep CLOSE | grep 127.0.0.1:27017 | wc -l
        Tue Aug 12 12:32:55 EDT 2014
        295
        [honeydew: ~] $ date; netstat -tonp 2>&1 | grep CLOSE | grep 127.0.0.1:27017 | wc -l
        Tue Aug 12 12:33:05 EDT 2014
        0

        Review

        After the benchmark completes, we see 321 connections in CLOSE_WAIT, which matches up with mongostat reporting 323 open connections (the extra two are those we had present before running the benchmark). During the benchmark, we had connections rise and fall based on load. After it completes, the active connections hang around in a CLOSE_WAIT state for about 30 seconds, after which point the server starts closing them. During that closing period, FPM's worker pool is still above the max spare servers and I found that it was reaping one or two worker processes a second until it reached pm.max_spare_servers of 50.

        I looked into why the connections were being closed server-side and it was due to memory limits reached during my sort queries:

        2014-08-12T13:47:17.514-0400 [conn752] ERROR: Runner error, stats:
        { "type" : "SORT",
          "works" : 38771,
          "yields" : 514,
          "unyields" : 514,
          "invalidates" : 0,
          "advanced" : 0,
          "needTime" : 38769,
          "needFetch" : 0,
          "isEOF" : 0,
          "forcedFetches" : 0,
          "memUsage" : 33555076,
          "memLimit" : 33554432,
          "children" : [ 
            { "type" : "COLLSCAN",
              "works" : 38769,
              "yields" : 514,
              "unyields" : 514,
              "invalidates" : 0,
              "advanced" : 38768,
              "needTime" : 1,
              "needFetch" : 0,
              "isEOF" : 0,
              "docsTested" : 38768,
              "children" : [] } ] }
        2014-08-12T13:47:17.872-0400 [conn752] assertion 17144 Runner error: Overflow sort stage buffered data usage of 33555076 bytes exceeds internal limit of 33554432 bytes ns:test.php1164 query:{ $query: {}, $orderby: { text: 1 } }
        2014-08-12T13:47:17.873-0400 [conn752] query test.php1164 query: { $query: {}, $orderby: { text: 1 } } planSummary: COLLSCAN ntoreturn:0 keyUpdates:0 exception: Runner error: Overflow sort stage buffered data usage of 33555076 bytes exceeds internal limit of 33554432 bytes code:17144 numYields:245 locks(micros) r:127374658 reslen:158 93891ms
        2014-08-12T13:47:17.873-0400 [conn752] end connection 127.0.0.1:51812 (857 connections now open)

        This actually didn't happen when I repeated the test against 2.4.10. Instead I would see the queries pile up and continue to run (visible with db.currentOp()) long after the PHP driver timed out. Running multiple benchmarks in a row could easily cause the connections to accumulate into the thousands (instead of the 300-400 peaks we saw in the stats above).

        I believe one problem here is that the driver isn't properly closing its connections when the worker process shuts down. We will look into this behavior and follow up after some more investigation.

        In the meantime, you may have some luck using $maxTimeMS in your queries (also available via MongoCursor::maxTimeMS()). This query operator was added in 2.6 and allows you to limit the amount of time a query can run on the server side. It functions a bit differently from the client-side timeout, as its limit applies to the total execution time of the query (across multiple batches) – this is mentioned in a paragraph on the PHP method documentation. Still, it should give you some immediate control over the situation until we can look into the driver some more.
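        As a rough illustration of that suggestion (reusing the collection and field names from the reproduction script above; the exception class name is my assumption for how the 1.5.x driver surfaces a server-side timeout):

        <?php
         
        // Hedged sketch: cap server-side execution at 2s via maxTimeMS(). When the
        // limit is hit, the server abandons the query and reports that back to the
        // driver, rather than the client walking away from a still-running query.
        $mc = new MongoClient("mongodb://test-server");
        $cursor = $mc->test->collection
            ->find(array("textfield" => new MongoRegex("/^Text/i")))
            ->maxTimeMS(2000);
         
        try {
            var_dump(iterator_to_array($cursor));
        } catch (MongoExecutionTimeoutException $e) {
            // the server halted the query at ~2s; no long-running operation is left behind
        }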

        Jeremy Mikola added a comment - edited

        I verified that the 1.5.5 driver does correctly shut down its open connections when the PHP worker terminates. To test this, I revised the query script (served via HTTP) to the following:

        <?php
         
        $mc = new MongoClient(null, ['socketTimeoutMS' => 100]);
        $coll = $mc->test->php1164;
        $cursor = $coll->find()->limit(10);
        var_dump(iterator_to_array($cursor));
        sleep(1);

        This query completes in far less than 100ms, so the one-second sleep() was added to ensure that the PHP workers actually get monopolized and FPM spins up many workers. Without the sleep, the 100,000-request benchmark only requires a few workers and I never see connection counts on the server rise beyond single digits. Most importantly, MongoDB is not left running a query after the driver moves on to the next request.

        After running the very same ab command (100,000 requests with 500 concurrent), the FPM pool hits a peak of 40-80 workers and then stabilizes with no more than 50 idle workers (this is expected based on pm.max_spare_servers). The following mongostat output records the server state from idle to benchmark execution to idle again. After execution completes at 17:06:36, the query count drops to 0 and we see 41 connections remain open, which matches the number of leftover FPM workers (plus one). At 17:06:40, I restarted FPM and we see the connection count immediately drop to 1.

        $ mongostat
        connected to: 127.0.0.1
        insert  query update delete getmore command flushes mapped  vsize    res faults  locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn       time 
            *0     *0     *0     *0       0     1|0       0   352m  12.9g  11.7g      0  test:0.0%          0       0|0     0|0    62b     3k     1   17:06:25 
            *0      5     *0     *0       0     1|0       1   352m  12.9g  11.7g      0  test:0.0%          0       0|0     0|0    62b     3k     1   17:06:26 
            *0      5     *0     *0       0    16|0       0   352m  12.9g  11.7g      0  test:0.0%          0       0|0     0|0     1k    54k     6   17:06:27 
            *0      6     *0     *0       0     4|0       0   352m  12.9g  11.7g      0  test:0.0%          0       0|0     0|0   713b    59k     7   17:06:28 
            *0      8     *0     *0       0     7|0       0   352m  12.9g  11.7g      0  test:0.0%          0       0|0     0|0     1k    78k     9   17:06:29 
            *0     12     *0     *0       0    13|0       0   352m  12.9g  11.7g      0  test:0.0%          0       0|0     0|0     1k   117k    13   17:06:30 
            *0     17     *0     *0       0    16|0       0   352m  12.9g  11.7g      0  test:0.0%          0       0|0     0|0     2k   163k    18   17:06:31 
            *0     22     *0     *0       0    21|0       0   352m  12.9g  11.7g      0  test:0.0%          0       0|0     0|0     2k   209k    23   17:06:32 
            *0     27     *0     *0       0    17|0       0   352m  12.9g  11.7g      0  test:0.0%          0       0|0     0|0     3k   255k    28   17:06:33 
            *0     32     *0     *0       0    18|0       0   352m  12.9g  11.7g      0  test:0.0%          0       0|0     0|0     3k   301k    33   17:06:34 
        insert  query update delete getmore command flushes mapped  vsize    res faults  locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn       time 
            *0     37     *0     *0       0    20|0       0   352m  12.9g  11.7g      0  test:0.0%          0       0|0     0|0     4k   346k    38   17:06:35 
            *0     33     *0     *0       0    15|0       0   352m  12.9g  11.7g      0  test:0.0%          0       0|0     0|0     3k   308k    41   17:06:36 
            *0     *0     *0     *0       0     1|0       0   352m  12.9g  11.7g      0  test:0.0%          0       0|0     0|0    62b     3k    41   17:06:37 
            *0     *0     *0     *0       0     1|0       0   352m  12.9g  11.7g      0  test:0.0%          0       0|0     0|0    62b     3k    41   17:06:38 
            *0     *0     *0     *0       0     1|0       0   352m  12.9g  11.7g      0  test:0.0%          0       0|0     0|0    62b     3k    41   17:06:39 
            *0     *0     *0     *0       0     1|0       0   352m  12.9g  11.7g      0  test:0.0%          0       0|0     0|0    62b     3k    41   17:06:40 
            *0     *0     *0     *0       0     1|0       0   352m  12.9g  11.7g      0  test:0.0%          0       0|0     0|0    62b     3k     1   17:06:41 
            *0     *0     *0     *0       0     1|0       0   352m  12.9g  11.7g      0  test:0.0%          0       0|0     0|0    62b     3k     1   17:06:42 
            *0     *0     *0     *0       0     1|0       0   352m  12.9g  11.7g      0  test:0.0%          0       0|0     0|0    62b     3k     1   17:06:43

        This tells me that the driver is correctly shutting down its open connections under normal operation. I believe the problem you're experiencing is specific to connections where the query is still executing server-side. In that case, socketTimeoutMS may be exacerbating the issue, since the driver gives up on the long-running query and the PHP extension happily closes its end of the connection during worker/process shutdown – meanwhile, the server is still churning away on the query and keeps the connection open. I would definitely suggest using $maxTimeMS in this case to limit server-side execution for any query at risk of running long.


        I revisited the original long-running sort query (still using socketTimeoutMS), but added $maxTimeMS to the query:

        <?php
         
        $mc = new MongoClient(null, ['socketTimeoutMS' => 100]);
        $coll = $mc->test->php1164;
        $cursor = $coll->find()->maxTimeMS(2000)->sort(['text' => 1]);
        foreach ($cursor as $_) {}

        Predictably, this resulted in a MongoCursorTimeoutException for each request; however, the addition of $maxTimeMS ensured that the server abandoned the queries after two seconds of execution, which resulted in the connections closing properly on both ends under the same benchmark load. In the following mongostat output, we go from idle to benchmark execution to idle again, but actually see the connection count drop all the way back to the idle number:

        $ mongostat
        connected to: 127.0.0.1
        insert  query update delete getmore command flushes mapped  vsize    res faults  locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn       time 
            *0     *0     *0     *0       0     1|0       0   608m  13.8g    12g      0  test:0.0%          0       0|0     0|0    62b     3k     2   17:19:40 
            *0     *0     *0     *0       0     1|0       0   608m  13.8g    12g      0  test:0.0%          0       0|0     0|0    62b     3k     2   17:19:41 
            *0     *0     *0     *0       0     1|0       0   608m  13.8g    12g      0  test:0.0%          0       0|0     0|0    62b     3k     2   17:19:42 
            *0     *0     *0     *0       0     1|0       0   608m  13.8g    12g      0  test:0.0%          0       0|0     0|0    62b     3k     2   17:19:43 
            *0     *0     *0     *0       0     1|0       0   608m  13.8g    12g      0  test:0.0%          0       0|0     0|0    62b     3k     2   17:19:44 
            *0     *0     *0     *0       0     1|0       0   608m  13.8g    12g      0  test:0.0%          0       0|0     0|0    62b     3k     2   17:19:45 
            *0     *0     *0     *0       0     1|0       0   608m  13.8g    12g      0  test:0.0%          0       0|0     0|0    62b     3k     2   17:19:46 
            *0    257     *0     *0     161   784|0       0   608m    14g    12g      0  test:0.0%          0      20|0   155|0    72k    23m   238   17:19:47 
            *0    297     *0     *0     103   893|0       0   608m    14g    12g      0  test:0.0%          0       3|0   221|0    79k    20m   249   17:19:49 
            *0    306     *0     *0     132   906|0       0   608m    14g    12g      0  test:0.0%          0       3|0   192|0    90k    27m   226   17:19:50 
        insert  query update delete getmore command flushes mapped  vsize    res faults  locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn       time 
            *0    287     *0     *0     106   884|0       0   608m  14.1g    12g      0  test:0.0%          0      47|1   195|0    85k    25m   237   17:19:51 
            *0    399     *0     *0     117  1180|0       0   608m    14g    12g      0  test:0.0%          0     136|0   216|0   114k    38m   249   17:19:52 
            *0    291     *0     *0     131   884|0       0   608m  14.1g    12g      0  test:0.0%          0     275|1   283|0    84k    24m   330   17:19:53 
            *0    381     *0     *0     126  1192|0       0   608m  14.1g    12g      0  test:0.0%          0     170|0   244|0   114k    34m   334   17:19:54 
            *0    265     *0     *0      78   736|0       0   608m    14g    12g      0  test:0.0%          0       1|0   139|0    74k    27m   244   17:19:55 
            *0    353     *0     *0     166  1081|0       0   608m  14.1g    12g      0  test:0.0%          0     160|0   262|0   102k    29m   320   17:19:56 
            *0    105     *0     *0      38   293|0       0   608m  13.8g    12g      0  test:0.0%          0       0|0     0|0    38k    16m     2   17:19:57 
            *0     *0     *0     *0       0     1|0       0   608m  13.8g    12g      0  test:0.0%          0       0|0     0|0    62b     3k     2   17:19:58 

        The final drop is due to behavior in the driver which closes a connection after certain error conditions, a client-side timeout being one of them. Therefore, even though FPM's worker pool still has 50 workers remaining after the benchmark completes, each of those active PHP processes saw the driver close its persistent connection after the last request and exception. On MongoDB's side, these long-running queries are terminated after two seconds and the server is free to close its end of the connection, too. This brings us back to the idle connection count.

        Given these results, I don't believe there is any behavior to change in the driver for this particular issue. As mentioned above, I would definitely suggest looking into $maxTimeMS for these problematic queries (if they can't be further optimized with indexing/schema changes). And hopefully the previous log is helpful if you decide to still use socketTimeoutMS alongside $maxTimeMS.

        Edmunds Kalnins added a comment - edited

        Unfortunately we are unable to upgrade production to 2.6 because this issue is still hitting us. I didn't answer earlier because I couldn't pinpoint this situation in testing after adding maxTimeMS() to find queries. But it does, unfortunately, happen in production – it only takes about 10 minutes for connections to start piling up and the primary server to become unresponsive after we try upgrading (with queries now using maxTimeMS()). Again – everything "Works as Designed" after downgrading to 2.4.

        Some notes about our setup that might be helpful: we are currently serving ~400 ops/sec, peaking at 1k ops/sec, on the primary server (with 2 secondaries each serving an additional ~20% of that). The production servers normally have about 300 connections – we use 5 application servers running php-fpm with a static child count (pm = static; pm.max_children = 60; pm.max_requests = 1500). Some queries will always be unindexed and will take more than a few seconds. We need to allow this, even allowing some queries to simply time out. The catch-22 here is that some (but not all) of these slow queries would be solved by using a text index, which is only available in 2.6.

        Anyway, I will continue trying to create a test that reliably recreates this situation and open a new issue or reopen this (or apologize for taking your time, if it turns out to be our mistake).

        Jeremy Mikola added a comment -

        Edmunds Kalnins: When you added $maxTimeMS, did you stop using socketTimeoutMS? If you're using both, I'd be curious to see which values you have for each. Ideally, your client-side timeout should exceed $maxTimeMS by some additional padding, so you can avoid the driver abandoning in-progress queries entirely – instead waiting for the server to report that it halted execution after it surpassed $maxTimeMS. That said, the client-side timeout is still helpful if you're worried about detecting dropped connections quickly (e.g. a replica set failover).
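        A minimal sketch of that pairing, with illustrative values only (20s server-side limit, 25s client-side socket timeout as padding; the exception class name is assumed):

        <?php
         
        // Hedged sketch: the server gives up at 20s and tells the client so; the
        // 25s socket timeout only fires if the connection itself is broken.
        $mc = new MongoClient("mongodb://test-server", ['socketTimeoutMS' => 25000]);
        $cursor = $mc->testdb->testcoll
            ->find(array("profile.name" => new MongoRegex("/^Edmunds/i")))
            ->maxTimeMS(20000);
         
        try {
            iterator_to_array($cursor);
        } catch (MongoExecutionTimeoutException $e) {
            // server-side limit reached and reported back; the connection stays usable
        }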

        If the PHP requests are not abandoning connections, I wouldn't expect to see more than 300 connections (5 * 60).

        The mongod logs should also be reporting the open/closed connections and completed/halted queries (you may need to bump up verbosity). If this is easily reproducible on 2.6, I'd be curious to see a snippet of those logs alongside a summary of what's happening on the PHP side.

        Hide
        edmnc Edmunds Kalnins added a comment -

        We do not use socketTimeoutMS, only maxTimeMS (20 secs). If the PHP side times out, it would be because of either FPM's request_terminate_timeout (120 secs) or max_execution_time (30 secs by default).

        When connections start piling up we go to >1000 connections. I can simulate this in testing easily. Locally on my laptop I have installed nginx and php-fpm with pm = ondemand and pm.max_children = 20. If I construct a request that takes more than 5 secs on the database side (while having maxTimeMS=5000) and set socketTimeoutMS=200 so that I can generate lots of new requests, the connection count goes above 300 even when using ab -c 20. But in testing it never really takes the DB down – I can still create new requests without a problem.
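
        Roughly, what I am doing in that test looks like this (a sketch – the host, namespace, and the unindexed regex query are placeholders for the real ones). socketTimeoutMS is deliberately far below maxTimeMS, so PHP abandons the query while the server is still executing it:

        <?php
        // Sketch of the repro: short client-side socket timeout (200 ms)
        // against a query limited server-side to 5000 ms.
        $mc = new MongoClient("mongodb://test-server", array("socketTimeoutMS" => 200));
        $coll = $mc->testdb->testcoll;

        try {
            $it = $coll->find(array("textfield" => new MongoRegex("/^Text/i")))
                       ->maxTimeMS(5000);
            iterator_to_array($it);
        } catch (MongoCursorTimeoutException $e) {
            // The client gives up after ~200 ms and the request ends here,
            // while the server keeps working on the query.
            echo "client timed out: " . $e->getMessage();
        }
        ?>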

        So currently I am trying to find out if there are some requests or commands that take much longer on 2.6 than on 2.4 and that either a) ignore maxTimeMS, or b) are ones we don't apply maxTimeMS to for some reason. No luck yet.

        Hide
        jmikola Jeremy Mikola added a comment -

        Edmunds Kalnins: $maxTimeMS applies a limit on actual processing time, which will not include time spent yielding. In a high-request environment with page faults and the like, it's quite possible that queries will take a bit longer than $maxTimeMS to return to the client; however, it does seem unlikely that there'd be an extra 10 seconds to hit max_execution_time. Can you find any correlation between PHP requests killed via FPM's limit and rising connections on MongoDB's side?

        But in testing it never really takes the DB down – I can still create new requests without a problem.

        In your testing environment, are you piling up more connections than your actual production environment? Is there any reason you can think of that the test environment can handle the increased connection load that production appears to choke on?

        I assume the queries running against your test environment (your local machine if I understand correctly) are more contrived than those running on production. Can you possibly collect the slow production queries using the database profiler with a threshold?
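
        For reference, the profiler can be enabled and read back from the same legacy driver – a sketch, with the database name and the 1000 ms threshold as placeholders you'd adjust:

        <?php
        // Sketch: enable profiling for operations slower than 1000 ms on one
        // database, then read back the captured entries. Database name and
        // threshold are placeholders.
        $mc = new MongoClient("mongodb://test-server");
        $db = $mc->testdb;

        // Profiling level 1 records only operations slower than slowms.
        $db->command(array("profile" => 1, "slowms" => 1000));

        // Profiled operations land in the system.profile collection.
        $slow = $db->selectCollection("system.profile")
                   ->find(array("millis" => array('$gt' => 1000)))
                   ->sort(array("ts" => -1))
                   ->limit(10);

        foreach ($slow as $op) {
            printf("%s took %d ms\n", $op['op'], $op['millis']);
        }
        ?>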

        Hide
        derick Derick Rethans added a comment -

        The old legacy PHP driver "mongo" is no longer maintained. A new driver "mongodb" has replaced the old driver, and is the only one also supporting PHP 7. As we do not support the old driver any more (except for very critical issues), we are closing all associated bug reports.

        The new driver "mongodb" has a different API and a different architecture. The reasoning for this is explained in a series of blog posts:

        • https://derickrethans.nl/new-drivers.html
        • https://derickrethans.nl/new-drivers-part2.html

        In order to help migrating to the new driver, there is a "mongo-php-adapter", created by our community, and available through composer (https://packagist.org/packages/alcaeus/mongo-php-adapter). You can use this as a stop-gap measure to migrate from the old driver, to the new driver + mongo-php-library (https://packagist.org/packages/mongodb/mongodb).

        But in the end, you ideally should switch to the "mongodb" PECL extension in combination with the mongo-php-library. An upgrade guide is available at http://mongodb.github.io/mongo-php-library/upgrade-guide/
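
        For comparison, the same kind of query under the new extension + library looks roughly like this (a sketch – the host, database, and collection names are placeholders):

        <?php
        // Sketch using the "mongodb" extension with the mongodb/mongodb
        // library (composer require mongodb/mongodb). Names are placeholders.
        require 'vendor/autoload.php';

        $client = new MongoDB\Client(
            'mongodb://test-server',
            ['socketTimeoutMS' => 25000]   // client-side padding
        );

        $cursor = $client->testdb->testcoll->find(
            ['textfield' => new MongoDB\BSON\Regex('^Text', 'i')],
            ['maxTimeMS' => 20000]         // server-side limit
        );

        var_dump(iterator_to_array($cursor));
        ?>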

        If you have any further questions, feel free to reach out to the mongodb-user Google Group (https://groups.google.com/forum/#!forum/mongodb-user). Mentioning PHP in the subject will help draw our attention to those posts.


          People

          • Votes:
            1
          • Watchers:
            6

          Dates

          • Days since reply:
            38 weeks ago