[SERVER-12878] Aggregations that do not return cursors need to clean up internal cursors Created: 25/Feb/14  Updated: 11/Jul/16  Resolved: 05/Mar/14

Status: Closed
Project: Core Server
Component/s: Performance
Affects Version/s: 2.6.0-rc0
Fix Version/s: 2.6.0-rc1

Type: Bug Priority: Major - P3
Reporter: Tim Callaghan Assignee: Mathias Stearn
Resolution: Done Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 13.10


Attachments: PNG File SERVER-12878-249-aggr.png     PNG File SERVER-12878-249-qry.png     PNG File SERVER-12878-249-readlock.png     PNG File SERVER-12878-249-write.png     PNG File SERVER-12878-249-writelock.png     PNG File SERVER-12878-249.png     PNG File SERVER-12878-rc1-aggr.png     PNG File SERVER-12878-rc1-qry.png     PNG File SERVER-12878-rc1-readlock.png     PNG File SERVER-12878-rc1-write.png     PNG File SERVER-12878-rc1-writelock.png     PNG File SERVER-12878.png     File prof-0.svg     File prof-8.svg    
Issue Links:
Depends
Related
  • related to SERVER-13115 Delete operations should parse querie... Closed
  • is related to SERVER-13271 remove surplus projection from distinct Closed
Backwards Compatibility: Fully Compatible
Operating System: Linux
Steps To Reproduce:
  • start a MongoDB 2.4.9 or 2.6.0-rc0 server
  • on the same server,
    • git clone https://github.com/tmcallaghan/sysbench-mongodb.git
    • cd sysbench-mongodb
    • edit run.simple.bash to modify the benchmark behavior
      • change "export NUM_DOCUMENTS_PER_COLLECTION=10000000" to "export NUM_DOCUMENTS_PER_COLLECTION=1000000" (10 million down to 1 million). This keeps the loaded data under 8GB.
    • Note: you must have Ant and Java 1.7 installed to run the benchmark application. If using Java 1.6, change the build.xml file's target="1.7" to target="1.6". You also need the MongoDB Java driver in your CLASSPATH, as in "export CLASSPATH=/home/tcallaghan/java_goodies/mongo-2.11.2.jar:.".
    • ./run.simple.bash

The benchmark runs in two stages:
1. 16 collections are created, and each is loaded with 1 million documents. Eight loader threads run simultaneously, each with its own collection.
2. 64 threads run the Sysbench workload.

Both stages output their respective performance.

  • I'm getting around 30K inserts per second in the loader stage.
  • For the mixed workload I'm getting 550 TPS on 2.4.9, but less than 10 TPS on 2.6.0-rc0.
Participants:

 Description   

Aggregation commands that do not return a cursor need to release the ClientCursor object that they create to do their work. As of 2.6.0-rc0, they do not. This leads to steadily increasing work for every update and delete operation, as each must scan those zombie cursors and notify them of relevant events (such as document deletions and moves).
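
For context, a minimal sketch of the kind of call that hits this path, using the 2.11.x Java driver mentioned in the report (the database, collection, and field names are illustrative, not the exact sysbench-mongodb query). DBCollection.aggregate() runs the aggregate command without requesting a cursor, so the result comes back inline; under rc0 the internal ClientCursor built to execute the pipeline was then never freed.

    import com.mongodb.AggregationOutput;
    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import com.mongodb.MongoClient;

    public class InlineAggExample {
        public static void main(String[] args) throws Exception {
            MongoClient client = new MongoClient("localhost", 27017);
            DB db = client.getDB("sbtest");                 // hypothetical db/collection names
            DBCollection coll = db.getCollection("sbtest1");

            // $match + $group pipeline whose result is returned inline; no cursor is
            // handed back to the client, so the server-side ClientCursor used to run
            // it should be released as soon as the command completes.
            DBObject match = new BasicDBObject("$match",
                    new BasicDBObject("_id", new BasicDBObject("$gte", 0).append("$lt", 100)));
            DBObject group = new BasicDBObject("$group",
                    new BasicDBObject("_id", null).append("total", new BasicDBObject("$sum", "$k")));

            AggregationOutput out = coll.aggregate(match, group);
            for (DBObject doc : out.results()) {
                System.out.println(doc);
            }
            client.close();
        }
    }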

Original description / steps to reproduce:

In running benchmarks to compare the performance of 2.4.9 vs. 2.6.0-rc0, I see a significant drop in my Sysbench workload.

The performance of both versions is comparable for loading data (insert only), but in the mixed workload phase (reads, aggregation, update, delete, insert) the performance of 2.6.0-rc0 is 50x lower on an in-memory test. The drop is similar for larger-than-RAM tests as well.

I have confirmed this on my desktop (single Corei7) and a server (dual Xeon 5560), both running Ubuntu 13.10, with read-ahead set to 32.

I'm running with Java driver version 2.11.2.



 Comments   
Comment by Githook User [ 19/Mar/14 ]

Author: Hari Khalsa (hkhalsa, hkhalsa@10gen.com)

Message: SERVER-10026 SERVER-13271 SERVER-12878 SERVER-2094 avoid surplus projections in distinct
Branch: v2.6
https://github.com/mongodb/mongo/commit/094f1565d6f82859bc38300b45564dd1ea9f070e

Comment by Githook User [ 19/Mar/14 ]

Author: Hari Khalsa (hkhalsa, hkhalsa@10gen.com)

Message: SERVER-10026 SERVER-13271 SERVER-12878 SERVER-2094 avoid surplus projections in distinct
Branch: master
https://github.com/mongodb/mongo/commit/267f56a7e0ce36eba21b4b2ef09e32a43370acbf

Comment by Tim Callaghan [ 05/Mar/14 ]

Andy, I agree, and I think that's what Hari originally did for this ticket in this commit.

https://github.com/mongodb/mongo/commit/a3d910925350d2f0204b41ea145e24f74e5c39ce

Comment by Githook User [ 05/Mar/14 ]

Author: Mathias Stearn (RedBeard0531, mathias@10gen.com)

Message: SERVER-12878 Delete ClientCursor in agg unless returning a cursor to user.
Branch: master
https://github.com/mongodb/mongo/commit/8d43b5cb9949c16452cb8d949c89d94cab9c8bad

Comment by Andy Schwerin [ 05/Mar/14 ]

Tim, if so, someone must have done something since RC0 to improve the query planning problem you were seeing; perhaps hari.khalsa@10gen.com's SERVER-12915, and possibly some other changes as well. With the resource leak patched (commit coming), I see the same performance in 2.6.0-rc1-pre- and 2.4.9. I encourage you to run against RC1, which should be out this week, or off of master after this ticket is resolved.

For reference this is what I see in 2.4.9 for the last hundred seconds on an Intel Core i7-2600 CPU @ 3.40GHz, with 16GB of RAM and data stored on a passable Intel SSD:

Thread[Thread-66,5,main]510 seconds : cum tps=636.60 : int tps=706.11 : cum ips=0.00 : int ips=0.00
Thread[Thread-66,5,main]520 seconds : cum tps=639.59 : int tps=791.79 : cum ips=0.00 : int ips=0.00
Thread[Thread-66,5,main]530 seconds : cum tps=637.40 : int tps=523.68 : cum ips=0.00 : int ips=0.00
Thread[Thread-66,5,main]540 seconds : cum tps=633.70 : int tps=437.45 : cum ips=0.00 : int ips=0.00
Thread[Thread-66,5,main]550 seconds : cum tps=633.70 : int tps=633.99 : cum ips=0.00 : int ips=0.00
Thread[Thread-66,5,main]560 seconds : cum tps=629.76 : int tps=412.91 : cum ips=0.00 : int ips=0.00
Thread[Thread-66,5,main]570 seconds : cum tps=631.99 : int tps=756.72 : cum ips=0.00 : int ips=0.00
Thread[Thread-66,5,main]580 seconds : cum tps=634.03 : int tps=751.69 : cum ips=0.00 : int ips=0.00
Thread[Thread-66,5,main]590 seconds : cum tps=636.66 : int tps=789.27 : cum ips=0.00 : int ips=0.00
Thread[Thread-66,5,main]600 seconds : cum tps=631.57 : int tps=330.94 : cum ips=0.00 : int ips=0.00

And this is what I see for 2.6.0-RC1-pre-:

Thread[Thread-66,5,main]510 seconds : cum tps=636.79 : int tps=712.50 : cum ips=0.00 : int ips=0.00
Thread[Thread-66,5,main]520 seconds : cum tps=638.70 : int tps=735.96 : cum ips=0.00 : int ips=0.00
Thread[Thread-66,5,main]530 seconds : cum tps=632.31 : int tps=300.56 : cum ips=0.00 : int ips=0.00
Thread[Thread-66,5,main]540 seconds : cum tps=631.20 : int tps=572.19 : cum ips=0.00 : int ips=0.00
Thread[Thread-66,5,main]550 seconds : cum tps=632.64 : int tps=710.50 : cum ips=0.00 : int ips=0.00
Thread[Thread-66,5,main]560 seconds : cum tps=631.15 : int tps=549.06 : cum ips=0.00 : int ips=0.00
Thread[Thread-66,5,main]570 seconds : cum tps=628.61 : int tps=485.06 : cum ips=0.00 : int ips=0.00
Thread[Thread-66,5,main]580 seconds : cum tps=630.93 : int tps=763.01 : cum ips=0.00 : int ips=0.00
Thread[Thread-66,5,main]590 seconds : cum tps=632.65 : int tps=732.39 : cum ips=0.00 : int ips=0.00
Thread[Thread-66,5,main]600 seconds : cum tps=625.19 : int tps=185.55 : cum ips=0.00 : int ips=0.00

I assume from all of the results (yours and mine) that the last 10 seconds might see some amount of cool-down behavior. I've typically been ignoring it.

Comment by Tim Callaghan [ 05/Mar/14 ]

Andy, no problem at all, although my original ticket name was likely appropriate: I think there was also a big issue with query planning within the workload, and your new discovery is yet another issue that my benchmark uncovered.

Comment by Andy Schwerin [ 05/Mar/14 ]

tmcallaghan, thanks for the excellent repro. I've updated the ticket summary and description to describe the underlying bug that you unmasked.

Comment by Andy Schwerin [ 05/Mar/14 ]

I did a differential profile on 2.6.0-rc1, comparing the first 30 seconds of the sysbench (second) stage with the penultimate 30 seconds. The good news is that RC1 is substantially faster than RC0, but it still shows the performance decay pattern that tmcallaghan found, and still has a 7x performance degradation over the life of the benchmark.

The attached SVG files show call graphs weighted by the number of samples associated with each node. Note that CollectionCursorCache::invalidateDocument goes from around 14% of execution time to almost 60% of execution time over the life of the benchmark, and a substantial amount of time in any event is spent in shared pointer atomic arithmetic and red-black tree traversal.

I will investigate further. There's a reasonable likelihood of a resource leak in the client cursor cache, or a behavior change from 2.4 about cursor expiration.
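
For anyone who wants to watch for this while the benchmark runs, a minimal sketch using the 2.11.x Java driver (the layout of the cursor counters in serverStatus differs between 2.4 and 2.6, so this just prints the whole "cursors" section rather than assuming a specific field):

    import com.mongodb.CommandResult;
    import com.mongodb.DB;
    import com.mongodb.MongoClient;

    public class CursorWatch {
        public static void main(String[] args) throws Exception {
            MongoClient client = new MongoClient("localhost", 27017);
            DB admin = client.getDB("admin");

            // Poll serverStatus every 10 seconds for 10 minutes; a steadily growing
            // open-cursor count while the workload only issues short-lived commands
            // points at leaked server-side cursors.
            for (int i = 0; i < 60; i++) {
                CommandResult status = admin.command("serverStatus");
                System.out.println(status.get("cursors"));
                Thread.sleep(10000L);
            }
            client.close();
        }
    }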

Comment by Daniel Pasette (Inactive) [ 27/Feb/14 ]

Yep, that's why the ticket is still open. We're still looking at it. Thanks for testing. We will update here with what we find.

Comment by James Blackburn [ 27/Feb/14 ]

I ran Tim's test using master, which includes Hari's change a3d910..., and while it's better, it still looks 3-4x worse than 2.4.6:

with:
+export NUM_DOCUMENTS_PER_COLLECTION=100000

2.4.6, I see:

     [java] Thread[Thread-66,5,main]580 seconds : cum tps=469.46 : int tps=435.33 : cum ips=0.00 : int ips=0.00
     [java] Thread[Thread-66,5,main]590 seconds : cum tps=468.82 : int tps=431.83 : cum ips=0.00 : int ips=0.00
     [java] Thread[Thread-66,5,main]600 seconds : cum tps=468.30 : int tps=437.59 : cum ips=0.00 : int ips=0.00

3709484b65d0c153d4d52713e8466b7b3a791233

     [java] Thread[Thread-66,5,main]580 seconds : cum tps=202.98 : int tps=122.27 : cum ips=0.00 : int ips=0.00
     [java] Thread[Thread-66,5,main]590 seconds : cum tps=201.59 : int tps=121.14 : cum ips=0.00 : int ips=0.00
     [java] Thread[Thread-66,5,main]600 seconds : cum tps=200.23 : int tps=120.11 : cum ips=0.00 : int ips=0.00

Comment by Tim Callaghan [ 26/Feb/14 ]

I'll rerun my test when RC1 is available.

Comment by hari.khalsa@10gen.com [ 26/Feb/14 ]

Hello! Thank you for this ticket. I identified and fixed a major performance issue exposed by this suite.

In brief, the server decided to use a collection scan when it should have used an index scan.

It made this decision because neither the indexed plan nor the collection scan produced any results while we were comparing the two plans to see which was better. In the absence of information about how well the plans did relative to one another, we chose the collection scan. This is a departure from 2.4's behavior, where we wouldn't consider a collection scan if an indexed plan was available. I submitted a patch to master to restore this behavior:

https://github.com/mongodb/mongo/commit/a3d910925350d2f0204b41ea145e24f74e5c39ce

Feel encouraged to test it out yourself by pulling the latest code, or just wait for RC1, which will include this patch.
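
For anyone verifying the planner behavior locally, a minimal sketch using the 2.11.x Java driver (the query shape, database, and collection names are illustrative, not the exact benchmark query; the "cursor" field comes from the pre-3.0 explain format):

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import com.mongodb.MongoClient;

    public class ExplainCheck {
        public static void main(String[] args) throws Exception {
            MongoClient client = new MongoClient("localhost", 27017);
            DB db = client.getDB("sbtest");                 // hypothetical db/collection names
            DBCollection coll = db.getCollection("sbtest1");

            // Ordered range query in roughly the shape the benchmark issues. With the
            // fix, explain() should report an index cursor (e.g. "BtreeCursor _id_")
            // rather than "BasicCursor", i.e. no COLLSCAN.
            DBObject range = new BasicDBObject("_id",
                    new BasicDBObject("$gte", 1000).append("$lte", 1100));
            DBObject explain = coll.find(range)
                                   .sort(new BasicDBObject("_id", 1))
                                   .explain();
            System.out.println("cursor: " + explain.get("cursor"));
            client.close();
        }
    }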

Thanks again!

Comment by Tim Callaghan [ 26/Feb/14 ]

Good; I assume those COLLSCAN lines in my server's log file are an optimizer decision gone bad.

As I said, I'm kicking off my full suite of benchmarks and will report anything else I find as an additional ticket.

Comment by Eric Milkie [ 26/Feb/14 ]

Hi Tim,
We reproduced it and are making progress. It appears that the ordered range query is the dominant cause of the regression.

Comment by Tim Callaghan [ 26/Feb/14 ]

Please let me know if you can reproduce my results. In the meantime, I'll run my full benchmark suite comparing 2.4.9 vs. 2.6.0-rc0 to see if there are other regressions.

Comment by Tim Callaghan [ 26/Feb/14 ]

Sure, I just reran the test on my desktop.

  • hardware = Core i7-2600K, 32GB RAM, Samsung 830 SSD, read-ahead=32.
  • changed run.simple.bash to insert 1 million rows per collection (not 10 million), making this an in-memory test

MongoDB 2.4.9
load phase = 30,472 documents inserted per second
last 100 seconds of the 10-minute benchmark:
[java] Thread[Thread-66,5,main]500 seconds : cum tps=547.38 : int tps=512.29 : cum ips=0.00 : int ips=0.00
[java] Thread[Thread-66,5,main]510 seconds : cum tps=548.47 : int tps=603.12 : cum ips=0.00 : int ips=0.00
[java] Thread[Thread-66,5,main]520 seconds : cum tps=549.37 : int tps=595.16 : cum ips=0.00 : int ips=0.00
[java] Thread[Thread-66,5,main]530 seconds : cum tps=549.28 : int tps=544.66 : cum ips=0.00 : int ips=0.00
[java] Thread[Thread-66,5,main]540 seconds : cum tps=545.50 : int tps=345.22 : cum ips=0.00 : int ips=0.00
[java] Thread[Thread-66,5,main]550 seconds : cum tps=546.86 : int tps=620.30 : cum ips=0.00 : int ips=0.00
[java] Thread[Thread-66,5,main]560 seconds : cum tps=546.64 : int tps=534.42 : cum ips=0.00 : int ips=0.00
[java] Thread[Thread-66,5,main]570 seconds : cum tps=544.97 : int tps=451.58 : cum ips=0.00 : int ips=0.00
[java] Thread[Thread-66,5,main]580 seconds : cum tps=546.00 : int tps=604.52 : cum ips=0.00 : int ips=0.00
[java] Thread[Thread-66,5,main]590 seconds : cum tps=546.36 : int tps=567.35 : cum ips=0.00 : int ips=0.00
[java] Thread[Thread-66,5,main]600 seconds : cum tps=542.87 : int tps=336.90 : cum ips=0.00 : int ips=0.00

MongoDB 2.6.0-rc0
load phase = 33,323 documents inserted per second (not bad, ~10% improvement)
last 100 seconds of the 10-minute benchmark:
[java] Thread[Thread-66,5,main]500 seconds : cum tps=11.04 : int tps=11.49 : cum ips=0.00 : int ips=0.00
[java] Thread[Thread-66,5,main]510 seconds : cum tps=11.03 : int tps=10.49 : cum ips=0.00 : int ips=0.00
[java] Thread[Thread-66,5,main]520 seconds : cum tps=11.03 : int tps=11.10 : cum ips=0.00 : int ips=0.00
[java] Thread[Thread-66,5,main]530 seconds : cum tps=11.04 : int tps=11.39 : cum ips=0.00 : int ips=0.00
[java] Thread[Thread-66,5,main]540 seconds : cum tps=11.04 : int tps=10.99 : cum ips=0.00 : int ips=0.00
[java] Thread[Thread-66,5,main]550 seconds : cum tps=11.05 : int tps=11.69 : cum ips=0.00 : int ips=0.00
[java] Thread[Thread-66,5,main]560 seconds : cum tps=11.05 : int tps=10.79 : cum ips=0.00 : int ips=0.00
[java] Thread[Thread-66,5,main]570 seconds : cum tps=11.05 : int tps=11.29 : cum ips=0.00 : int ips=0.00
[java] Thread[Thread-66,5,main]580 seconds : cum tps=11.05 : int tps=11.09 : cum ips=0.00 : int ips=0.00
[java] Thread[Thread-66,5,main]590 seconds : cum tps=11.06 : int tps=11.49 : cum ips=0.00 : int ips=0.00
[java] Thread[Thread-66,5,main]600 seconds : cum tps=11.05 : int tps=10.79 : cum ips=0.00 : int ips=0.00

Comment by Eric Milkie [ 25/Feb/14 ]

Hi Tim,
Can you post a sample run of sysbench so I can compare your results with mine locally?
Thanks
-Eric

Comment by Tim Callaghan [ 25/Feb/14 ]

Please let me know if there is any additional data I can provide.
