[SERVER-21754] CursorManager Lock is very hot Created: 03/Dec/15 Updated: 06/Dec/17 Resolved: 26/May/17
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | None |
| Fix Version/s: | 3.5.8 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Jim Van Fleet | Assignee: | Charlie Swanson |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Query 10 (02/22/16), Query 14 (05/13/16), Query 15 (06/03/16), Query 2017-03-27, Query 2017-04-17, Query 2017-05-08, Query 2017-05-29 |
| Participants: | |
| Description |
Under heavy load the CursorManager lock causes negative scalability.
| Comments |
| Comment by Charlie Swanson [ 11/Jul/17 ] |

tdeneau, to specifically address your questions:

1. Yes, that work will be tracked by the new ticket. I don't know exactly when it will happen, but you can watch it for updates! If you're particularly motivated, I would encourage you to try it yourself and open a pull request if it works!

2. Not that I know of. We could conceivably add some sort of parameter to disable the top reporting (see the top command docs), but I don't think we have such a thing today.
| Comment by Charlie Swanson [ 11/Jul/17 ] |

Hi all,

Thank you for all your work investigating. I've opened a new ticket SERVER-30085 which I've linked to this one. Let's continue to track this investigation on that ticket, posting new findings there please. As I mentioned above, this ticket was scheduled mostly because we anticipated recent changes would increase contention on this already hot mutex. We did not actually solve the larger problem of increasing read scalability at high core counts though, so that work will be tracked by the new ticket, rather than re-opening this one. We really do appreciate all the investigation being done here, thanks for the help!

Charlie
| Comment by Tom Deneau [ 08/Jul/17 ] |

Charlie – I have built from the tip of master. My problem: I am still seeing poor scaling when trying to run the server on 64 cores (I am doing a read-only test, ycsb workloadc, with as many ycsb clients running as needed to keep the server cores busy).

When I run lttng traces with tracepoints in pthread_mutex_lock/unlock and kernel tracepoints in syscall_futex_enter/exit, I see that the previously hot Cursor lock, now partitioned, is no longer an issue. However, I am still seeing some pretty heavily contended locks. I am not sure of the best way to map these to mutexes in the source code while the server is busy (suggestions welcome), but using gdb I saw that pthread_mutex_lock was called on these mutex addresses occasionally even when the server is "quiet". From this, I can see that the busy locks are:

Are there any plans to address these contended lock issues, perhaps by additional partitioning? Also, are there any configuration options that would eliminate the need to go through the code that calls these locks (on something like a ycsb db read-only benchmark)?
| Comment by Jim Van Fleet [ 28/Jun/17 ] |

Charlie – I ran with your latest branch and collected performance data. There are some cache line collisions in futexes, but on lightly used futexes. There may be other false sharing, but I cannot directly detect that. I made a quick hack to the definition of SimpleMutex to align the mutex on a cache line boundary and added padding to fill the cache line. There was a very small improvement in overall performance and, of course, there were fewer obvious cache line overlaps in the futexes. Though it does not seem to be an issue now, it would probably be advisable to align mutexes/futexes to a processor-dependent cache line size and pad to fill the cache line – it probably would not hurt to align and pad all mutexes. In short, I do not think that there is a performance degradation because of the partitioned locks being in the same cache line.
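For illustration, here is a minimal sketch of the kind of alignment/padding hack described above. This is an assumption about what such a change could look like, not the actual SimpleMutex modification; the type name and the 64-byte line size are hypothetical (POWER8 systems, for example, use 128-byte lines).

```cpp
#include <cstddef>
#include <mutex>

// Hypothetical cache line size; a real change would derive this from the
// target processor (e.g. 64 bytes on x86-64, 128 bytes on POWER8).
constexpr std::size_t kCacheLineBytes = 64;

// A mutex aligned to a cache line boundary. alignas() also rounds the struct
// size up to a multiple of the alignment, so each PaddedMutex occupies whole
// cache lines and two adjacent instances (e.g. neighboring lock partitions
// stored in an array) cannot false-share a line.
struct alignas(kCacheLineBytes) PaddedMutex {
    std::mutex mutex;
};

static_assert(sizeof(PaddedMutex) % kCacheLineBytes == 0,
              "PaddedMutex must fill whole cache lines");
static_assert(alignof(PaddedMutex) == kCacheLineBytes,
              "PaddedMutex must start on a cache line boundary");
```

Usage would simply be declaring arrays or members of PaddedMutex instead of a bare mutex wherever partitioned locks sit side by side in memory.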
| Comment by Charlie Swanson [ 26/Jun/17 ] |

Hi everyone, I just wanted to post an update on the current status of this ticket.

As a summary, my commit that went into the 3.5.8 release attempted to address this problem by partitioning the lock into many locks, since we usually only need to interact with one entry in these maps/sets. There was some concern that doing so didn't actually help, or maybe actually hurt performance, since the mutexes may lie on the same cache line.

Here's a branch with some of the work we're evaluating. Any information you can provide about whether this approach works would be much appreciated.

tdeneau, this change was motivated by recent work to move all aggregation cursors to the global cursor manager.
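To make the partitioning idea concrete, here is a minimal, self-contained sketch of the general technique under discussion. The class and method names are hypothetical and this is not the actual Partitioned template or CursorManager code: the map is split into N shards, each guarded by its own mutex, and a key always hashes to the same shard, so operations on different cursors usually contend on different mutexes.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <mutex>
#include <unordered_map>

// Hypothetical sketch of a lock-partitioned map. Each partition has its own
// mutex and its own map; a key is always routed to the same partition, so
// single-entry operations (register/deregister/lookup) only take one of the
// N mutexes instead of one global mutex.
template <typename Key, typename Value, std::size_t kNumPartitions = 256>
class PartitionedMap {
public:
    void insert(const Key& key, Value value) {
        Partition& p = partitionFor(key);
        std::lock_guard<std::mutex> lk(p.mutex);
        p.map[key] = std::move(value);
    }

    bool erase(const Key& key) {
        Partition& p = partitionFor(key);
        std::lock_guard<std::mutex> lk(p.mutex);
        return p.map.erase(key) > 0;
    }

    bool contains(const Key& key) {
        Partition& p = partitionFor(key);
        std::lock_guard<std::mutex> lk(p.mutex);
        return p.map.count(key) > 0;
    }

private:
    struct Partition {
        std::mutex mutex;
        std::unordered_map<Key, Value> map;
    };

    // Hash the key to pick its owning partition.
    Partition& partitionFor(const Key& key) {
        return _partitions[std::hash<Key>{}(key) % kNumPartitions];
    }

    std::array<Partition, kNumPartitions> _partitions;
};
```

The trade-off is that an operation that must see every entry (for example, timing out all idle cursors) has to lock every partition. Also, if the per-partition mutexes are packed contiguously as in this sketch, adjacent partitions can share a cache line, which is exactly the false-sharing concern raised in the comments above; raising the partition count (as was tried with the earlier prototype patches on this ticket, going from 8 to 256) reduces the chance that two concurrent operations map to the same partition.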
| Comment by Tom Deneau [ 19/Jun/17 ] |

I am a latecomer to this issue, but I am also interested in anything that helps scaling on hardware with lots of cores. We also noticed the poor scaling with the CursorManager registerExecutor and deregisterExecutor. I can confirm that building from master helps the score in our read-only ycsb benchmark runs and that the Executors lock overhead is significantly reduced. But as was mentioned elsewhere, there appear to be other scaling issues beyond the Executors lock which now dominate. Is the Partitioned template solution planned to be used for any other lock scaling issues?
| Comment by Jim Van Fleet [ 14/Jun/17 ] |

Throughput on the master is now about the same as 3.4. There is no cursor manager lock contention to speak of. This can be closed (if it is not already closed).
| Comment by Jim Van Fleet [ 09/Jun/17 ] |

I incorrectly reported that the cursor manager patch decreased performance. It is actually about the same as 3.5 base. There is still a regression between 3.4 base and 3.5 base.
| Comment by Jim Van Fleet [ 05/Jun/17 ] |

Changed some benchmark parameters (thread count) and I seem to have a stable 260K ops with the cursor manager change – so the master w/cursor manager is about the same as base 3.5. Locking differences between 3.5 and 3.5 master with cursor manager: the cursor manager change was successful in that it practically eliminated the CursorManager lock.

0.02% 0.01% conn39 libpthread-2.22.so [.] pthread_mutex_unlock (3.5 locking)
0.02% 0.01% conn101 libpthread-2.22.so [.] pthread_mutex_lock (3.5 with cursor manager patch)
| Comment by Jim Van Fleet [ 30/May/17 ] |

When I run performance tests with ycsb, 50% read/50% write, I get the following nominal throughput results:

3.4 base – about 280K ops

I think my set-up is the same for each – I have one of two conclusions:

I will be available to provide performance data ...

Jim
| Comment by Githook User [ 26/May/17 ] |

Author: Charlie Swanson <charlie.swanson@mongodb.com> (cswanson310)
Message:
| Comment by David Storch [ 20/Oct/16 ] |

Hi jvf,

Correct, this issue is currently on our backlog, meaning that it is on the roadmap but is not currently being worked on. We are not treating scalability improvements to the read path as a gating issue for the 3.4.0 release. Unfortunately, this means that strictly required work (e.g. ensuring a smooth user experience for the 3.2=>3.4 upgrade) has taken precedence. We are interested in making some related improvements to cursor management.

Best,
| Comment by Jim Van Fleet [ 17/Oct/16 ] |

I have not touched base on this in a while – what is the current status? (I see backlog, but have y'all made progress?)
| Comment by David Storch [ 17/Jun/16 ] |

Hi jvf,

Thanks for providing the additional data. I believe your results demonstrate that the patch I provided earlier eliminates the contention issue within the CursorManager. However, they do clearly show that there is contention on the ShardingState lock. It appears that in order to improve the scalability of the read path, there may be several bottlenecks to eliminate, both in the query and sharding subsystems. Latching done in mongos or for recording diagnostic statistics on mongod could also potentially be problematic once the CursorManager and ShardingState bottlenecks are fixed.

While we take scalability issues seriously, our team is currently focused on reaching feature completeness for 3.4 features. As time permits, we plan to revisit performance issues and do further performance testing post feature-complete. Thanks very much for your time and assistance running tests for this issue! For now I am putting this ticket back on the query team's backlog, but please do continue to watch for updates.

Best,
| Comment by Jim Van Fleet [ 14/Jun/16 ] |

Perf annotate, report graph, and report flat are in the attached gzip file – let me know if you want more.
| Comment by David Storch [ 14/Jun/16 ] |

Hi jvf,

Interesting, thanks for running the experiment. This suggests to me that we fixed the problem with CursorManager::_mutex, resulting in a new performance bottleneck while reading an in-memory data structure containing sharding-related metadata. I will have to investigate whether there is anything that can be done to reduce the amount of time spent in the ShardingState data structure's critical sections, or whether there is an opportunity to apply a similar partitioning strategy for synchronizing access to ShardingState. If you could share the profile data you collected, I would be interested in taking a look.

Best,
| Comment by Jim Van Fleet [ 13/Jun/16 ] |

Ran the tests yesterday and today. Full spreadsheet attached. Essentially no change (slight improvement at 320 threads) – better than base but still negative scaling.

I also got some performance data – profile and trace. The profile indicates about 30% of the cpu in raw_spin_lock. A perf graph snippet looks like:

and

This, of course, indicates heavy contention for pthread_mutex_lock. This was a test where ycsb had 240 threads – vmstat shows an average of something like 100 threads and idle time of 35+%. There is lots of data – happy to share.
| Comment by Jim Van Fleet [ 10/Jun/16 ] |

Yes, I can re-run the tests. The machine is not available today, but I can probably get it on Monday.
| Comment by David Storch [ 10/Jun/16 ] |

Hi jvf,

Thanks very much for doing the work to produce the perf numbers. Your results are pretty interesting. As a side note, could you please directly comment on this ticket for future communication about this issue? I have been alerted to the fact that messages to the JIRA email queue can sometimes get lost. This will also help to preserve context about our investigation for future reference and for other users who may be interested in the technical details of this issue.

Based on the amount of concurrency in your test, it seems very plausible that the number of partitions in the initial prototype is far too low to make that much of a difference. I've produced a second prototype build, very similar to the first, which has 256 partitions instead of 8. If 256 partitions doesn't do the trick, then we may have to re-evaluate our approach. The new build is available here:

Would it be possible to rerun your tests? I have also attached file server21745_2.patch with the corresponding code changes, based on top of revision 3ee21ae53b1aedf0820e627f37cf502871e7a0d2 of the master branch.

Best,
| Comment by David Storch [ 02/Jun/16 ] |

Hi jvf,

Sure, no problem. I've attached file server21754.patch to this ticket. The patch applies cleanly as of revision 656480dfdb308884868a23ac41f9708b1b376431 of the master branch. Let me know if you need anything else from me for your testing.

Best,
| Comment by Jim Van Fleet [ 02/Jun/16 ] |

Hi, David – Moving along on testing your change. You did not send the source for the change.

Jim
| Comment by Jim Van Fleet [ 24/May/16 ] |

Finally got the hardware – getting it set up now – will report progress when it is ready.
| Comment by Jim Van Fleet [ 21/May/16 ] |

Hi – I will have a projection of when I will get the machines on Monday.

Jim
| Comment by David Storch [ 19/May/16 ] |

Hi jvf, thanks for the update. There's no rush. I'll put this ticket in a "waiting" state, pending the perf testing numbers.
| Comment by Jim Van Fleet [ 18/May/16 ] |

Hi Dave – Right now, I do not have the configuration we did the original runs on ... So yes, I want to do this test ... but it will take me some time to get the hardware.

Jim
| Comment by David Storch [ 18/May/16 ] |

Hi jvf,

I have a proposed patch in code review which implements a partitioning strategy in order to alleviate the CursorManager lock performance bottleneck. On our performance testing infrastructure, the patch doesn't show any major benefits, though I believe that the scaling problem is likely to be more severe on your hardware. Therefore, would you be willing to test a build of this patch? The Enterprise RHEL PPC64LE binaries are available for download here:

I'd be very curious to see the results of running the same YCSB workload against this build when you have them!

Best,
| Comment by James Wahlin [ 08/Apr/16 ] |

Hi Jim,

An assignee of "Backlog" just means that no one is actively working on it at this point in time. Fix Version reflects our delivery target, which is currently during the 3.3 development cycle for release in MongoDB 3.4.

Thanks,
| Comment by Jim Van Fleet [ 08/Apr/16 ] |

Does "Backlog" mean this is no longer being worked on?
| Comment by David Storch [ 03/Dec/15 ] |

Hi jvf,

Thanks for the report. We actually noticed this during perf testing for the 3.2 release, although I believe this problem affects 3.0.x and 2.6.x versions as well. We have a few ideas for ways to fix this, so stay tuned. For the time being, I've moved the ticket into "Needs Triage" state.

Best,