[SERVER-19273] Short range scan using YCSB cause mongodb crash Created: 02/Jul/15 Updated: 25/Aug/15 Resolved: 25/Aug/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.0.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Milind Shah | Assignee: | Ramon Fernandez Marina |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | crash | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
RAM: 128GB |
||
| Operating System: | Linux | ||||||||||||||||
| Steps To Reproduce: | 1) Insert 30 million rows, each with 1 field of 100 bytes |
||||||||||||||||
| Participants: | |||||||||||||||||
| Description |
|
I am running YCSB on a 3-node setup. I inserted 30 million rows as described in the blog post https://www.mongodb.com/blog/post/performance-testing-mongodb-30-part-1-throughput-improvements-measured-ycsb. Each row has 1 field holding 100 bytes of data. After inserting the data, I ran the YCSB short range scan with a scan length of 50. During the scan my mongod process went down, and in the mongod log I saw:
|
| Comments |
| Comment by Ramon Fernandez Marina [ 29/Jul/15 ] | |||||||||||||||||||
|
I forgot to add that Cheers, | |||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 29/Jul/15 ] | |||||||||||||||||||
|
milindshah, the error message
appeared earlier in You may be able to work around it by increasing session_max, for example:
Note that this will increase memory consumption. The second alternative is to set a low timeout for idle cursors; for example, to close cursors after 60 seconds of inactivity:
Please note that setting a low value for cursorTimeoutMillis may negatively impact applications that rely on idle cursors remaining open for longer than cursorTimeoutMillis. robert.j.moore@allanbank.com, after using 3.0.3-SNAPSHOT, does the server still abort with those Cannot allocate memory errors? Thanks, | |||||||||||||||||||
| Comment by J Rassi [ 09/Jul/15 ] | |||||||||||||||||||
|
martin.bligh: could you weigh in for triage, please? I don't have much context on this issue, but Mathias suggests that this is either a dup of | |||||||||||||||||||
| Comment by Jeffrey Yemin [ 08/Jul/15 ] | |||||||||||||||||||
|
I reproduced this issue and discovered a defect in the Java driver, which is now linked to this ticket. | |||||||||||||||||||
| Comment by Robert Moore [ 08/Jul/15 ] | |||||||||||||||||||
|
I spoke too soon. The stalls are still happening with all of the threads in:
| |||||||||||||||||||
| Comment by Robert Moore [ 08/Jul/15 ] | |||||||||||||||||||
|
I just ran the same tests with the 3.0.3-SNAPSHOT jar and it did not stall during the scan operations. Could the extra calls have been preventing the connections from being returned to the connection pool? Rob. | |||||||||||||||||||
| Comment by Jeffrey Yemin [ 08/Jul/15 ] | |||||||||||||||||||
|
robert.j.moore@allanbank.com the "killcursors: found 0 of 1" messages are due to | |||||||||||||||||||
| Comment by Mark Callaghan [ 07/Jul/15 ] | |||||||||||||||||||
|
problem being discussed as next YCSB release is being tested | |||||||||||||||||||
| Comment by Asya Kamsky [ 06/Jul/15 ] | |||||||||||||||||||
|
Ah, sorry just saw 25 threads in the repro section. Will try that. | |||||||||||||||||||
| Comment by Asya Kamsky [ 06/Jul/15 ] | |||||||||||||||||||
|
What's the exact YCSB command - how many threads are you using? I'd like to reproduce this here. | |||||||||||||||||||
| Comment by Robert Moore [ 05/Jul/15 ] | |||||||||||||||||||
|
Attaching the script I used to run the tests. It downloads YCSB, builds it, and runs the workloads. Feel free to modify it to run just the mongodb driver with 5 threads and 5 connections, which shows the worst behaviour. Run with:
| |||||||||||||||||||
| Comment by Robert Moore [ 05/Jul/15 ] | |||||||||||||||||||
|
The tests finally finished and I see the same behaviour for the cursors with 3.0.4:
| |||||||||||||||||||
| Comment by Robert Moore [ 05/Jul/15 ] | |||||||||||||||||||
|
As a workaround you can use the `mongodb-async` driver with YCSB. You might also see a performance increase as a result. Just change
to
Rob. | |||||||||||||||||||
| Comment by Robert Moore [ 05/Jul/15 ] | |||||||||||||||||||
|
Logs for 2.2.7 and 2.4.14. 2.6.10 is showing the same behaviour but is still running. I don't have a stack trace for 2.4.14 since it happened in the middle of the night. | |||||||||||||||||||
| Comment by Robert Moore [ 05/Jul/15 ] | |||||||||||||||||||
|
asya - I am seeing something strange with the out of the box workloade using the MongoDB Inc driver. After having jeff.yemin look at the code I am pretty sure the cursors are getting closed, or trying, but I am not convinced they are the right cursors. I am going through all of the production MongoDB versions since 1.8 and testing the YCSB 0.2.0-RC3 and for the workloade I am getting timeouts trying to get connections. I will attach the logs as I get them but for the 2.0.9 server I see this from the script:
(Ignore the 20 to 01 hour change. I am using the wrong date expression in the script. It is 20 to 21 hours.) In the MongoDB logs for the 2015-07-04 20:54:31 period I see lots of:
It appears in the YCSB logs (which I will attach) that the driver lost track of the connections completely, which eventually caused every thread to time out waiting for a connection. I'll attach a stack trace showing that the threads are all waiting for a connection. I know that 2.0.9 is ancient, but I am seeing the same thing with 2.2.7, so I think it is something systemic. Once all of the server versions finish, I will report back on whether I see the same thing on more recent versions. Rob. | |||||||||||||||||||
| Comment by Milind Shah [ 03/Jul/15 ] | |||||||||||||||||||
|
Thanks Asya. I am using the latest YCSB from https://github.com/brianfrankcooper/YCSB. I will not use the scan test in that case. | |||||||||||||||||||
| Comment by Asya Kamsky [ 02/Jul/15 ] | |||||||||||||||||||
|
Based on the Google Groups discussion, if you want to use the "scan" operation, you should NOT be using a hashed shard key: hashing the shard key removes the ability to do efficient scans over the index. In addition, it appears that the official YCSB master may have introduced a cursor leak in the scan method, so I would recommend not including the "scan" operation in your tests (at least for the moment). Asya | |||||||||||||||||||
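To illustrate how a cursor leak in a scan can exhaust a server-side limit, here is a minimal toy sketch in Python. It is not the YCSB code, the Java driver, or the mongod implementation: `ToyServer`, `ToyCursor`, `leaky_scan`, and `safe_scan` are hypothetical names invented for this example, and the fixed cursor-slot limit only stands in loosely for a bounded server resource. It shows one general way a short scan that stops early can leave a cursor open, and how closing the cursor in all cases avoids that.

```python
from contextlib import closing
from itertools import islice


class ToyServer:
    """Hypothetical stand-in for a server with a fixed number of cursor slots."""

    def __init__(self, max_open_cursors):
        self.max_open_cursors = max_open_cursors
        self.open_cursors = 0

    def open_cursor(self, docs):
        # Refuse new cursors once every slot is taken.
        if self.open_cursors >= self.max_open_cursors:
            raise RuntimeError("no cursor slots left")
        self.open_cursors += 1
        return ToyCursor(self, docs)


class ToyCursor:
    """Hypothetical cursor that holds a server slot until close() is called."""

    def __init__(self, server, docs):
        self._server = server
        self._docs = iter(docs)
        self._closed = False

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._docs)

    def close(self):
        # Idempotent: release the slot only once.
        if not self._closed:
            self._closed = True
            self._server.open_cursors -= 1


def leaky_scan(server, docs, scan_length):
    # Reads scan_length documents and returns without closing the cursor,
    # so the slot stays occupied.
    cursor = server.open_cursor(docs)
    return list(islice(cursor, scan_length))


def safe_scan(server, docs, scan_length):
    # closing() calls cursor.close() on exit, even if the scan stops early
    # or raises an exception.
    with closing(server.open_cursor(docs)) as cursor:
        return list(islice(cursor, scan_length))
```

In this toy model, repeated calls to `safe_scan` leave `server.open_cursors` at zero, while each call to `leaky_scan` permanently consumes one slot until `open_cursor` starts failing.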
| Comment by Asya Kamsky [ 02/Jul/15 ] | |||||||||||||||||||
|
Which YCSB version are you using for these tests? |