[SERVER-16269] WiredTiger blocks queries and updates for many seconds. Created: 21/Nov/14 Updated: 03/Dec/21 Resolved: 29/Dec/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | 2.8.0-rc0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | John Page | Assignee: | Michael Cahill (Inactive) |
| Resolution: | Duplicate | Votes: | 2 |
| Labels: | wiredtiger | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||||||||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||||||||||
| Description |
|
Comparing MMAP to WiredTiger on an 8 core server with 32GB Ram and local SSD (AWS Instance). Throwing a mix of insert/update/query at the server and wiredtiger viewed in mongostat shows really inconsistent performance - also many operations in log are poking about the 100ms mark with inserts taking >1s in some cases. I dont see this inconsistency with mmap v1. mmapv1 engine output from mongostat looks like this.
WT looks like this
|
| Comments |
| Comment by Michael Cahill (Inactive) [ 29/Dec/14 ] | ||
|
This has been split into | ||
| Comment by Daniel Pasette (Inactive) [ 23/Dec/14 ] | ||
|
We've seen that disabling Transparent Huge Pages makes the "pauses" go away in some workloads. In order to do, the user had to also change /sys/kernel/mm/transparent_hugepage/defrag to never. Keeping ticket open for further analysis. | ||
| Comment by Keith Bostic (Inactive) [ 23/Dec/14 ] | ||
|
asya, at this point, are you running with the sched_yield calls backing off to sleep, or with the the sched_yield calls just spinning? | ||
| Comment by Keith Bostic (Inactive) [ 22/Dec/14 ] | ||
|
bruce.lucas@10gen.com: all of our tests were on AWS Linux machines. | ||
| Comment by Asya Kamsky [ 21/Dec/14 ] | ||
|
robert.j.moore@allanbank.com thank you for the offer - I also can reproduce this with YCSB (and suspect a related ticket I filed is also this as a root cause). Fewer threads in fact show less of this problem but at the cost of lower overall throughput. I'm testing a couple of internal configurations and settings to determine which path may be most likely to alleviate this. | ||
| Comment by Robert Moore [ 21/Dec/14 ] | ||
|
Asya - I am seeing these same pauses using YCSB ( Where is a log with 1 connection:
The length of the pause does seem to be related to the number of connections with more connections causing longer pauses but I need to collect more data to be sure. I was originally seeing this problem with our application stress test but have not got YCSB exhibiting the pauses regularly so I can also test any log levels or candidates you may have. Rob. | ||
| Comment by Asya Kamsky [ 21/Dec/14 ] | ||
|
John, how many cores do you have relative to 128 threads? If you lower the number of threads to be closer to the number of cores (say no more than double the number of cores) how does that affect the worst latency? | ||
| Comment by John Page [ 19/Dec/14 ] | ||
|
If you mean my workload the WT cursor is hidden from the mongo API as far | ||
| Comment by John Page [ 17/Dec/14 ] | ||
|
Let me know as soon as you have a candidate fix and I can test for you. | ||
| Comment by John Page [ 16/Dec/14 ] | ||
|
The code for this is available - it's very simple - 128 processes on another box doing 16% insert, 34% update, 50% read - all on primary key. No TTL collections at all. Records are 30 fields, approx 1K | ||
| Comment by John Page [ 16/Dec/14 ] | ||
|
I've tried nojournal to disbale the write ahaead log Still seeing 15second pause for update and query at 180 seconds in and 50 second pause at 360 seconds in | ||
| Comment by John Page [ 16/Dec/14 ] | ||
|
Uploaded as requested. | ||
| Comment by John Page [ 16/Dec/14 ] | ||
|
Gzipping now - there is 1GB of it. | ||
| Comment by John Page [ 16/Dec/14 ] | ||
|
Tried nojournal to turn of WAL in WT - it's not that. | ||
| Comment by John Page [ 16/Dec/14 ] | ||
|
Ah -except logs can you give me a time range from the files you have and I think I will have the log. | ||
| Comment by John Page [ 16/Dec/14 ] | ||
|
I have many more graphs and stats I can share if you want them bruce.lucas@10gen.com | ||
| Comment by John Page [ 16/Dec/14 ] | ||
|
Logs uploaded via ssh. | ||
| Comment by John Page [ 16/Dec/14 ] | ||
|
30 minute run, 128 threads at client end, seeing Queries and Updates blocking for over a minute. | ||
| Comment by John Page [ 09/Dec/14 ] | ||
|
Also - this is default parameters, 2.8.0-RC-2 | ||
| Comment by John Page [ 05/Dec/14 ] | ||
|
I'm doing some much more extensive testing and rechecking/ruling out any effects at my end - will report back once I have some more solid info. |