[SERVER-24668] WiredTiger falls over on linkbench when IO-bound with high concurrency Created: 20/Jun/16 Updated: 05/Apr/17 Resolved: 21/Nov/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.3.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Mark Callaghan | Assignee: | David Hows |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
HW has a fast SSD (~100k IOPS), 24 cores / 48 HW threads, CentOS Linux 4.0.9, 256G RAM; MongoDB 3.3.5 built from source and using jemalloc |
||
| Attachments: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Steps To Reproduce: | Server has 24 cores, 48 HW threads, CentOS Linux 4.0.9, a fast SSD (100k IOPS), and 256G of RAM, but only 50G is available to mongod and the OS page cache. I used MongoDB 3.3.5 compiled from source with jemalloc. The scripts are here: The command line was the following. It does a load with maxid1=1B, then 24 one-hour loops with 16 concurrent clients, then one-hour loops with 1, 4, 8, ... 40, 44, 48 concurrent clients: |
| Participants: |
| Description |
|
I am running linkbench via the scripts described below. The test does a load, 24 hours of query tests with 16 concurrent clients, then a sequence of 1-hour query tests with increasing concurrency, starting at 1 and going up to 48. The test was run with mmapv1, WiredTiger (zlib & snappy), and RocksDB (zlib & snappy). The RocksDB tests are still in progress so I don't know whether they will have a problem. The WiredTiger+snappy test finished. The WiredTiger+zlib test appears to have hung with 44 concurrent clients. Given that the server has 48 HW threads, I wonder if contention on spinlocks is the problem.

By "hang" I mean that QPS has dropped from ~1500 to something close to zero. I don't have mongostat on these servers; I will try to install it after creating this bug. PMP output, to be attached, shows all threads in eviction code. Neither the mongod log file nor the client output files have been updated for 2 hours, so I call this a hang. The "secs_running" attribute in db.currentOp() output shows ~10,000 seconds for all queries (see the sketch below for how to check this). This is the QPS for each number of concurrent clients:
|
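A minimal sketch (not part of the original report) of how the secs_running check above can be automated with pymongo; the connection URI and the 600-second threshold are assumptions for illustration:

```python
# Hedged sketch: list operations whose secs_running exceeds a threshold, so a
# stall like the one described above shows up as a handful of very old queries.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local mongod

# db.currentOp() in the shell maps to the "currentOp" command on the admin DB.
ops = client.admin.command("currentOp")

for op in ops.get("inprog", []):
    secs = op.get("secs_running")
    if secs is not None and secs > 600:  # assumed threshold: flag ops older than 10 minutes
        print(secs, op.get("op"), op.get("ns"), op.get("query") or op.get("command"))
```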
| Comments |
| Comment by David Hows [ 18/Nov/16 ] | ||||||||||||||||||
|
We’ve done a bunch of testing that shows the most recent version of MongoDB performs stably on this test case. We were never able to reproduce the symptoms you described; however, they show similarity with a problem fixed in

As mentioned, I tried a number of ways to reproduce this issue with more recent versions of MongoDB but was unable to get the particular stalls you mentioned to re-emerge. That said, while running the tests I was able to isolate a performance regression and see that there was some inconsistency in throughput in the workload. I'm happy to report that these performance problems look to be either minimised or resolved within the MongoDB 3.4 releases.

Given this, I think we have exhausted all the potential avenues for investigation on this ticket for now and will mark it as "gone away"; please re-open if you have any further questions.

Thanks, | ||||||||||||||||||
| Comment by Mark Callaghan [ 24/Jun/16 ] | ||||||||||||||||||
|
I can, but that will take a few weeks. Next week is busy for me, and I have to figure out how to build with clang or gcc 5. My build scripts are non-trivial because I can't use the default paths for gcc or libraries for binaries that are to run in production. | ||||||||||||||||||
| Comment by Bruce Lucas (Inactive) [ 24/Jun/16 ] | ||||||||||||||||||
|
Hi Mark,

Both PMP samples show a lot of threads stuck in __wt_txn_update_oldest with stack traces similar to

Thanks, | ||||||||||||||||||
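For anyone following along, a hedged sketch of how the eviction and pinned-transaction behavior described above can be watched over time by polling serverStatus; the URI is an assumption, and the exact WiredTiger stat names can differ between versions, so verify them against your own serverStatus output (missing keys simply print as None here):

```python
# Hedged sketch: sample WiredTiger cache/transaction counters relevant to
# eviction pressure and a pinned oldest transaction ID every 10 seconds.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local mongod

# Stat names vary slightly across WiredTiger versions; adjust as needed.
KEYS = [
    ("cache", "bytes currently in the cache"),
    ("cache", "tracked dirty bytes in the cache"),
    ("cache", "pages evicted by application threads"),  # name may differ by version
    ("transaction", "transaction range of IDs currently pinned"),
]

while True:
    wt = client.admin.command("serverStatus").get("wiredTiger", {})
    sample = {"%s.%s" % (sec, key): wt.get(sec, {}).get(key) for sec, key in KEYS}
    print(time.strftime("%H:%M:%S"), sample)
    time.sleep(10)
```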
| Comment by Mark Callaghan [ 22/Jun/16 ] | ||||||||||||||||||
|
Queries finished after ~47,000 seconds. The average QPS for that one-hour interval dropped from ~1800 to less than 100. The 99th-percentile response time increased by about 1.5X during that interval, so more than one query had a slow response.
| ||||||||||||||||||
| Comment by Mark Callaghan [ 20/Jun/16 ] | ||||||||||||||||||
|
secs_running is now over 17,000 seconds | ||||||||||||||||||
| Comment by Mark Callaghan [ 20/Jun/16 ] | ||||||||||||||||||
|
more metrics files | ||||||||||||||||||
| Comment by Mark Callaghan [ 20/Jun/16 ] | ||||||||||||||||||
|
more metrics files | ||||||||||||||||||
| Comment by Mark Callaghan [ 20/Jun/16 ] | ||||||||||||||||||
|
mongo.conf settings:
| ||||||||||||||||||
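As an aside (not from the original comment), the effective settings of a running mongod can also be read back with the getCmdLineOpts admin command, which is handy when comparing configurations across runs; the URI below is an assumption:

```python
# Hedged sketch: dump the configuration a running mongod is actually using,
# including options that came from the config file.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local mongod

opts = client.admin.command("getCmdLineOpts")
print(opts.get("parsed", {}).get("storage"))  # e.g. dbPath, engine, wiredTiger options
print(opts.get("argv"))                       # raw command line
```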
| Comment by Daniel Pasette (Inactive) [ 20/Jun/16 ] | ||||||||||||||||||
|
Just need the ones that correspond to the test run in question. The files should be named with a datetime. | ||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 20/Jun/16 ] | ||||||||||||||||||
|
Mark, I've created an upload portal where you can send data that's above the JIRA limit and up to 5GB per file – feel free to use it as needed. | ||||||||||||||||||
| Comment by Mark Callaghan [ 20/Jun/16 ] | ||||||||||||||||||
|
Uploaded interim and the files from the past 2 days. There are ~5 others from several days back. Let me know if you want all of them. They are big. | ||||||||||||||||||
| Comment by Daniel Pasette (Inactive) [ 20/Jun/16 ] | ||||||||||||||||||
|
Hi Mark, | ||||||||||||||||||
| Comment by Mark Callaghan [ 20/Jun/16 ] | ||||||||||||||||||
|
The test uses linkbenchx, with bin/linkbench changed to use a different main class -> com.facebook.LinkBench.LinkBenchDriver | ||||||||||||||||||
| Comment by Mark Callaghan [ 20/Jun/16 ] | ||||||||||||||||||
|
PMP when db.serverStatus() hangs | ||||||||||||||||||
| Comment by Mark Callaghan [ 20/Jun/16 ] | ||||||||||||||||||
|
Tried to provide output from db.serverStatus() but that command hangs. Uploading PMP from that hang. | ||||||||||||||||||
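A hedged sketch of one way to keep a stats collector from blocking indefinitely when serverStatus stalls like this: put client-side timeouts on the connection so the call fails fast instead of hanging. The URI and the 5-second timeouts are assumptions:

```python
# Hedged sketch: request serverStatus with client-side timeouts so a monitoring
# loop reports a failure quickly instead of hanging with the server.
from pymongo import MongoClient
from pymongo.errors import PyMongoError

client = MongoClient(
    "mongodb://localhost:27017",     # assumed local mongod
    socketTimeoutMS=5000,            # give up on a stalled serverStatus after 5s
    serverSelectionTimeoutMS=5000,
)

try:
    ss = client.admin.command("serverStatus")
    print("uptime:", ss.get("uptime"), "connections:", ss.get("connections"))
except PyMongoError as exc:
    print("serverStatus did not respond in time:", exc)
```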
| Comment by Mark Callaghan [ 20/Jun/16 ] | ||||||||||||||||||
|
Two files:
|