[SERVER-24668] WiredTiger falls over on linkbench when IO-bound with high concurrency Created: 20/Jun/16  Updated: 05/Apr/17  Resolved: 21/Nov/16

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.3.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Mark Callaghan Assignee: David Hows
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

HW has a fast SSD (~100k IOPS), 24 cores / 48 HW threads, CentOS (Linux kernel 4.0.9), 256G RAM, MongoDB 3.3.5 built from source and using jemalloc


Attachments: File h.fail     File h.stathang     File metrics.2016-06-11T23-00-02Z-00000     File metrics.2016-06-13T05-51-56Z-00000     File metrics.2016-06-14T15-01-56Z-00000     File metrics.2016-06-16T00-06-56Z-00000     File metrics.2016-06-17T09-11-56Z-00000     File metrics.2016-06-18T18-11-56Z-00000     File metrics.2016-06-19T15-19-46Z-00000     File metrics.2016-06-20T06-54-46Z-00000     File metrics.interim     File o.curop    
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Server has 24 cores, 48 HW threads, CentOS (Linux kernel 4.0.9), a fast SSD (~100k IOPS), and 256G of RAM, but only 50G is available to mongod and the OS page cache.

I used MongoDB 3.3.5 compiled from source and linked with jemalloc.

The scripts are here:
https://github.com/mdcallag/mytools/tree/master/bench/run_linkbench.

The command line was as follows. It does a load with maxid1=1B, then 24 one-hour loops with 16 concurrent clients, then one-hour loops with 1, 4, 8, ..., 40, 44, 48 concurrent clients:
bash all.sh wt.1b.zlib /data/mysql/mo335/bin/mongo /data/mysql/mo335/data 1000000001 md2 16 3600 mongo ignore 24 1 4 8 12 16 20 24 28 32 36 40 44 48 &

Participants:

Description

I am running linkbench via scripts described below. The test does a load, 24 hours of query tests with 16 concurrent clients, then a sequence of 1 hour query tests with increasing concurrency starting at one and increasing to 48.

The test was run with mmapv1, WiredTiger (zlib & snappy), and RocksDB (zlib & snappy). The RocksDB tests are still in progress, so I don't know whether they will have the same problem. The WiredTiger+snappy test finished. The WiredTiger+zlib test appears to have hung with 44 concurrent clients. Given that the server has 48 HW threads, I wonder whether contention on spinlocks is the problem.

By "hang" I mean that QPS has dropped from ~1500 to something close to zero. I don't have mongostat on these servers, I will try to install it after creating this bug. Looking at PMP, to be attached, shows all threads in eviction code. Neither the mongod log file, nor the client output files have been updated for 2 hours so I call this a hang. The "secs_running" attribute in db.currentOp() output shows ~10,000 seconds for all queries.

This is the QPS for each number of concurrent clients:

clients:    1     4     8    12    16    20    24    28    32    36    40    44    48
snappy:  1137  4281  4032  3246  3199  3038  2918  2815  2802  2839  2722  2751  2732
zlib:     651  2400  2312  2085  2014  1847  1878  1826  1802  1465  1556     x     x



Comments
Comment by David Hows [ 18/Nov/16 ]

We've done a bunch of testing that shows the most recent version of MongoDB performs stably on this test case. We were never able to reproduce the symptoms you described, but they show similarity with a problem fixed in WT-2560. That ticket deals with a case where a large number of threads could be stuck trying to update the oldest transaction ID in the system and wind up competing with each other to do so. The stack traces you attached show the same thing happening, with threads blocked within the __wt_txn_update_oldest function, and the workload you are using is heavily multithreaded.
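For illustration, one way to watch for that symptom from the mongo shell is to track the range of pinned transaction IDs; when the oldest transaction ID stops advancing, that range keeps growing. A minimal sketch (statistic names are from 3.2/3.4-era serverStatus() output and may differ between versions):

var t = db.serverStatus().wiredTiger.transaction;
// A steadily growing pinned range suggests the oldest transaction ID is stuck
print("pinned ID range: " + t["transaction range of IDs currently pinned"]);
print("checkpoints:     " + t["transaction checkpoints"]);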

As mentioned, I tried a number of ways to reproduce this issue with more recent versions of MongoDB but was unable to get the particular stalls you mentioned to re-emerge. That said, while running the tests I was able to isolate a performance regression and see that there was some inconsistency in throughput in the workload. I'm happy to report that these performance problems look to be either minimised or resolved within the MongoDB 3.4 releases.

Given this, I think we have exhausted the potential avenues for investigation on this ticket for now and will mark it as "gone away". Please re-open if you have any further questions.

Thanks,
David

Comment by Mark Callaghan [ 24/Jun/16 ]

I can, but that will take a few weeks. Next week is busy for me, and I have to figure out how to build with clang or gcc 5. My build scripts are non-trivial because I can't use default paths for gcc, or for libraries in binaries that are to run in production.

Comment by Bruce Lucas (Inactive) [ 24/Jun/16 ]

Hi Mark,

Both PMP samples show a lot of threads stuck in __wt_txn_update_oldest with stack traces similar to WT-2560, so I wonder if that is the cause. It's marked fixed in 3.3.8; would you be able to give that a try?

Thanks,
Bruce

Comment by Mark Callaghan [ 22/Jun/16 ]

Queries finished after ~47,000 seconds. The average QPS for that one-hour interval dropped from ~1800 to less than 100. The 99th percentile response time increased by about 1.5x during that interval, so more than one query had a slow response.

2016-06-20T23:22:36.760-0700 I COMMAND  [conn1435] command graph-linkbench.$cmd command: delete { delete: "node", ordered: true, deletes: [ { q: { _id: 74308340, type: 2048 }, limit: 0 } ] } numYields:0 reslen:60 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 3 } }, Metadata: { acquireCount: { w: 1 } } } protocol:op_query 47234253ms

Comment by Mark Callaghan [ 20/Jun/16 ]

secs_running is now over 17,000 seconds.

Comment by Mark Callaghan [ 20/Jun/16 ]

more metrics files

Comment by Mark Callaghan [ 20/Jun/16 ]

more metrics files

Comment by Mark Callaghan [ 20/Jun/16 ]

mongo.conf settings:

processManagement:
  fork: true
systemLog:
  destination: file
  path: /data/mysql/mo335/log
  logAppend: true
storage:
  syncPeriodSecs: 600
  dbPath: /data/mysql/mo335/data
  journal:
    enabled: true
  wiredTiger:
    engineConfig:
      cacheSizeGB: 10
      journalCompressor: none
    collectionConfig:
      blockCompressor: zlib
operationProfiling:
  slowOpThresholdMs: 2000
replication:
  oplogSizeMB: 4000
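To confirm the file was parsed as intended, the effective options and WiredTiger cache size can be read back at runtime. A minimal sketch (mongo shell; the cache statistic name may vary by version):

// getCmdLineOpts echoes the configuration file as parsed by mongod
printjson(db.adminCommand({getCmdLineOpts: 1}).parsed.storage);
// the WT cache section of serverStatus() reports the configured maximum in bytes
print(db.serverStatus().wiredTiger.cache["maximum bytes configured"]);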

Comment by Daniel Pasette (Inactive) [ 20/Jun/16 ]

Just need the ones that correspond to the test run in question. The files should be named with a datetime.

Comment by Ramon Fernandez Marina [ 20/Jun/16 ]

Mark, I've created an upload portal where you can send data that's above the JIRA limit, up to 5GB per file; feel free to use it as needed.

Comment by Mark Callaghan [ 20/Jun/16 ]

Uploaded interim and the files from the past 2 days. There are ~5 others from several days back. Let me know if you want all of them. They are big.

Comment by Daniel Pasette (Inactive) [ 20/Jun/16 ]

Hi Mark,
Can you attach the files located in the diagnostic.data directory?

Comment by Mark Callaghan [ 20/Jun/16 ]

The test uses linkbenchX, with bin/linkbench changed to use a different main class (com.facebook.LinkBench.LinkBenchDriver):
https://github.com/Percona-Lab/linkbenchX

Comment by Mark Callaghan [ 20/Jun/16 ]

PMP when db.serverStatus() hangs

Comment by Mark Callaghan [ 20/Jun/16 ]

Tried to provide output from db.serverStatus() but that command hangs. Uploading PMP from that hang.

Comment by Mark Callaghan [ 20/Jun/16 ]

Two files:

  • h.fail is output from PMP
  • o.curop is db.currentOp()