Core Server / SERVER-24668

WiredTiger falls over on linkbench when IO-bound with high concurrency

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 3.3.5
    • Component/s: WiredTiger
    • Labels: None
    • Environment:
      HW has fast SSD (~100k IOPS), 24 cores / 48 HW threads, CentOS 4.0.9, 256G RAM, MongoDB 3.3.5 built from source and using jemalloc
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL

      Server has 24 cores, 48 HW threads, CentOS Linux 4.0.9, fast SSD (100k IOPS), and 256G of RAM, but only 50G is available to mongod and the OS page cache.

      I used MongoDB 3.3.5 compiled from source and jemalloc.

      The scripts are here:
      https://github.com/mdcallag/mytools/tree/master/bench/run_linkbench.

      The command line was the following. It does a load with maxid1=1B, then 24 1-hour loops with 16 concurrent clients, then 1-hour loops with 1, 4, 8, ... 40, 44, 48 concurrent clients:
      bash all.sh wt.1b.zlib /data/mysql/mo335/bin/mongo /data/mysql/mo335/data 1000000001 md2 16 3600 mongo ignore 24 1 4 8 12 16 20 24 28 32 36 40 44 48 &


      I am running linkbench via the scripts described below. The test does a load, then 24 hours of query tests with 16 concurrent clients, then a sequence of 1-hour query tests with concurrency increasing from 1 to 48.

      The test was run with mmapv1, WiredTiger (zlib & snappy), and RocksDB (zlib & snappy). The RocksDB tests are still in progress, so I don't know whether they will have a problem. The WiredTiger+snappy test finished. The WiredTiger+zlib test appears to have hung with 44 concurrent clients. Given that the server has 48 HW threads, I wonder if contention on spinlocks is the problem.

      By "hang" I mean that QPS has dropped from ~1500 to something close to zero. I don't have mongostat on these servers; I will try to install it after creating this bug. The PMP output, to be attached, shows all threads in eviction code. Neither the mongod log file nor the client output files have been updated for 2 hours, so I call this a hang. The "secs_running" attribute in db.currentOp() output shows ~10,000 seconds for all queries.
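
      For reference, operations stuck like this can be listed from the mongo shell by filtering db.currentOp() output on "secs_running" (a minimal sketch; the one-hour cutoff is an arbitrary illustrative threshold):

      // List in-progress operations that have been running for more than an hour.
      db.currentOp().inprog
        .filter(function (op) { return op.secs_running && op.secs_running > 3600; })
        .forEach(function (op) {
          print(op.opid + "  " + op.op + "  " + op.ns + "  " + op.secs_running + "s");
        });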

      This is the QPS for each number of concurrent clients:

      clients:    1    4    8   12   16   20   24   28   32   36   40   44   48
      snappy:  1137 4281 4032 3246 3199 3038 2918 2815 2802 2839 2722 2751 2732
      zlib:     651 2400 2312 2085 2014 1847 1878 1826 1802 1465 1556    x    x
      

      Attachments:

        1. h.fail
          34 kB
        2. h.stathang
          31 kB
        3. metrics.2016-06-11T23-00-02Z-00000
          10.00 MB
        4. metrics.2016-06-13T05-51-56Z-00000
          9.99 MB
        5. metrics.2016-06-14T15-01-56Z-00000
          9.98 MB
        6. metrics.2016-06-16T00-06-56Z-00000
          9.99 MB
        7. metrics.2016-06-17T09-11-56Z-00000
          9.98 MB
        8. metrics.2016-06-18T18-11-56Z-00000
          10.00 MB
        9. metrics.2016-06-19T15-19-46Z-00000
          9.96 MB
        10. metrics.2016-06-20T06-54-46Z-00000
          6.55 MB
        11. metrics.interim
          12 kB
        12. o.curop
          44 kB

            Assignee:
            David Hows (david.hows)
            Reporter:
            Mark Callaghan (mdcallag)
            Votes:
            0
            Watchers:
            12
