Core Server / SERVER-24668

WiredTiger falls over on linkbench when IO-bound with high concurrency

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 3.3.5
    • Component/s: WiredTiger
    • Labels: None
    • Environment:
      HW has fast SSD (~100k IOPS), 24 cores / 48 HW threads, CentOS 4.0.9, 256G RAM, MongoDB 3.3.5 built from source and using jemalloc
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL

      Server has 24 cores, 48 HW threads, CentOS Linux 4.0.9, fast SSD (100k IOPS), and 256G of RAM, but only 50G is available to mongod and the OS page cache.

      I used MongoDB 3.3.5 compiled from source and jemalloc.

      The scripts are here:
      https://github.com/mdcallag/mytools/tree/master/bench/run_linkbench.

      The command line was the following. It does a load with maxid1=1B, then 24 1-hour loops with 16 concurrent clients, then 1-hour loops with 1, 4, 8, ... 40, 44, 48 concurrent clients:
      bash all.sh wt.1b.zlib /data/mysql/mo335/bin/mongo /data/mysql/mo335/data 1000000001 md2 16 3600 mongo ignore 24 1 4 8 12 16 20 24 28 32 36 40 44 48 &


      I am running linkbench via the scripts described below. The test does a load, then 24 hours of query tests with 16 concurrent clients, then a sequence of 1-hour query tests with concurrency increasing from 1 to 48.

      The test was run with mmapv1, WiredTiger (zlib & snappy), and RocksDB (zlib & snappy). The RocksDB tests are still in progress, so I don't know whether they will have a problem. The WiredTiger+snappy test finished. The WiredTiger+zlib test appears to have hung with 44 concurrent clients. Given that the server has 48 HW threads, I wonder if contention on spinlocks is the problem.

      By "hang" I mean that QPS has dropped from ~1500 to something close to zero. I don't have mongostat on these servers; I will try to install it after creating this bug. The PMP output, to be attached, shows all threads in eviction code. Neither the mongod log file nor the client output files have been updated for 2 hours, so I call this a hang. The "secs_running" attribute in db.currentOp() output shows ~10,000 seconds for all queries.
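
      For reference, operations stuck like this can be listed from the mongo shell by filtering db.currentOp() output on "secs_running" (a minimal sketch; the one-hour cutoff is an arbitrary illustrative threshold):

      // List in-progress operations that have been running for more than an hour.
      db.currentOp().inprog
        .filter(function (op) { return op.secs_running && op.secs_running > 3600; })
        .forEach(function (op) {
          print(op.opid + "  " + op.op + "  " + op.ns + "  " + op.secs_running + "s");
        });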

      This is the QPS for each number of concurrent clients:

      clients:    1    4    8   12   16   20   24   28   32   36   40   44   48
      snappy:  1137 4281 4032 3246 3199 3038 2918 2815 2802 2839 2722 2751 2732
      zlib:     651 2400 2312 2085 2014 1847 1878 1826 1802 1465 1556    x    x
      

      Attachments:

        1. h.fail
          34 kB
        2. h.stathang
          31 kB
        3. metrics.2016-06-11T23-00-02Z-00000
          10.00 MB
        4. metrics.2016-06-13T05-51-56Z-00000
          9.99 MB
        5. metrics.2016-06-14T15-01-56Z-00000
          9.98 MB
        6. metrics.2016-06-16T00-06-56Z-00000
          9.99 MB
        7. metrics.2016-06-17T09-11-56Z-00000
          9.98 MB
        8. metrics.2016-06-18T18-11-56Z-00000
          10.00 MB
        9. metrics.2016-06-19T15-19-46Z-00000
          9.96 MB
        10. metrics.2016-06-20T06-54-46Z-00000
          6.55 MB
        11. metrics.interim
          12 kB
        12. o.curop
          44 kB

            Assignee:
            David Hows (david.hows)
            Reporter:
            Mark Callaghan (mdcallag)
            Votes:
            0
            Watchers:
            12
