Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-29311

WT performance drops to zero during cache eviction

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 3.2.11, 3.2.12
    • Component/s: WiredTiger
    • Labels:
      None
    • Environment:
      FreeBSD 11, sharded, replicated.
    • ALL
    • Hide

      Configure a large cache, perform many insert+update operations, measuring throughput.

      Show
      Configure a large cache, perform many insert+update operations, measuring throughput.

      Hi all,

      We are currently benchmarking our workload against the following test environment:

      • MongoDB 3.2.12 (also tested on 3.2.11)
      • Sharded cluster (2 shards, pre-chunked, 1 shard has 1 node and the other has 2 nodes in a replica set)
      • Each node has 2x20 core CPUs and 32GB of memory
      • WiredTiger storage engine
      • We are using 2 mongos

      Our synthetic workload consists of 450 workers performing an insert, waiting for 1 to 100ms and then updating the same document (adding roughly 1k bytes) querying by the _id field. We routinely see throughput drops to 0 for both inserts and updates.

      We've been collecting various stats while running the workload generator, and have correlated the drops to 0 throughput with a declining TrackedDirtyBytesInTheCache delta (measured every second using db.serverStatus()) which is what led us to believe it was related to the cache eviction.

      During periods of 0 throughput, the mongod boxes become CPU bound (40 core boxes) - this is what led us to conclude mongo was stalling as opposed to anything on application side. See the attached graph which is a plot of throughput vs. mongod CPU usage.

      After resizing the wiredtiger cache from 19GB to 1GB (with wiredTiger.engineConfig.cacheSizeGB=1) the throughput drops are much shorter in duration but still noticeable. We also tested setting eviction_dirty_trigger=10 and eviction_dirty_target=7 (with a 19GB cache) as suggested in https://jira.mongodb.org/browse/SERVER-28488 but we still see periods of 0 throughput.

      I have uploaded diagnostic data from a mongod instance previously under ticket #28488, and below are a couple of links to various stats we've recorded during various (different) testing runs in the hope they help. I can run tests and provide any info needed over the next couple of weeks. Sorry for the dupe ticket, but it has been over a month without a response.

      To clarify, our workload generator is the only application running on a 40 CPU core box, connected via a 1Gbps LAN to mongos - it performs no IO other than talking to mongo, it's sole purpose is to load test mongo. Monitoring the network interface, we never come remotely close to maxing the link, and we're nowhere near maxing the CPU of the host at any point - we are not I/O bound as claimed in the other ticket.

      CPU + disk IO stats: https://docs.google.com/a/itsallbroken.com/spreadsheets/d/1o6N23E_Nc1dBnFC_XXs6qWAXcdKAGjlgRuGHuVEXzn0/edit?usp=sharing

      db.serverStatus() + TrackedDirtyBytesInTheCache plot (switch sheets): https://docs.google.com/a/itsallbroken.com/spreadsheets/d/1tsK8vOun4IHyfhpA7mXQ9GrO9zneoWvL2dENj2EImug/edit?usp=sharing


        1. diagnostic.data_mongo3213.tar
          123.77 MB
        2. mongo1.png
          mongo1.png
          543 kB
        3. mongo2.png
          mongo2.png
          778 kB
        4. output_mongo3213_trim.log
          73 kB
        5. run2-cache-activity.png
          run2-cache-activity.png
          420 kB

            Assignee:
            kelsey.schubert@mongodb.com Kelsey Schubert
            Reporter:
            domodwyer Dom Dwyer
            Votes:
            0 Vote for this issue
            Watchers:
            13 Start watching this issue

              Created:
              Updated:
              Resolved: