Core Server / SERVER-20185

Scaling issue at high connection count with journal enabled under WiredTiger

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 3.0.6, 3.1.7
    • Fix Version/s: 3.1.8
    • Component/s: WiredTiger
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL

      Description

      • 6 cores, 64 GB memory (everything fits in cache)
      • the test below issues 10 inserts/s per connection and ramps up to 10k connections, for a total expected throughput of 100k inserts/s
      • measured maximum throughput at small connection counts was 300k/s without journal and 200k/s with journal, so this test, with a maximum expected throughput of only 100k/s, does not tax the total capacity of the system; it probes the effect of high connection counts at relatively low op rates per connection
      • 3.0.6 build used is actually 3.0.6 + fixes for SERVER-20091
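
The arithmetic behind the expected throughput can be sketched as follows. This is a minimal illustration in plain JavaScript, not part of the original repro; the helper name `expectedThroughput` and its `capacity` parameter are hypothetical. The figures (10 inserts/s per connection, 10k connections, 200k/s measured maximum with journal) come from the bullets above.

```javascript
// Hypothetical helper: ideal insert rate for a given connection count,
// assuming every connection sustains its target rate until the measured
// maximum capacity of the system is reached.
function expectedThroughput(connections, insertsPerConn, capacity) {
    var ideal = connections * insertsPerConn;
    return Math.min(ideal, capacity);
}

// Figures from the description: 10 inserts/s per connection and a measured
// maximum of 200k/s with journal at small connection counts.
var atPeak = expectedThroughput(10000, 10, 200000);  // 100000 inserts/s
var headroom = 200000 - atPeak;                      // capacity left unused at 10k connections
```

Under these numbers the test should peak at half of the journaled capacity, which is why any shortfall points at connection count rather than raw throughput.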

      • expected scaling is achieved without journal (green)
      • under 3.0.6 with journal enabled, only 25% of expected throughput is reached; this is consistent from run to run (red)
      • in 3.1, 50-75% of expected throughput is reached, but there is striking run-to-run variability (yellow, blue, purple)
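
To make the percentages concrete, the sketch below converts them to absolute insert rates. It assumes the percentages apply at the full 10k connections, which the ticket does not state explicitly; the helper name `scalingEfficiency` is hypothetical.

```javascript
// Hypothetical helper: fraction of the ideal rate actually achieved.
function scalingEfficiency(connections, measuredOpsPerSec, insertsPerConn) {
    return measuredOpsPerSec / (connections * insertsPerConn);
}

// Assumption: percentages are taken at 10k connections, where the ideal
// rate is 10000 connections * 10 inserts/s = 100000 inserts/s.
var ideal = 10000 * 10;
var rate306 = 0.25 * ideal;     // 3.0.6 with journal
var rate31Low = 0.50 * ideal;   // 3.1, lower end of the observed range
var rate31High = 0.75 * ideal;  // 3.1, upper end of the observed range

// Round trip: the implied 3.0.6 rate maps back to the reported 25%.
var check = scalingEfficiency(10000, rate306, 10);
```

The same function can be applied to each `avg_conns` / `ops_per_sec` pair printed by the repro to see where scaling starts to fall off.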

      Repro code:

      function conns() {
          return db.serverStatus().connections.current
      }
       
      function ops() {
          return db.serverStatus().opcounters.insert
      }
       
      function repro(threads_insert) {
       
          // run forever
          seconds = 10000
       
          // starting stats
          last_conns = curr_conns = start_conns = conns()
          last_time = new Date()
          last_ops = ops()
       
          // loop starting new connections
          while (curr_conns < start_conns+threads_insert) {
       
              // start 10 more insert threads with a random delay around 100ms (10 inserts/second/thread)
              res = benchStart({
                  ops: [{
                      op: "insert",
                      ns: "test.c",
                      doc: {},
                      delay: NumberInt(100 + Math.random()*10-5)
                  }],
                  seconds: seconds,
                  parallel: 10,
              })
       
              // 10 new connections every 100ms
              sleep(100)
       
              // print op rate vs connections
              curr_conns = conns()
              if (curr_conns-last_conns >= 100) {
                  curr_time = new Date()
                  curr_ops = ops()
                  ops_per_sec = Math.round((curr_ops - last_ops) / ((curr_time - last_time) / 1000.0))
                  avg_conns = (last_conns+curr_conns) / 2
                  print('' + avg_conns + '\t' + ops_per_sec)
                  last_time = curr_time
                  last_ops = curr_ops
                  last_conns = curr_conns
              }
          }
       
          // run forever
          sleep(seconds*1000)
      }
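
The timing built into the repro can be checked with a little arithmetic. The sketch below is plain JavaScript and is not part of the original script; the helper names `delayBounds` and `minRampSeconds` are hypothetical, and the ramp estimate ignores the time spent inside `benchStart` and `serverStatus`, so it is only a lower bound.

```javascript
// Delay expression from the repro: 100 + Math.random()*10 - 5.
// Math.random() returns a value in [0, 1), so the delay lies in [95, 105) ms.
function delayBounds() {
    return { min: 100 + 0 * 10 - 5, max: 100 + 1 * 10 - 5 };  // max is exclusive
}

// Each loop iteration starts `parallel` connections and then sleeps for
// `sleepMs`; call overhead is ignored, so the result is a lower bound.
function minRampSeconds(targetConns, parallel, sleepMs) {
    var iterations = Math.ceil(targetConns / parallel);
    return iterations * sleepMs / 1000;
}

var bounds = delayBounds();                    // { min: 95, max: 105 }
var ramp = minRampSeconds(10000, 10, 100);     // 100 seconds to reach 10k connections
var perThreadRate = 1000 / 100;                // 10 inserts/s at the mean delay, ignoring op latency
```

Assuming the 10k-connection run described above, the script would be invoked from the mongo shell as `repro(10000)`.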
      

        Attachments

        1. journal-scaling.png (25 kB)
        2. new-journal.png (15 kB)
