  Core Server / SERVER-24824

Mongo 3.0.12 with MMAPv1 can't serve more than 1k qps

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 3.0.12
    • Component/s: MMAPv1, Performance
    • Labels: None
    • Operating System: ALL

      Hello All,

      The setup
      I am running a replica set in production, without sharding. All nodes are running the latest stable mongo 2.6, except one hidden node which is running mongo 3.0 with MMAPv1.

      The data
      I have around 4TB worth of data on each node (MMAPv1), with close to 7000 databases.
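
      For reference, the database count and total on-disk size per node can be checked from the mongo shell with the standard listDatabases admin command; a minimal sketch:

        // Sketch: count databases and report total size via listDatabases.
        var res = db.adminCommand({ listDatabases: 1 });
        print("databases: " + res.databases.length);
        print("total size (bytes): " + res.totalSize);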

      The plan
      I decided to upgrade to 3.2, and as an intermediate step I have to upgrade to 3.0 first. Initially I had used WiredTiger on this node, but I encountered a problem when I sent production traffic to it; the full description is in the JIRA issue SERVER-24514. To avoid being blocked on that issue, I decided to go ahead with 3.0.12 on MMAPv1 instead of WiredTiger. To start, I added the aforementioned hidden member to the existing replica set (see the sketch below) and began sending production-like read query traffic to it to check whether it would withstand that much load. I did this for over a week.
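
      As a sketch of what adding such a hidden member looks like from the mongo shell (the hostname "mongo-hidden:27017" and the _id value below are hypothetical placeholders, not the actual production values):

        // Sketch: add a hidden, priority-0 member so it never becomes
        // primary and stays invisible to normal client reads.
        rs.add({
          _id: 7,                        // hypothetical member _id
          host: "mongo-hidden:27017",    // hypothetical hostname of the 3.0.12 node
          priority: 0,
          hidden: true
        })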

      The plan was to roll out 3.0 on all secondaries if latencies and rps stayed close to the production pattern.

      The observation
      The node could not serve the read traffic at a consistent rate. It serves ~1k queries per second, shoots up to ~3.5k qps for a very brief moment, and drops back to ~1k qps (please see the Operations Per Second graph). This pattern is not observed when the same traffic is sent to the 2.6.x nodes, which serve it at ~4k qps consistently.
      In the process of understanding what exactly is happening, I ran db.currentOp() on that node (see the sketch below). I didn't find anything in particular, but some queries took ~200ms to return, and for those queries "timeAcquiringMicros" is ~198ms. According to the docs, this is the "cumulative time in microseconds that the operation had to wait to acquire the locks". I would appreciate any help here.
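
      A minimal sketch of filtering the db.currentOp() output for operations that are waiting on a lock or have been running for a while (the one-second cutoff is only illustrative):

        // Sketch: list in-progress operations that are waiting on a lock or
        // have been running for at least a second, along with their lockStats
        // (which contain the timeAcquiringMicros counters).
        db.currentOp(true).inprog.forEach(function (op) {
          if (op.waitingForLock || op.secs_running >= 1) {
            printjson({
              opid: op.opid,
              ns: op.ns,
              secs_running: op.secs_running,
              lockStats: op.lockStats
            });
          }
        });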

        1. network_activity.png (31 kB)
        2. ops_per_second.png (38 kB)
        3. pagefaults_per_sec.png (31 kB)
        4. mongo_resident_mem.png (44 kB)
        5. mongo44.log (7 kB)
        6. host_info2.txt (2 kB)
        7. host_info3.txt (2 kB)
        8. iostat2.log (3.63 MB)
        9. iostat3.log (3.49 MB)
        10. ss2.log (2.24 MB)
        11. ss3.log (19.60 MB)

            Assignee: Kelsey Schubert
            Reporter: Abhishek Amberkar
            Votes: 0
            Watchers: 5
