Degraded performance in 3.2.10

XMLWordPrintableJSON

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • None
    • Affects Version/s: 3.2.5, 3.2.10
    • None
    • ALL
    • Hide

      Upgrade MongoDB to 3.2.10 from 3.2.1, or Upgrade MongoDB to 3.2.5 from 3.2.1

      Show
      Upgrade MongoDB to 3.2.10 from 3.2.1, or Upgrade MongoDB to 3.2.5 from 3.2.1
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      In version 3.2.10 we had major replication and perfomance problems.

      Symptoms:

      • Replication too slow to get in sync
      • Overall poor performance
      Details:

      After upgrading from 3.2.1 to 3.2.10 we encountered massive replication problems (oplog last event time was changing very slow, seconds behind master increasing).
      We have a simple 2-node Replication Setup (+ 1 arbiter).
      We updated mongo slave first and the repliction was still fine. (Slave: 3.2.10, Master: 3.2.1)
      After switching slave to master we encountered above described problems. (Slave: 3.2.1, Master: 3.2.10) We upgraded slave and it was not getting better. (Slave: 3.2.10, Master: 3.2.10)
      We encountered similar problems with 3.2.5. some month ago.
      After searching the bug reports we found SERVER-24775 which sounds familiar.

      Workload:

      • we have 150-170GB oplog per hour
      • insert Statements between 800-20.000 per minute
      • read 70-600 per minute.
      cat /etc/mongod.conf
      # mongod.conf
      
      storage:
        dbPath: /mongo/mongodb
        engine: wiredTiger
      
      systemLog:
        destination: file
        logAppend: true
        path: /mongo/log/mongod.log
        quiet: true
      
      replication:
        replSetName: rs0
        oplogSizeMB: 1024000
      
      Node Hardware:
      RAM: 378G
      
      Storage: SSD
      
      Architecture:               x86_64
      CPU op-mode(s):       32-bit, 64-bit
      Byte Order:                Little Endian
      CPU(s):                      40
      On-line CPU(s) list:    0-39
      Thread(s) per core:    2
      Core(s) per socket:    10
      Socket(s):                   2
      NUMA node(s):          2
      Vendor ID:                  GenuineIntel
      CPU family:                6
      Model:                        62
      Stepping:                   4
      CPU MHz:                 2493.704
      BogoMIPS:                4988.68
      Virtualization:             VT-x
      L1d cache:                32K
      L1i cache:                 32K
      L2 cache:                  256K
      L3 cache:                  25600K
      NUMA node0 CPU(s):     0-9,20-29
      NUMA node1 CPU(s):     10-19,30-39}}
      
      Problem:
      2016-11-09T10:18:48.657+0100 I COMMAND  [conn24795] command local.oplog.rs command: getMore { getMore: 14390826614, collection: "oplog.rs", maxTimeMS: 5000, term: 193, lastKnownCommittedOpTime: { ts: Timestamp 1478679650000|537, t: 193 } } planSummary: COLLSCAN cursorid:14390826614 keysExamined:0 docsExamined:612 keyUpdates:0 writeConflicts:0 numYields:21 nreturned:612 reslen:4196209 locks:{ Global: { acquireCount: { r: 44 } }, Database: { acquireCount: { r: 22 } }, oplog: { acquireCount: { r: 22 } } } protocol:op_command 4281ms
      2016-11-09T10:18:51.886+0100 I COMMAND  [conn24795] command local.oplog.rs command: getMore { getMore: 14390826614, collection: "oplog.rs", maxTimeMS: 5000, term: 193, lastKnownCommittedOpTime: { ts: Timestamp 1478679650000|1102, t: 193 } } planSummary: COLLSCAN cursorid:14390826614 keysExamined:0 docsExamined:600 keyUpdates:0 writeConflicts:0 numYields:21 nreturned:600 reslen:4197017 locks:{ Global: { acquireCount: { r: 44 } }, Database: { acquireCount: { r: 22 } }, oplog: { acquireCount: { r: 22 } } } protocol:op_command 3220ms
      2016-11-09T10:18:57.626+0100 I COMMAND  [conn24795] command local.oplog.rs command: getMore { getMore: 14390826614, collection: "oplog.rs", maxTimeMS: 5000, term: 193, lastKnownCommittedOpTime: { ts: Timestamp 1478679650000|1714, t: 193 } } planSummary: COLLSCAN cursorid:14390826614 keysExamined:0 docsExamined:343 keyUpdates:0 writeConflicts:0 numYields:15 nreturned:343 reslen:2421780 locks:{ Global: { acquireCount: { r: 32 } }, Database: { acquireCount: { r: 16 } }, oplog: { acquireCount: { r: 16 } } } protocol:op_command 5723ms
      2016-11-09T10:19:01.426+0100 I COMMAND  [conn24795] command local.oplog.rs command: getMore { getMore: 14390826614, collection: "oplog.rs", maxTimeMS: 5000, term: 193, lastKnownCommittedOpTime: { ts: Timestamp 1478679650000|2314, t: 193 } } planSummary: COLLSCAN cursorid:14390826614 keysExamined:0 docsExamined:520 keyUpdates:0 writeConflicts:0 numYields:14 nreturned:520 reslen:4200801 locks:{ Global: { acquireCount: { r: 30 } }, Database: { acquireCount: { r: 15 } }, oplog: { acquireCount: { r: 15 } } } protocol:op_command 3794ms
      2016-11-09T10:19:02.278+0100 I COMMAND  [conn24795] command local.oplog.rs command: getMore { getMore: 14390826614, collection: "oplog.rs", maxTimeMS: 5000, term: 193, lastKnownCommittedOpTime: { ts: Timestamp 1478679651000|195, t: 193 } } planSummary: COLLSCAN cursorid:14390826614 keysExamined:0 docsExamined:606 keyUpdates:0 writeConflicts:0 numYields:11 nreturned:606 reslen:4198613 locks:{ Global: { acquireCount: { r: 24 } }, Database: { acquireCount: { r: 12 } }, oplog: { acquireCount: { r: 12 } } } protocol:op_command 828ms
      2016-11-09T10:19:07.764+0100 I COMMAND  [conn24795] command local.oplog.rs command: getMore { getMore: 14390826614, collection: "oplog.rs", maxTimeMS: 5000, term: 193, lastKnownCommittedOpTime: { ts: Timestamp 1478679651000|715, t: 193 } } planSummary: COLLSCAN cursorid:14390826614 keysExamined:0 docsExamined:467 keyUpdates:0 writeConflicts:0 numYields:17 nreturned:467 reslen:2913761 locks:{ Global: { acquireCount: { r: 36 } }, Database: { acquireCount: { r: 18 } }, oplog: { acquireCount: { r: 18 } } } protocol:op_command 5470ms }}
      
      Normal:
      2016-11-10T13:09:48.011+0100 I COMMAND  [conn41400] command local.oplog.rs command: getMore { getMore: 16956787344741, collection: "oplog.rs", maxTimeMS: 5000, term: 194, lastKnownCommittedOpTime: { ts: Timestamp 1478779752000|14364, t: 194 } } cursorid:16956787344741 keyUpdates:0 writeConflicts:0 numYields:6 nreturned:638 reslen:4199468 locks:{ Global: { acquireCount: { r: 14 } }, Database: { acquireCount: { r: 7 } }, oplog: { acquireCount: { r: 7 } } } protocol:op_command 135ms
      2016-11-10T13:09:48.860+0100 I COMMAND  [conn41400] command local.oplog.rs command: getMore { getMore: 16956787344741, collection: "oplog.rs", maxTimeMS: 5000, term: 194, lastKnownCommittedOpTime: { ts: Timestamp 1478779753000|7831, t: 194 } } cursorid:16956787344741 keyUpdates:0 writeConflicts:0 numYields:12 nreturned:1571 reslen:4195266 locks:{ Global: { acquireCount: { r: 26 } }, Database: { acquireCount: { r: 13 } }, oplog: { acquireCount: { r: 13 } } } protocol:op_command 206ms
      2016-11-10T13:09:49.302+0100 I COMMAND  [conn41400] command local.oplog.rs command: getMore { getMore: 16956787344741, collection: "oplog.rs", maxTimeMS: 5000, term: 194, lastKnownCommittedOpTime: { ts: Timestamp 1478779753000|20919, t: 194 } } cursorid:16956787344741 keyUpdates:0 writeConflicts:0 numYields:10 nreturned:932 reslen:4197637 locks:{ Global: { acquireCount: { r: 22 } }, Database: { acquireCount: { r: 11 } }, oplog: { acquireCount: { r: 11 } } } protocol:op_command 397ms
      2016-11-10T13:09:49.604+0100 I COMMAND  [conn41400] command local.oplog.rs command: getMore { getMore: 16956787344741, collection: "oplog.rs", maxTimeMS: 5000, term: 194, lastKnownCommittedOpTime: { ts: Timestamp 1478779753000|30727, t: 194 } } cursorid:16956787344741 keyUpdates:0 writeConflicts:0 numYields:11 nreturned:1302 reslen:4198994 locks:{ Global: { acquireCount: { r: 24 } }, Database: { acquireCount: { r: 12 } }, oplog: { acquireCount: { r: 12 } } } protocol:op_command 120ms
      2016-11-10T13:09:49.971+0100 I COMMAND  [conn41400] command local.oplog.rs command: getMore { getMore: 16956787344741, collection: "oplog.rs", maxTimeMS: 5000, term: 194, lastKnownCommittedOpTime: { ts: Timestamp 1478779754000|8261, t: 194 } } cursorid:16956787344741 keyUpdates:0 writeConflicts:0 numYields:8 nreturned:828 reslen:4196761 locks:{ Global: { acquireCount: { r: 18 } }, Database: { acquireCount: { r: 9 } }, oplog: { acquireCount: { r: 9 } } } protocol:op_command 180ms
      2016-11-10T13:09:50.654+0100 I COMMAND  [conn41400] command local.oplog.rs command: getMore { getMore: 16956787344741, collection: "oplog.rs", maxTimeMS: 5000, term: 194, lastKnownCommittedOpTime: { ts: Timestamp 1478779754000|18849, t: 194 } } cursorid:16956787344741 keyUpdates:0 writeConflicts:0 numYields:9 nreturned:473 reslen:4201156 locks:{ Global: { acquireCount: { r: 20 } }, Database: { acquireCount: { r: 10 } }, oplog: { acquireCount: { r: 10 } } } protocol:op_command 341ms
      

      Replication lag graph:

        1. eviction-1.png
          eviction-1.png
          329 kB
        2. replication_lag.png
          replication_lag.png
          13 kB

            Assignee:
            Kelsey Schubert
            Reporter:
            Arndt Königsfeld
            Votes:
            0 Vote for this issue
            Watchers:
            12 Start watching this issue

              Created:
              Updated:
              Resolved: