|
In version 3.2.10 we had major replication and perfomance problems.
Symptoms:
- Replication too slow to get in sync
- Overall poor performance
Details:
After upgrading from 3.2.1 to 3.2.10 we encountered massive replication problems (oplog last event time was changing very slow, seconds behind master increasing).
We have a simple 2-node Replication Setup (+ 1 arbiter).
We updated mongo slave first and the repliction was still fine. (Slave: 3.2.10, Master: 3.2.1)
After switching slave to master we encountered above described problems. (Slave: 3.2.1, Master: 3.2.10) We upgraded slave and it was not getting better. (Slave: 3.2.10, Master: 3.2.10)
We encountered similar problems with 3.2.5. some month ago.
After searching the bug reports we found SERVER-24775 which sounds familiar.
Workload:
- we have 150-170GB oplog per hour
- insert Statements between 800-20.000 per minute
- read 70-600 per minute.
cat /etc/mongod.conf
|
# mongod.conf
|
|
storage:
|
dbPath: /mongo/mongodb
|
engine: wiredTiger
|
|
systemLog:
|
destination: file
|
logAppend: true
|
path: /mongo/log/mongod.log
|
quiet: true
|
|
replication:
|
replSetName: rs0
|
oplogSizeMB: 1024000
|
Node Hardware:
|
RAM: 378G
|
|
Storage: SSD
|
|
Architecture: x86_64
|
CPU op-mode(s): 32-bit, 64-bit
|
Byte Order: Little Endian
|
CPU(s): 40
|
On-line CPU(s) list: 0-39
|
Thread(s) per core: 2
|
Core(s) per socket: 10
|
Socket(s): 2
|
NUMA node(s): 2
|
Vendor ID: GenuineIntel
|
CPU family: 6
|
Model: 62
|
Stepping: 4
|
CPU MHz: 2493.704
|
BogoMIPS: 4988.68
|
Virtualization: VT-x
|
L1d cache: 32K
|
L1i cache: 32K
|
L2 cache: 256K
|
L3 cache: 25600K
|
NUMA node0 CPU(s): 0-9,20-29
|
NUMA node1 CPU(s): 10-19,30-39}}
|
Problem:
2016-11-09T10:18:48.657+0100 I COMMAND [conn24795] command local.oplog.rs command: getMore { getMore: 14390826614, collection: "oplog.rs", maxTimeMS: 5000, term: 193, lastKnownCommittedOpTime: { ts: Timestamp 1478679650000|537, t: 193 } } planSummary: COLLSCAN cursorid:14390826614 keysExamined:0 docsExamined:612 keyUpdates:0 writeConflicts:0 numYields:21 nreturned:612 reslen:4196209 locks:{ Global: { acquireCount: { r: 44 } }, Database: { acquireCount: { r: 22 } }, oplog: { acquireCount: { r: 22 } } } protocol:op_command 4281ms
|
2016-11-09T10:18:51.886+0100 I COMMAND [conn24795] command local.oplog.rs command: getMore { getMore: 14390826614, collection: "oplog.rs", maxTimeMS: 5000, term: 193, lastKnownCommittedOpTime: { ts: Timestamp 1478679650000|1102, t: 193 } } planSummary: COLLSCAN cursorid:14390826614 keysExamined:0 docsExamined:600 keyUpdates:0 writeConflicts:0 numYields:21 nreturned:600 reslen:4197017 locks:{ Global: { acquireCount: { r: 44 } }, Database: { acquireCount: { r: 22 } }, oplog: { acquireCount: { r: 22 } } } protocol:op_command 3220ms
|
2016-11-09T10:18:57.626+0100 I COMMAND [conn24795] command local.oplog.rs command: getMore { getMore: 14390826614, collection: "oplog.rs", maxTimeMS: 5000, term: 193, lastKnownCommittedOpTime: { ts: Timestamp 1478679650000|1714, t: 193 } } planSummary: COLLSCAN cursorid:14390826614 keysExamined:0 docsExamined:343 keyUpdates:0 writeConflicts:0 numYields:15 nreturned:343 reslen:2421780 locks:{ Global: { acquireCount: { r: 32 } }, Database: { acquireCount: { r: 16 } }, oplog: { acquireCount: { r: 16 } } } protocol:op_command 5723ms
|
2016-11-09T10:19:01.426+0100 I COMMAND [conn24795] command local.oplog.rs command: getMore { getMore: 14390826614, collection: "oplog.rs", maxTimeMS: 5000, term: 193, lastKnownCommittedOpTime: { ts: Timestamp 1478679650000|2314, t: 193 } } planSummary: COLLSCAN cursorid:14390826614 keysExamined:0 docsExamined:520 keyUpdates:0 writeConflicts:0 numYields:14 nreturned:520 reslen:4200801 locks:{ Global: { acquireCount: { r: 30 } }, Database: { acquireCount: { r: 15 } }, oplog: { acquireCount: { r: 15 } } } protocol:op_command 3794ms
|
2016-11-09T10:19:02.278+0100 I COMMAND [conn24795] command local.oplog.rs command: getMore { getMore: 14390826614, collection: "oplog.rs", maxTimeMS: 5000, term: 193, lastKnownCommittedOpTime: { ts: Timestamp 1478679651000|195, t: 193 } } planSummary: COLLSCAN cursorid:14390826614 keysExamined:0 docsExamined:606 keyUpdates:0 writeConflicts:0 numYields:11 nreturned:606 reslen:4198613 locks:{ Global: { acquireCount: { r: 24 } }, Database: { acquireCount: { r: 12 } }, oplog: { acquireCount: { r: 12 } } } protocol:op_command 828ms
|
2016-11-09T10:19:07.764+0100 I COMMAND [conn24795] command local.oplog.rs command: getMore { getMore: 14390826614, collection: "oplog.rs", maxTimeMS: 5000, term: 193, lastKnownCommittedOpTime: { ts: Timestamp 1478679651000|715, t: 193 } } planSummary: COLLSCAN cursorid:14390826614 keysExamined:0 docsExamined:467 keyUpdates:0 writeConflicts:0 numYields:17 nreturned:467 reslen:2913761 locks:{ Global: { acquireCount: { r: 36 } }, Database: { acquireCount: { r: 18 } }, oplog: { acquireCount: { r: 18 } } } protocol:op_command 5470ms }}
|
Normal:
2016-11-10T13:09:48.011+0100 I COMMAND [conn41400] command local.oplog.rs command: getMore { getMore: 16956787344741, collection: "oplog.rs", maxTimeMS: 5000, term: 194, lastKnownCommittedOpTime: { ts: Timestamp 1478779752000|14364, t: 194 } } cursorid:16956787344741 keyUpdates:0 writeConflicts:0 numYields:6 nreturned:638 reslen:4199468 locks:{ Global: { acquireCount: { r: 14 } }, Database: { acquireCount: { r: 7 } }, oplog: { acquireCount: { r: 7 } } } protocol:op_command 135ms
|
2016-11-10T13:09:48.860+0100 I COMMAND [conn41400] command local.oplog.rs command: getMore { getMore: 16956787344741, collection: "oplog.rs", maxTimeMS: 5000, term: 194, lastKnownCommittedOpTime: { ts: Timestamp 1478779753000|7831, t: 194 } } cursorid:16956787344741 keyUpdates:0 writeConflicts:0 numYields:12 nreturned:1571 reslen:4195266 locks:{ Global: { acquireCount: { r: 26 } }, Database: { acquireCount: { r: 13 } }, oplog: { acquireCount: { r: 13 } } } protocol:op_command 206ms
|
2016-11-10T13:09:49.302+0100 I COMMAND [conn41400] command local.oplog.rs command: getMore { getMore: 16956787344741, collection: "oplog.rs", maxTimeMS: 5000, term: 194, lastKnownCommittedOpTime: { ts: Timestamp 1478779753000|20919, t: 194 } } cursorid:16956787344741 keyUpdates:0 writeConflicts:0 numYields:10 nreturned:932 reslen:4197637 locks:{ Global: { acquireCount: { r: 22 } }, Database: { acquireCount: { r: 11 } }, oplog: { acquireCount: { r: 11 } } } protocol:op_command 397ms
|
2016-11-10T13:09:49.604+0100 I COMMAND [conn41400] command local.oplog.rs command: getMore { getMore: 16956787344741, collection: "oplog.rs", maxTimeMS: 5000, term: 194, lastKnownCommittedOpTime: { ts: Timestamp 1478779753000|30727, t: 194 } } cursorid:16956787344741 keyUpdates:0 writeConflicts:0 numYields:11 nreturned:1302 reslen:4198994 locks:{ Global: { acquireCount: { r: 24 } }, Database: { acquireCount: { r: 12 } }, oplog: { acquireCount: { r: 12 } } } protocol:op_command 120ms
|
2016-11-10T13:09:49.971+0100 I COMMAND [conn41400] command local.oplog.rs command: getMore { getMore: 16956787344741, collection: "oplog.rs", maxTimeMS: 5000, term: 194, lastKnownCommittedOpTime: { ts: Timestamp 1478779754000|8261, t: 194 } } cursorid:16956787344741 keyUpdates:0 writeConflicts:0 numYields:8 nreturned:828 reslen:4196761 locks:{ Global: { acquireCount: { r: 18 } }, Database: { acquireCount: { r: 9 } }, oplog: { acquireCount: { r: 9 } } } protocol:op_command 180ms
|
2016-11-10T13:09:50.654+0100 I COMMAND [conn41400] command local.oplog.rs command: getMore { getMore: 16956787344741, collection: "oplog.rs", maxTimeMS: 5000, term: 194, lastKnownCommittedOpTime: { ts: Timestamp 1478779754000|18849, t: 194 } } cursorid:16956787344741 keyUpdates:0 writeConflicts:0 numYields:9 nreturned:473 reslen:4201156 locks:{ Global: { acquireCount: { r: 20 } }, Database: { acquireCount: { r: 10 } }, oplog: { acquireCount: { r: 10 } } } protocol:op_command 341ms
|
Replication lag graph:

|