[SERVER-26979] Degraded performance in 3.2.10 Created: 10/Nov/16  Updated: 21/Jun/17  Resolved: 24/May/17

Status: Closed
Project: Core Server
Component/s: Performance, Replication, WiredTiger
Affects Version/s: 3.2.5, 3.2.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Arndt Königsfeld Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File eviction-1.png     PNG File replication_lag.png    
Operating System: ALL
Steps To Reproduce:

Upgrade MongoDB from 3.2.1 to 3.2.10, or upgrade MongoDB from 3.2.1 to 3.2.5

Participants:

 Description   

In version 3.2.10 we had major replication and performance problems.

Symptoms:

  • Replication too slow to get in sync
  • Overall poor performance
Details:

After upgrading from 3.2.1 to 3.2.10 we encountered massive replication problems (the oplog last event time was advancing very slowly, and seconds behind master kept increasing).
We have a simple 2-node replica set (+ 1 arbiter).
We upgraded the mongo slave first and replication was still fine. (Slave: 3.2.10, Master: 3.2.1)
After switching the slave to master we encountered the problems described above. (Slave: 3.2.1, Master: 3.2.10) We upgraded the remaining slave and it did not get better. (Slave: 3.2.10, Master: 3.2.10)
We encountered similar problems with 3.2.5 some months ago.
After searching the bug reports we found SERVER-24775, which sounds familiar.

Workload:

  • oplog volume of 150-170 GB per hour
  • 800-20,000 insert statements per minute
  • 70-600 reads per minute
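At this churn rate the configured oplog (oplogSizeMB: 1024000) only covers a few hours of history, which matters once replication starts lagging. A rough back-of-the-envelope check, using only the figures reported above:

```python
# Rough oplog-window estimate from the figures reported in this ticket.
# oplogSizeMB: 1024000 is from mongod.conf; 150-170 GB/hour is the reported churn.
oplog_size_gb = 1024000 / 1024          # configured oplog size: 1000 GB
churn_gb_per_hour = (150, 170)          # reported oplog volume per hour

# Hours of history the oplog can hold at each churn rate.
window_hours = [oplog_size_gb / rate for rate in churn_gb_per_hour]
print([round(h, 1) for h in window_hours])  # [6.7, 5.9]
```

So a secondary that falls more than roughly six hours behind can no longer catch up from the oplog and would need a full resync.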

cat /etc/mongod.conf
# mongod.conf
 
storage:
  dbPath: /mongo/mongodb
  engine: wiredTiger
 
systemLog:
  destination: file
  logAppend: true
  path: /mongo/log/mongod.log
  quiet: true
 
replication:
  replSetName: rs0
  oplogSizeMB: 1024000

Node Hardware:
RAM: 378G
 
Storage: SSD
 
Architecture:               x86_64
CPU op-mode(s):       32-bit, 64-bit
Byte Order:                Little Endian
CPU(s):                      40
On-line CPU(s) list:    0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):                   2
NUMA node(s):          2
Vendor ID:                  GenuineIntel
CPU family:                6
Model:                        62
Stepping:                   4
CPU MHz:                 2493.704
BogoMIPS:                4988.68
Virtualization:             VT-x
L1d cache:                32K
L1i cache:                 32K
L2 cache:                  256K
L3 cache:                  25600K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39

Problem:

2016-11-09T10:18:48.657+0100 I COMMAND  [conn24795] command local.oplog.rs command: getMore { getMore: 14390826614, collection: "oplog.rs", maxTimeMS: 5000, term: 193, lastKnownCommittedOpTime: { ts: Timestamp 1478679650000|537, t: 193 } } planSummary: COLLSCAN cursorid:14390826614 keysExamined:0 docsExamined:612 keyUpdates:0 writeConflicts:0 numYields:21 nreturned:612 reslen:4196209 locks:{ Global: { acquireCount: { r: 44 } }, Database: { acquireCount: { r: 22 } }, oplog: { acquireCount: { r: 22 } } } protocol:op_command 4281ms
2016-11-09T10:18:51.886+0100 I COMMAND  [conn24795] command local.oplog.rs command: getMore { getMore: 14390826614, collection: "oplog.rs", maxTimeMS: 5000, term: 193, lastKnownCommittedOpTime: { ts: Timestamp 1478679650000|1102, t: 193 } } planSummary: COLLSCAN cursorid:14390826614 keysExamined:0 docsExamined:600 keyUpdates:0 writeConflicts:0 numYields:21 nreturned:600 reslen:4197017 locks:{ Global: { acquireCount: { r: 44 } }, Database: { acquireCount: { r: 22 } }, oplog: { acquireCount: { r: 22 } } } protocol:op_command 3220ms
2016-11-09T10:18:57.626+0100 I COMMAND  [conn24795] command local.oplog.rs command: getMore { getMore: 14390826614, collection: "oplog.rs", maxTimeMS: 5000, term: 193, lastKnownCommittedOpTime: { ts: Timestamp 1478679650000|1714, t: 193 } } planSummary: COLLSCAN cursorid:14390826614 keysExamined:0 docsExamined:343 keyUpdates:0 writeConflicts:0 numYields:15 nreturned:343 reslen:2421780 locks:{ Global: { acquireCount: { r: 32 } }, Database: { acquireCount: { r: 16 } }, oplog: { acquireCount: { r: 16 } } } protocol:op_command 5723ms
2016-11-09T10:19:01.426+0100 I COMMAND  [conn24795] command local.oplog.rs command: getMore { getMore: 14390826614, collection: "oplog.rs", maxTimeMS: 5000, term: 193, lastKnownCommittedOpTime: { ts: Timestamp 1478679650000|2314, t: 193 } } planSummary: COLLSCAN cursorid:14390826614 keysExamined:0 docsExamined:520 keyUpdates:0 writeConflicts:0 numYields:14 nreturned:520 reslen:4200801 locks:{ Global: { acquireCount: { r: 30 } }, Database: { acquireCount: { r: 15 } }, oplog: { acquireCount: { r: 15 } } } protocol:op_command 3794ms
2016-11-09T10:19:02.278+0100 I COMMAND  [conn24795] command local.oplog.rs command: getMore { getMore: 14390826614, collection: "oplog.rs", maxTimeMS: 5000, term: 193, lastKnownCommittedOpTime: { ts: Timestamp 1478679651000|195, t: 193 } } planSummary: COLLSCAN cursorid:14390826614 keysExamined:0 docsExamined:606 keyUpdates:0 writeConflicts:0 numYields:11 nreturned:606 reslen:4198613 locks:{ Global: { acquireCount: { r: 24 } }, Database: { acquireCount: { r: 12 } }, oplog: { acquireCount: { r: 12 } } } protocol:op_command 828ms
2016-11-09T10:19:07.764+0100 I COMMAND  [conn24795] command local.oplog.rs command: getMore { getMore: 14390826614, collection: "oplog.rs", maxTimeMS: 5000, term: 193, lastKnownCommittedOpTime: { ts: Timestamp 1478679651000|715, t: 193 } } planSummary: COLLSCAN cursorid:14390826614 keysExamined:0 docsExamined:467 keyUpdates:0 writeConflicts:0 numYields:17 nreturned:467 reslen:2913761 locks:{ Global: { acquireCount: { r: 36 } }, Database: { acquireCount: { r: 18 } }, oplog: { acquireCount: { r: 18 } } } protocol:op_command 5470ms
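The slowdown is visible directly in the trailing duration of each getMore line (4281ms, 3220ms, ... versus 135-400ms in the "Normal" sample below). A minimal parsing sketch to pull those durations out of a log file; the regex is an assumption based on the 3.2-style log lines shown above, not an official log parser:

```python
import re

# Extract the trailing "<N>ms" duration from oplog getMore log lines,
# matching the format of the slow examples above.
DURATION_RE = re.compile(r"command local\.oplog\.rs .* (\d+)ms\b")

def getmore_durations(lines):
    """Return the duration in ms of each oplog getMore log line."""
    out = []
    for line in lines:
        m = DURATION_RE.search(line)
        if m:
            out.append(int(m.group(1)))
    return out

sample = ('2016-11-09T10:18:48.657+0100 I COMMAND  [conn24795] command '
          'local.oplog.rs command: getMore { ... } protocol:op_command 4281ms')
print(getmore_durations([sample]))  # [4281]
```

Feeding both log excerpts through this makes the regression easy to quantify (mean duration in the "Problem" sample is roughly 20x the "Normal" one).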

Normal:

2016-11-10T13:09:48.011+0100 I COMMAND  [conn41400] command local.oplog.rs command: getMore { getMore: 16956787344741, collection: "oplog.rs", maxTimeMS: 5000, term: 194, lastKnownCommittedOpTime: { ts: Timestamp 1478779752000|14364, t: 194 } } cursorid:16956787344741 keyUpdates:0 writeConflicts:0 numYields:6 nreturned:638 reslen:4199468 locks:{ Global: { acquireCount: { r: 14 } }, Database: { acquireCount: { r: 7 } }, oplog: { acquireCount: { r: 7 } } } protocol:op_command 135ms
2016-11-10T13:09:48.860+0100 I COMMAND  [conn41400] command local.oplog.rs command: getMore { getMore: 16956787344741, collection: "oplog.rs", maxTimeMS: 5000, term: 194, lastKnownCommittedOpTime: { ts: Timestamp 1478779753000|7831, t: 194 } } cursorid:16956787344741 keyUpdates:0 writeConflicts:0 numYields:12 nreturned:1571 reslen:4195266 locks:{ Global: { acquireCount: { r: 26 } }, Database: { acquireCount: { r: 13 } }, oplog: { acquireCount: { r: 13 } } } protocol:op_command 206ms
2016-11-10T13:09:49.302+0100 I COMMAND  [conn41400] command local.oplog.rs command: getMore { getMore: 16956787344741, collection: "oplog.rs", maxTimeMS: 5000, term: 194, lastKnownCommittedOpTime: { ts: Timestamp 1478779753000|20919, t: 194 } } cursorid:16956787344741 keyUpdates:0 writeConflicts:0 numYields:10 nreturned:932 reslen:4197637 locks:{ Global: { acquireCount: { r: 22 } }, Database: { acquireCount: { r: 11 } }, oplog: { acquireCount: { r: 11 } } } protocol:op_command 397ms
2016-11-10T13:09:49.604+0100 I COMMAND  [conn41400] command local.oplog.rs command: getMore { getMore: 16956787344741, collection: "oplog.rs", maxTimeMS: 5000, term: 194, lastKnownCommittedOpTime: { ts: Timestamp 1478779753000|30727, t: 194 } } cursorid:16956787344741 keyUpdates:0 writeConflicts:0 numYields:11 nreturned:1302 reslen:4198994 locks:{ Global: { acquireCount: { r: 24 } }, Database: { acquireCount: { r: 12 } }, oplog: { acquireCount: { r: 12 } } } protocol:op_command 120ms
2016-11-10T13:09:49.971+0100 I COMMAND  [conn41400] command local.oplog.rs command: getMore { getMore: 16956787344741, collection: "oplog.rs", maxTimeMS: 5000, term: 194, lastKnownCommittedOpTime: { ts: Timestamp 1478779754000|8261, t: 194 } } cursorid:16956787344741 keyUpdates:0 writeConflicts:0 numYields:8 nreturned:828 reslen:4196761 locks:{ Global: { acquireCount: { r: 18 } }, Database: { acquireCount: { r: 9 } }, oplog: { acquireCount: { r: 9 } } } protocol:op_command 180ms
2016-11-10T13:09:50.654+0100 I COMMAND  [conn41400] command local.oplog.rs command: getMore { getMore: 16956787344741, collection: "oplog.rs", maxTimeMS: 5000, term: 194, lastKnownCommittedOpTime: { ts: Timestamp 1478779754000|18849, t: 194 } } cursorid:16956787344741 keyUpdates:0 writeConflicts:0 numYields:9 nreturned:473 reslen:4201156 locks:{ Global: { acquireCount: { r: 20 } }, Database: { acquireCount: { r: 10 } }, oplog: { acquireCount: { r: 10 } } } protocol:op_command 341ms

Replication lag graph: see attachment replication_lag.png.

 Comments   
Comment by Kelsey Schubert [ 24/May/17 ]

Hi ubik,

I'm resolving this issue given the fixes mentioned above. If this is still an issue for you after upgrading, please let us know and we will continue to investigate.

Kind regards,
Thomas

Comment by Kelsey Schubert [ 04/May/17 ]

Hi ubik,

We haven't heard back from you in some time. If this is still an issue for you, would you please upgrade to the latest version of MongoDB 3.2 or 3.4, which contain the fixes mentioned above, and let us know if it resolves the problem?

Thank you,
Thomas

Comment by Kelsey Schubert [ 28/Mar/17 ]

Hi ubik,

Today, we released MongoDB 3.4.3. This release includes a number of improvements to WiredTiger. In particular, I believe the following tickets are likely to significantly improve the behavior you are observing:

  • WT-3115 improve concurrency for accessing the list of trees in cache
  • WT-3139 enhances our workload generator to stress individual tables
  • WT-3148 improves eviction efficiency with many idle trees in cache
  • WT-3150 reduces the impact of checkpoints on eviction efficiency

Users in SERVER-27700 and SERVER-27753 have seen significant improvements after upgrading to take advantage of these code changes, and our internal benchmarking has also shown promising results. In addition, MongoDB 3.4.3 captures additional diagnostic information that helps us more quickly identify the root cause of issues like this.

Would you please upgrade to MongoDB 3.4.3 and confirm if it resolves the issue?

Thank you,
Thomas

Comment by Arndt Königsfeld [ 11/Nov/16 ]

Hi Thomas,
files uploaded

Comment by Kelsey Schubert [ 10/Nov/16 ]

Hi ubik,

Certainly, I've created a secure portal for you to use. Files uploaded to this portal are visible only to MongoDB employees investigating this issue and are routinely deleted after some time.

I would also like to clarify that the diagnostic.data does not contain any user data. For more information, please take a look at DOCS-7899 for a list of commands and feel free to review the relevant code.

Kind regards,
Thomas

Comment by Arndt Königsfeld [ 10/Nov/16 ]

Hi Dan,

Can we upload the data to a private and secure location?

Comment by Daniel Pasette (Inactive) [ 10/Nov/16 ]

Hi Arndt,
Can you attach the files located in the diagnostic.data directory in your dbpath to this ticket?
