[SERVER-33537] High write IO after 3.6 upgrade Created: 28/Feb/18  Updated: 28/Feb/18  Resolved: 28/Feb/18

Status: Closed
Project: Core Server
Component/s: Write Ops
Affects Version/s: 3.6.3
Fix Version/s: None

Type: Question Priority: Minor - P4
Reporter: Tarvi Pillessaar Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: SWKB
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

hw: AWS c4.2xlarge
operating system: Ubuntu xenial
kernel: 4.4.0-92-generic #115-Ubuntu SMP Thu Aug 10 09:04:33 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
mongodb volume: 130GiB io1 1000 IOPS

conf:
storage:
  dbPath: /data/data
  journal:
    enabled: true
  engine: wiredTiger

systemLog:
  destination: file
  logAppend: true
  path: /data/log/mongod.log

net:
  port: 27017
  bindIp: 0.0.0.0

replication:
  oplogSizeMB: 512
  replSetName: foo

operationProfiling:
  slowOpThresholdMs: 1
  mode: off


Attachments: Screen Shot 2018-02-28 at 10.03.46.png, Screen Shot 2018-02-28 at 10.07.47.png, Screen Shot 2018-02-28 at 10.25.08.png, Screen Shot 2018-02-28 at 10.25.32.png, diagnostic.data.tar, flushes.png
Issue Links:
Duplicate
duplicates SERVER-31679 Increase in disk i/o for writes to re... Closed
Participants:

 Description   

Yesterday I upgraded one of our 3.4 MongoDB clusters to 3.6 (3.4.9 -> 3.6.3).
After the upgrade I noticed increased write IO (both IOPS and throughput).
I have also added 2 screenshots; the upgrade was done at 5:20.

When looking with iotop, I can see that WTOplog.lThread is writing a lot. Does that mean that 3.6 writes more oplog? Is this expected, or is it a regression?
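For reference, the iotop observation can be cross-checked per thread from /proc. A minimal sketch, assuming a Linux host, Python 3, and a placeholder MONGOD_PID that you substitute with the real mongod process id (e.g. from pgrep mongod):

    import glob
    import os

    MONGOD_PID = 12345  # hypothetical placeholder; use the real mongod PID

    # List each mongod thread together with the bytes it has written so far,
    # similar to what iotop shows per thread. Reading /proc/<pid>/task/<tid>/io
    # usually requires running as the same user as mongod or as root.
    for task in sorted(glob.glob(f"/proc/{MONGOD_PID}/task/*")):
        try:
            with open(os.path.join(task, "comm")) as f:
                name = f.read().strip()
            with open(os.path.join(task, "io")) as f:
                fields = dict(line.split(":") for line in f.read().splitlines())
            print(f"{name:<20} {int(fields['write_bytes']):>15} bytes written")
        except (OSError, ValueError, KeyError):
            continue  # thread exited mid-scan or the file was not readable

Running this twice a few seconds apart and comparing the per-thread counters shows which thread accounts for the extra write volume.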



 Comments   
Comment by Bruce Lucas (Inactive) [ 28/Feb/18 ]

Hi Tarvi,

Thanks for the data. Here's what we can see:

The rate of write operations before and after the upgrade is the same, but after the upgrade we see a much higher rate of log (journal) flushes. This is symptomatic of SERVER-31679, and typically affects workloads with a low degree of write concurrency, which appears to be the case here.

Thanks for your report. A fix for this is in progress and will be in a future version of 3.6. I'll close this ticket as a duplicate of SERVER-31679; please watch that ticket for further updates.

Thanks,
Bruce
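For reference, the elevated journal flush rate described above can also be confirmed directly from serverStatus on the affected node. A minimal sketch, assuming pymongo and a mongod listening on the default local port; the counter names used here ("log sync operations", "log write operations", "log bytes written") are WiredTiger journal statistics reported under serverStatus().wiredTiger.log and may differ slightly between versions:

    import time
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    KEYS = ("log sync operations", "log write operations", "log bytes written")
    INTERVAL = 10  # seconds between the two samples

    def log_counters():
        # WiredTiger journal ("log") counters from serverStatus
        log = client.admin.command("serverStatus")["wiredTiger"]["log"]
        return {k: log.get(k, 0) for k in KEYS}

    before = log_counters()
    time.sleep(INTERVAL)
    after = log_counters()

    for key in KEYS:
        print(f"{key}: {(after[key] - before[key]) / INTERVAL:.1f} per second")

Comparing these per-second rates before and after the upgrade (or against a 3.4 node) should show the same jump in flush frequency that the graphs and diagnostic.data do.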

Comment by Tarvi Pillessaar [ 28/Feb/18 ]

Hi Bruce,

Thank you for the quick response. I uploaded diagnostic.data and removed older metrics files to reduce the archive size; hopefully that is okay.

Tarvi

Comment by Bruce Lucas (Inactive) [ 28/Feb/18 ]

Hi Tarvi,

It is possible that you're encountering SERVER-31679. Can you please archive and upload the $dbpath/diagnostic.data directory from an affected node so we can confirm?

Thanks,
Bruce
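A minimal sketch of one way to produce that archive, assuming Python's tarfile and the dbPath /data/data from the config above (plain tar on the command line works just as well; the point is simply to capture $dbpath/diagnostic.data as a single file for upload):

    import tarfile

    DBPATH = "/data/data"  # storage.dbPath from the mongod config above

    # Pack the diagnostic.data directory into one compressed archive.
    with tarfile.open("diagnostic.data.tar.gz", mode="w:gz") as archive:
        archive.add(f"{DBPATH}/diagnostic.data", arcname="diagnostic.data")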

Comment by Tarvi Pillessaar [ 28/Feb/18 ]

Added 2 additional screenshots with a longer timeframe.
