[SERVER-22923] iowait increases slightly after upgrading from MongoDB 3.0 to 3.2 Created: 02/Mar/16  Updated: 15/Sep/16  Resolved: 14/Sep/16

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Maziyar Panahi Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2016-03-02 at 12.35.49.png     PNG File Screen Shot 2016-03-02 at 12.36.11.png     PNG File Screen Shot 2016-03-21 at 22.01.58.png     PNG File cache_activity.png     PNG File cpu_stats.png     Zip Archive diagnostic.data-primary.zip     Zip Archive diagnostic.data-secondary.zip     Zip Archive diagnostic.data.zip     Text File iostat-primary.log     Text File iostat-secondary.log     Zip Archive iostat30.log.zip     Zip Archive iostat32.log.zip     PNG File screenshot-cloud.mongodb.com-2016-09-15-12-30-25.png     Zip Archive ss.log.zip    
Operating System: ALL
Participants:

 Description   
Original summary

iowait, softirq and system CPU usage changed drastically after upgrading from MongoDB 3.0 to 3.2

Original description

Hello,

Last night I upgraded MongoDB from 3.0.9 to 3.2 in a rolling upgrade, and it went OK. But I have noticed, both on the machines themselves and in MMS, that some CPU usage parameters have changed drastically. This is visible in the two screenshots I attached (before and after the upgrade).

It is also noticeable when I SSH into the machine: every 20-30 seconds it freezes for a second or two.

Details:

Machines:
2x VM, Ubuntu 14.04 LTS
80 GB RAM and 10 CPU cores

Running the replica set with:
mongod --storageEngine wiredTiger --dbpath /data/ --replSet rs0 --fork --logpath /data/fork.log

MongoDB shell version: 3.2.3
rs0:PRIMARY> db.stats()
{
        "db" : "test",
        "collections" : 77,
        "objects" : 3766991607,
        "avgObjSize" : 782.8079483321238,
        "dataSize" : 2948830971260,
        "storageSize" : 1500816666624,
        "numExtents" : 0,
        "indexes" : 205,
        "indexSize" : 49011482624,
        "ok" : 1
}

Some mongostat --discover output:

 localhost:27017   *149     3    *21     *0       0    13|0     0.5   21.9       0 11.1G 10.4G   0|0   0|1    2k   130k  102 rs0  SEC 2016-03-02T12:42:10+01:00
mongodb-replica1:27017    182    *0     23     *0     134    84|0     0.4   21.3       0 10.8G 10.0G   0|0   0|0  289k   323k  112 rs0  PRI 2016-03-02T12:42:11+01:00
mongodb-replica2:27017   *149     3    *21     *0       0    13|0     0.5   21.9       0 11.1G 10.4G   0|0   0|0    2k   130k  102 rs0  SEC 2016-03-02T12:42:10+01:00
 
       localhost:27017    *98     6    *22     *0       0    12|0     0.5   21.9       0 11.1G 10.4G   0|0   0|0    2k   235k  102 rs0  SEC 2016-03-02T12:42:11+01:00
mongodb-replica1:27017     94    *0     36     *0     129    77|0     0.4   21.3       0 10.8G 10.0G   0|0   0|0  189k   200k  112 rs0  PRI 2016-03-02T12:42:12+01:00
mongodb-replica2:27017    *98     6    *22     *0       0    11|0     0.5   21.9       0 11.1G 10.4G   0|0   0|0    2k   217k  102 rs0  SEC 2016-03-02T12:42:11+01:00
 
       localhost:27017    *78    10    *30     *0       0     3|0     0.5   21.9       0 11.1G 10.4G   0|0   0|0    3k   376k  102 rs0  SEC 2016-03-02T12:42:12+01:00
mongodb-replica1:27017     72    *0     31     *0     113    64|0     0.5   21.3       0 10.8G 10.0G   1|0   0|0  153k   163k  112 rs0  PRI 2016-03-02T12:42:13+01:00
mongodb-replica2:27017    *78    10    *30     *0       0     3|0     0.5   21.9       0 11.1G 10.4G   0|0   0|0    3k   376k  102 rs0  SEC 2016-03-02T12:42:12+01:00
 
       localhost:27017    *76     6    *24     *0       0     5|0     0.5   21.9       0 11.1G 10.4G   0|0   0|0    2k   215k  102 rs0  SEC 2016-03-02T12:42:13+01:00
mongodb-replica1:27017     69    *0      4     *0      93    44|0     0.5   21.3       0 10.8G 10.0G   0|0   0|1   98k   115k  112 rs0  PRI 2016-03-02T12:42:14+01:00
mongodb-replica2:27017    *75     6    *24     *0       0     5|0     0.5   21.9       0 11.1G 10.4G   0|1   0|0    2k   215k  102 rs0  SEC 2016-03-02T12:42:13+01:00
 
       localhost:27017    *79     3     *4     *0       0     3|0     0.5   21.9       0 11.1G 10.4G   0|0   0|0  900b   107k  102 rs0  SEC 2016-03-02T12:42:14+01:00
mongodb-replica1:27017     73    *0     17     *0     102    52|0     0.5   21.3       0 10.8G 10.0G   0|0   0|0  136k   149k  112 rs0  PRI 2016-03-02T12:42:15+01:00
mongodb-replica2:27017    *80     3     *4     *0       0     4|0     0.5   21.9       0 11.1G 10.4G   0|0   0|0  979b   126k  102 rs0  SEC 2016-03-02T12:42:14+01:00
 
       localhost:27017    *69     6    *24     *0       0     7|0     0.5   21.9       0 11.1G 10.4G   0|0   0|0    2k   215k  102 rs0  SEC 2016-03-02T12:42:15+01:00
mongodb-replica1:27017     76    *0     18     *0     107    59|0     0.5   21.3       0 10.8G 10.0G   0|0   0|0  133k   667k  112 rs0  PRI 2016-03-02T12:42:16+01:00
mongodb-replica2:27017    *69     6    *24     *0       0     8|0     0.5   21.9       0 11.1G 10.4G   0|0   0|0    2k   215k  102 rs0  SEC 2016-03-02T12:42:15+01:00
 
       localhost:27017    *80     6    *11     *0       0    13|0     0.5   22.0       0 11.1G 10.4G   0|0   0|0    2k   552k  102 rs0  SEC 2016-03-02T12:42:16+01:00
mongodb-replica1:27017     76    *0     18     *0     107    59|0     0.5   21.3       0 10.8G 10.0G   0|0   0|0  133k   667k  112 rs0  PRI 2016-03-02T12:42:16+01:00
mongodb-replica2:27017    *80     6    *11     *0       0    12|0     0.0   22.0       0 11.1G 10.4G   0|0   0|0    2k   552k  102 rs0  SEC 2016-03-02T12:42:16+01:00
 
       localhost:27017    *84     3    *16     *0       0     5|0     0.5   22.0       0 11.1G 10.4G   0|0   0|0    1k   145k  102 rs0  SEC 2016-03-02T12:42:17+01:00
mongodb-replica1:27017     84    *0     14     *0     101    54|0     0.5   21.3       0 10.8G 10.0G   0|0   0|0  140k   154k  112 rs0  PRI 2016-03-02T12:42:17+01:00
mongodb-replica2:27017    *84     3    *16     *0       0     4|0     0.5   22.0       0 11.1G 10.4G   0|0   0|0  989b   127k  102 rs0  SEC 2016-03-02T12:42:17+01:00
 
       localhost:27017    *11     5     *7     *0       0     3|0     0.5   22.0       1 11.1G 10.4G   0|1   4|0  243b    19k  102 rs0  SEC 2016-03-02T12:42:19+01:00
mongodb-replica1:27017    124    *0     29     *0      35    16|0     0.5   21.3       0 10.8G 10.0G   0|0   0|0  168k   113k  112 rs0  PRI 2016-03-02T12:42:19+01:00
mongodb-replica2:27017    *11     5     *7     *0       0     4|0     0.5   22.0       1 11.1G 10.4G   0|1   4|0  322b    37k  102 rs0  SEC 2016-03-02T12:42:19+01:00
 
       localhost:27017   *184    10    *37     *0       0     4|0     0.4   22.0       0 11.1G 10.4G   0|0   0|0    4k   480k  102 rs0  SEC 2016-03-02T12:42:20+01:00
mongodb-replica1:27017     92    *0     17     *0     100    45|0     0.5   21.3       0 10.8G 10.0G   0|0   0|0  135k   229k  112 rs0  PRI 2016-03-02T12:42:20+01:00
mongodb-replica2:27017   *184    10    *37     *0       0     5|0     0.4   22.0       0 11.1G 10.4G   0|0   0|0    4k   480k  102 rs0  SEC 2016-03-02T12:42:20+01:00

Please let me know if you need more info.

Best,
Maziyar



 Comments   
Comment by Kelsey Schubert [ 14/Sep/16 ]

As Ramon explained, a lot of work went into improving the performance of MongoDB and WiredTiger in 3.2. For example, on secondaries we expect that there may be an increase in I/O load because of the way we sync journal files; this load should self-throttle to keep performance from degrading. From our analysis, the increase in iowait observed in this ticket is minor and does not impact performance. Therefore, I'm closing this ticket.

Kind regards,
Thomas

Comment by Ramon Fernandez Marina [ 21/Jun/16 ]

After looking at the data we haven't found anything abnormal, and the increase in iowait is really minimal. There was a significant amount of work that went into WiredTiger in 3.2, so it will be very hard to pinpoint the exact change that caused this behavior.

Since we haven't seen any negative impact from this increase in iowait overall, I'm lowering the priority of this ticket, but I will keep it open to attempt to reproduce this behavior locally and investigate further as time permits.

Comment by Ilya Skriblovsky [X] [ 11/May/16 ]

> As I understand, the increase in iowait is the only issue you have observed
Yes. The other symptoms (increased load average, increased disk utilization) seem to share the same root cause.

I've attached files that you have requested.

For both 3.0 and 3.2 I ran the scripts after more than 6 hours of operation, once mongod had reached its usual resident memory size.

Comment by Kelsey Schubert [ 10/May/16 ]

Hi IlyaSkriblovsky,

Thank you for reporting your observations. As I understand, the increase in iowait is the only issue you have observed when MongoDB was upgraded to 3.2.

To continue to investigate this behavior, we will need additional information. Would you please follow the steps below?

With MongoDB 3.0 running:
Can you please execute the following shell script:

delay=1
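# both commands below run in the background (&): the first samples db.serverStatus({tcmalloc:1}) every $delay seconds into ss.log,
# the second records extended per-device iostat (with timestamps, in kB) into iostat30.log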
mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep(1000*$delay)}" >ss.log &
iostat -k -t -x ${delay:?} >iostat30.log &

After running this script for an hour, please attach the following to this ticket

  • iostat30.log
  • ss.log
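
After roughly an hour you can stop the background collectors and package the logs for upload; for example (a sketch, assuming the two collectors are the only background jobs in the current shell):

kill %1 %2                          # stop the backgrounded mongo loop and iostat
zip ss.log.zip ss.log               # compress the logs before attaching them
zip iostat30.log.zip iostat30.log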

With MongoDB 3.2 running:
Can you please execute the following shell script:

delay=1
iostat -k -t -x ${delay:?} >iostat32.log &

This will collect iostat data each second, and will help us to correlate the CPU numbers to events recorded in the diagnostic.data.

After running this script for an hour, please attach the following to this ticket

  • iostat32.log
  • an archive (tar or zip) of the $dbpath/diagnostic.data directory
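
For example (a sketch; /data is the dbpath from the original report, so substitute your own dbpath):

zip iostat32.log.zip iostat32.log
zip -r diagnostic.data.zip /data/diagnostic.data    # archive the whole diagnostic.data directory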

Thank you,
Thomas

Comment by Ramon Fernandez Marina [ 06/May/16 ]

Reopening to take a closer look at recently uploaded data.

Comment by Ilya Skriblovsky [X] [ 06/May/16 ]

I've reverted to 3.0.11 to see whether iowait drops back to lower values.

Here are the results (be sure to view full images, they are wide and may be truncated):




So yes, iowait is low again. But the Cache Activity graph didn't go back to what it was before the 3.0→3.2 upgrade, so the decreased cache activity is probably not related to the MongoDB version; my DB simply needs much more time to warm up the cache.

Fortunately, I can switch between 3.0 and 3.2 relatively easily, so please let me know if any diagnostic info from 3.0 or 3.2 would be useful.

Please consider reopening this ticket, because the higher iowait certainly needs an explanation (or better, a fix!).

Comment by Ilya Skriblovsky [X] [ 06/May/16 ]

Hi all,

I am seeing exactly the same effect after upgrading from 3.0.11 to 3.2.6:



The increased iowait shows up not only in Cloud Manager's graphs, but also in the "wa" percentage in `top` and in the load average, which doubled after the upgrade.

Primary configuration:
Debian 8 (kernel 3.16.0), 8 cores of an i7-3770, 16 GB RAM, HDD; mongod is running inside a Docker container

rs0:PRIMARY> db.stats()
{
        "db" : "autogps",
        "collections" : 44,
        "objects" : 180558321,
        "avgObjSize" : 1580.2443436710955,
        "dataSize" : 285326265463,
        "storageSize" : 99755470848,
        "numExtents" : 0,
        "indexes" : 74,
        "indexSize" : 4974911488,
        "ok" : 1
}

diagnostic.data.zip: http://microline.ru/img/mongodb-jira/diagnostic.data.zip
iostat.log: http://microline.ru/img/mongodb-jira/iostat.log

Even if MongoDB 3.2 is working as expected, this seems like a considerable issue, and I am thinking of downgrading back to 3.0 because of it.

Maybe this is caused by some caching policy change in 3.2? Are there any relevant configuration parameters that can be tuned?
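
For instance, would explicitly capping the WiredTiger cache change anything? Something along these lines (just a sketch; the 8 GB figure is an arbitrary example for this 16 GB host, with all other startup options left unchanged):

mongod --wiredTigerCacheSizeGB 8 ...    # "..." stands for the existing startup options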

Comment by Kelsey Schubert [ 06/Apr/16 ]

Hi maziyar,

From the data you have uploaded we do not see anything to indicate a bug in MongoDB. When you account for the 10 cores on your machine, iowait peaked at about 3% of total CPU activity (equivalently, roughly 30% of a single core spread across 10 cores). This value falls within reasonable operating levels.

The log you have provided indicates that you may have run into SERVER-23003. A fix for this issue will be included in MongoDB 3.2.5. Please upgrade to 3.2.5 once it is released, and open a new ticket if you encounter this assertion again.

Thank you,
Thomas

Comment by Maziyar Panahi [ 21/Mar/16 ]

I forgot to mention: after the upgrade, the secondary just dies with this error and I have to start mongod again. I am not sure whether this error is related to the iowait, or to the fact that I always see jbd2/vdb-8 (in iotop) in my MongoDB instance with over 40% IO:

2016-03-21T22:49:17.832+0100 I REPL     [rsBackgroundSync] rollback 3 fixup
2016-03-21T22:49:18.835+0100 I REPL     [rsBackgroundSync] rollback 3.5
2016-03-21T22:49:18.836+0100 I REPL     [rsBackgroundSync] rollback 4 n:11
2016-03-21T22:49:18.836+0100 I REPL     [rsBackgroundSync] minvalid=(term: 19, timestamp: Mar 21 22:49:42:20)
2016-03-21T22:49:18.837+0100 I REPL     [rsBackgroundSync] rollback 4.6
2016-03-21T22:49:18.837+0100 I REPL     [rsBackgroundSync] rollback 4.7
2016-03-21T22:49:19.060+0100 I NETWORK  [initandlisten] connection accepted from 10.0.0.6:45008 #183934 (3 connections now open)
2016-03-21T22:49:19.064+0100 I REPL     [rsBackgroundSync] rollback 5 d:1379 u:2
2016-03-21T22:49:19.064+0100 I REPL     [rsBackgroundSync] rollback 6
2016-03-21T22:49:19.071+0100 I REPL     [rsBackgroundSync] rollback done
2016-03-21T22:49:19.071+0100 I NETWORK  [SyncSourceFeedback] Socket say send() errno:9 Bad file descriptor 10.0.0.5:27017
2016-03-21T22:49:19.073+0100 I REPL     [ReplicationExecutor] transition to RECOVERING
2016-03-21T22:49:19.080+0100 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update: socket exception [SEND_ERROR] for 10.0.0.5:27017
2016-03-21T22:49:19.080+0100 I REPL     [SyncSourceFeedback] updateUpstream failed: Location9001: socket exception [SEND_ERROR] for 10.0.0.5:27017, will retry
2016-03-21T22:49:19.080+0100 I REPL     [rsBackgroundSync] rollback finished
2016-03-21T22:49:19.082+0100 I REPL     [ReplicationExecutor] could not find member to sync from
2016-03-21T22:49:19.831+0100 I REPL     [ReplicationExecutor] transition to SECONDARY
2016-03-21T22:49:20.082+0100 I REPL     [ReplicationExecutor] syncing from: mongodb-replica2:27017
2016-03-21T22:49:20.091+0100 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to mongodb-replica2:27017
2016-03-21T22:49:20.117+0100 I -        [rsSync] Fatal assertion 34361 OplogOutOfOrder: Attempted to apply an earlier oplog entry (ts: Mar 21 22:49:16:25) when our lastWrittenOptime was (term: 18, timestamp: Mar 21 22:49:16:26)
2016-03-21T22:49:20.158+0100 I CONTROL  [rsSync]
 0x12f3502 0x12917d8 0x127e772 0xf2ad78 0xf208d0 0x7f5218fa1a60 0x7f52187be182 0x7f52184eb47d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"EF3502","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"E917D8","s":"_ZN5mongo10logContextEPKc"},{"b":"400000","o":"E7E772","s":"_ZN5mongo23fassertFailedWithStatusEiRKNS_6StatusE"},{"b":"400000","o":"B2AD78","s":"_ZN5mongo4repl8SyncTail16oplogApplicationEv"},{"b":"400000","o":"B208D0","s":"_ZN5mongo4repl13runSyncThreadEv"},{"b":"7F5218EF0000","o":"B1A60"},{"b":"7F52187B6000","o":"8182"},{"b":"7F52183F1000","o":"FA47D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.4", "gitVersion" : "e2ee9ffcf9f5a94fad76802e28cc978718bb7a30", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.13.0-79-generic", "version" : "#123-Ubuntu SMP Fri Feb 19 14:27:58 UTC 2016", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "EF46210F8976780D45B811C3540FECB9E734EABE" }, { "b" : "7FFF06E0C000", "elfType" : 3, "buildId" : "E37313376D77AA284BA26F3EFE65368125B2020B" }, { "b" : "7F52199DC000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "E21720F2804EF30440F2B39CD409252C26F58F73" }, { "b" : "7F5219600000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "9BC22F9457E3D7E9CF8DDC135C0DAC8F7742135D" }, { "b" : "7F52193F8000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "B376100CAB1EAC4E5DE066EACFC282BF7C0B54F3" }, { "b" : "7F52191F4000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "67699FFDA9FD2A552032E0652A242E82D65AA10D" }, { "b" : "7F5218EF0000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3, "buildId" : "D0E735DBECD63462DA114BD3F76E6EC7BB1FACCC" }, { "b" : "7F5218BEA000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "EF3F6DFFA1FBE48436EC6F45CD3AABA157064BB4" }, { "b" : "7F52189D4000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "36311B4457710AE5578C4BF00791DED7359DBB92" }, { "b" : "7F52187B6000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "AF06068681750736E0524DF17D5A86CB2C3F765C" }, { "b" : "7F52183F1000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "5382058B69031CAA9B9996C11061CD164C9398FF" }, { "b" : "7F5219C3B000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "2A816C3EBBA4E12813FBD34B06FBD25BC892A67F" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x12f3502]
 mongod(_ZN5mongo10logContextEPKc+0x138) [0x12917d8]
 mongod(_ZN5mongo23fassertFailedWithStatusEiRKNS_6StatusE+0x62) [0x127e772]
 mongod(_ZN5mongo4repl8SyncTail16oplogApplicationEv+0x1658) [0xf2ad78]
 mongod(_ZN5mongo4repl13runSyncThreadEv+0x2B0) [0xf208d0]
 libstdc++.so.6(+0xB1A60) [0x7f5218fa1a60]
 libpthread.so.0(+0x8182) [0x7f52187be182]
 libc.so.6(clone+0x6D) [0x7f52184eb47d]
-----  END BACKTRACE  -----
2016-03-21T22:49:20.159+0100 I -        [rsSync]
 
***aborting after fassert() failure

Many thanks.

Comment by Maziyar Panahi [ 21/Mar/16 ]

Hi Ramon,

Yes, the iowait is still up. I uploaded a new screenshot.

Thanks again Ramon,

Best,
Maziyar

Comment by Ramon Fernandez Marina [ 14/Mar/16 ]

maziyar, if I understand the data correctly, it seems that the CPU stats climbed up initially, and then everything went back to normal except for iowait:

Interestingly enough there's less data going into the cache:

I'm still looking at the data you uploaded to see if this change in behavior can be explained from that.

Comment by Maziyar Panahi [ 04/Mar/16 ]

iostat logs from both primary and secondary, plus the diagnostic data from both MongoDB instances

Comment by Maziyar Panahi [ 04/Mar/16 ]

Hi Thomas,

Thanks for the reply. I am running the script and will upload the iostat logs for both instances plus the diagnostic.data directory.

Thanks again,
Maziyar

Comment by Kelsey Schubert [ 02/Mar/16 ]

maziyar, please also run the following shell script:

delay=1
iostat -k -t -x ${delay:?} >iostat.log &

This will collect iostat data each second, and will help us to correlate the CPU numbers to events recorded in the diagnostic.data.

After running this script for an hour, please upload both the iostat.log and diagnostic.data to this ticket.

Thanks again,
Thomas

Comment by Kelsey Schubert [ 02/Mar/16 ]

Hi maziyar,

MongoDB 3.2 introduced a diagnostic data collection mechanism that logs server statistics at periodic intervals. To get a better idea of what is going on, can you please archive the $dbpath/diagnostic.data directory and attach it to this ticket?

Thank you,
Thomas
