[SERVER-25883] High iowait, slow queries, lots of free memory Created: 31/Aug/16  Updated: 13/Sep/16  Resolved: 01/Sep/16

Status: Closed
Project: Core Server
Component/s: Performance
Affects Version/s: 3.2.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Nic Cottrell (Personal) Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2016-08-31 at 12.52.39.png     PNG File Screen Shot 2016-08-31 at 15.15.19.png    
Operating System: ALL
Participants:

 Description   

As of a few days ago, one node in my cluster has suddenly started generating many slow queries. The machine shows very high iowait (sustained > 30%), but the CPU usage of mongod never exceeds 1%. The machine also shows 26GB of free memory and 0 swap used. This is the only node that shares the machine with a mysqld process, but I don't understand why mongod is not taking lots of memory like it does on all the other nodes.
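
For anyone triaging something similar, a minimal sketch of how the symptoms could be narrowed down; it assumes the sysstat package and the mongo shell are installed and that mongod listens on the default port, none of which is confirmed in this ticket:

iostat -xm 2 5        # per-device utilization and await, to see which disk is driving the iowait
pidstat -d 1 5        # per-process read/write rates, to see whether mongod or mysqld issues the I/O
free -m               # cross-check the 26GB-free / 0-swap observation
mongo --port 27017 --eval 'printjson(db.serverStatus().wiredTiger.cache)'
                      # compare "bytes currently in the cache" with "maximum bytes configured"
                      # to see whether WiredTiger is actually filling its cache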



 Comments   
Comment by Kelsey Schubert [ 13/Sep/16 ]

Hi niccottrell,

Thank you for the update! I'm glad you were able to identify that a failed drive was the cause of this issue.

Please be aware that WiredTiger has no direct visibility into what is happening underneath it when mdadm is operating in a degraded state, so WiredTiger cannot compensate for this kind of issue.

Kind regards,
Thomas

Comment by Nic Cottrell (Personal) [ 09/Sep/16 ]

I disabled mysqld completely on this machine and the problem remained, although eventually the memory usage of mongod crept up. I did in fact have a failed drive inside a mirrored software RAID. It was already marked as failed inside mdadm, so I don't see why that would cause iowait to jump from a few percent to the high 90s. In any case, after replacing the drive the iowait dropped back to near zero. I can't help thinking there may still be a bug in WiredTiger in that it doesn't handle a failed software RAID as gracefully as it could.
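
For the record, a hedged sketch of how the degraded state and the drive swap would typically look with mdadm; /dev/md2 is assumed from the jbd2/md2-8 thread in the pidstat output below, and /dev/sdb1 is only a placeholder member name:

mdadm --detail /dev/md2                        # array state shows "degraded", the failed member shows "faulty"
mdadm --manage /dev/md2 --remove /dev/sdb1     # remove the failed member (device name is a placeholder)
mdadm --manage /dev/md2 --add /dev/sdb1        # add the replacement; /proc/mdstat then shows resync progress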

Comment by Emil Burzo [ 08/Sep/16 ]

It's a big coincidence that I have the exact same issue after upgrading to MongoDB 3.2.9.

Graphs and description are here: https://groups.google.com/forum/#!starred/mongodb-user/42wOHqR_o5Q

Comment by Ramon Fernandez Marina [ 01/Sep/16 ]

I don't think there's enough evidence to indicate a bug in MongoDB; rather, something on your system is slowing MongoDB down, so I'm going to close this ticket. As you know, the SERVER project is for reporting bugs or feature suggestions for the MongoDB server itself.

Looking at the output of pidstat, I don't think the higher I/O coming from mysqld is the issue either, so my guess is that your md device may be rebuilding or doing a consistency check. I'd look at /proc/mdstat to find out more.
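
A sketch of what that check could look like; md2 is assumed from the jbd2/md2-8 entries in the pidstat output below:

cat /proc/mdstat                          # a rebuild or check shows a progress bar, speed and ETA
cat /sys/block/md2/md/sync_action         # "idle", "check", "resync" or "recover"
cat /sys/block/md2/md/mismatch_cnt        # non-zero after a check can point at inconsistent mirrors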

Note that for MongoDB-related support discussion you can post on the mongodb-user group or Stack Overflow with the mongodb tag, where your question will reach a larger audience. See also our Technical Support page for additional support resources.

Thanks,
Ramón.

Comment by Nic Cottrell (Personal) [ 01/Sep/16 ]

Doesn't seem like a lot of activity:

[root@lorette nic]# pidstat -d 1
Linux 2.6.32-642.4.2.el6.x86_64 (lorette.sprawk.com) 	09/01/2016 	_x86_64_	(8 CPU)
 
10:08:11 AM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
 
10:08:12 AM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
10:08:13 AM       710      0.00     12.00      0.00  jbd2/md2-8
10:08:13 AM      2245      0.00     12.00      0.00  mysqld
10:08:13 AM      2437     48.00      4.00      0.00  mongod
10:08:13 AM     15848      4.00      4.00      0.00  updatedb
 
10:08:13 AM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
10:08:14 AM      2245     24.00      4.00      0.00  mysqld
10:08:14 AM      2437     12.00      8.00      0.00  mongod
10:08:14 AM     15848     16.00      0.00      0.00  updatedb
 
10:08:14 AM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
 
10:08:15 AM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
10:08:16 AM       710      0.00      4.00      0.00  jbd2/md2-8
10:08:16 AM      2245     16.00     20.00      0.00  mysqld
10:08:16 AM      2437     12.00      0.00      0.00  mongod
 
10:08:16 AM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
 
10:08:17 AM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
10:08:18 AM       710      0.00      4.00      0.00  jbd2/md2-8
10:08:18 AM      2245      0.00     36.00      0.00  mysqld
10:08:18 AM      2437     12.00      8.00      0.00  mongod
 
10:08:18 AM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
10:08:19 AM      2245     16.00      0.00      0.00  mysqld
 
10:08:19 AM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
 
10:08:20 AM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
10:08:21 AM      2245      8.00     12.00      0.00  mysqld
10:08:21 AM      2437      0.00     32.00      0.00  mongod
10:08:21 AM     15848     24.00      0.00      0.00  updatedb
 
10:08:21 AM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
10:08:22 AM      2245     80.00    404.00      0.00  mysqld
10:08:22 AM      2437     16.00     16.00      0.00  mongod
 
10:08:22 AM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
10:08:23 AM      2536      0.00     20.00     12.00  munin-node
 
10:08:23 AM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
10:08:24 AM      2245     88.00    468.00      0.00  mysqld
10:08:24 AM      2437    160.00     24.00      0.00  mongod
10:08:24 AM     15848      8.00      0.00      0.00  updatedb
 
10:08:24 AM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
 
10:08:25 AM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
10:08:26 AM       710      0.00     12.00      0.00  jbd2/md2-8
10:08:26 AM      2245    456.00      4.00      0.00  mysqld
10:08:26 AM      2437     44.00     12.00      0.00  mongod

Comment by Ramon Fernandez Marina [ 01/Sep/16 ]

Is the I/O coming from mongod? Can you try running pidstat -d 1 and posting the output here?

Thanks,
Ramón.

Comment by Nic Cottrell (Personal) [ 31/Aug/16 ]

Now I'm seeing spikes of 100% CPU for the mongod process.
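
A hedged one-liner to capture those spikes together with the disk I/O, assuming sysstat is installed and a single mongod process:

pidstat -u -d -p $(pidof mongod) 1        # per-second CPU and disk figures for mongod only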

Comment by Nic Cottrell (Personal) [ 31/Aug/16 ]

Often in the mongod.log:

2016-08-31T13:10:03.813+0200 I COMMAND  [conn55] serverStatus was very slow: { after basic: 0, after asserts: 0, after connections: 0, after extra_info: 0, after globalLock: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after oplog: 10530, after repl: 10530, after sharding: 10530, after storageEngine: 10530, after tcmalloc: 10530, after wiredTiger: 10530, at end: 10530 }
2016-08-31T13:10:03.813+0200 I COMMAND  [conn55] command local.oplog.rs command: serverStatus { serverStatus: 1, locks: 0, recordStats: 0, oplog: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:19933 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 10600738 } }, Database: { acquireCount: { r: 1 } }, oplog: { acquireCount: { r: 1 } } } protocol:op_query 10601ms
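
For what it's worth, the second line explains most of the latency on its own: timeAcquiringMicros { r: 10600738 } means roughly 10.6 s were spent waiting to acquire the Global read lock, which matches the reported 10601ms almost exactly. A hedged way to see how often this happens per hour (the log path assumes a default RPM install):

grep "serverStatus was very slow" /var/log/mongodb/mongod.log | cut -c1-13 | sort | uniq -c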

Comment by Nic Cottrell (Personal) [ 31/Aug/16 ]

This possibly started just after the 3.2.9-rc0 yum update but continues after the 3.2.9 upgrade and a machine restart.

uname -a gives:
Linux lorette.sprawk.com 2.6.32-642.4.2.el6.x86_64 #1 SMP Tue Aug 23 19:58:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
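
A hedged way to pin down exactly when the package changed on this host, to correlate with the start of the slow queries (the package name assumes the official mongodb-org yum repository):

rpm -q --last mongodb-org-server          # install/upgrade timestamp of the currently installed server package
yum history list mongodb-org-server       # yum transactions touching the package, with dates and IDs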
