[SERVER-24712] Mongod intermittently consuming all available CPU. Created: 22/Jun/16  Updated: 14/Jul/16  Resolved: 30/Jun/16

Status: Closed
Project: Core Server
Component/s: Admin
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Zachary Rollyson Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File diagnostic.data.tgz     PNG File findAndModify_workload.png     PNG File findAndModify_workload_mmapv1.png     Text File iostat.log     Text File iostat2.log     Text File iostat3.log     Text File mongostat.log     Text File mongostat2.log     Text File serverStatusAfter.log     Text File serverStatusDuring.log     Text File ss.log    
Operating System: ALL
Participants:

 Description   

This is a new issue with a deployment created with Cloud Manager. The server was provisioned on AWS through CM's provisioning tool. It was added to a replica set as secondary so that the old primary could subsequently be shut down. I've noticed that once every ~7 seconds, mongod will consume all CPU for about ~10 seconds. I will attach logs of mongostat, iostat & db.serverStatus() during and after a spike.



 Comments   
Comment by Kelsey Schubert [ 13/Jul/16 ]

Hi zach@izimobile.com,

I'd also like to mention that the behavior of findAndModify differs between versions of MongoDB and storage engines in MongoDB. In MongoDB 3.2.0, findAndModifies are retried if a WriteConflictException is encountered (SERVER-21434). In MongoDB 3.2.3, this modification was reverted for the MMAPv1 storage engine (SERVER-22002). To determine if this difference in behavior may be impacting what we are observing in this case, would you please answer the following questions?

  1. Are these findAndModify commands coming from multiple clients?
  2. Do any these findAndModifies include a sort?
  3. Would you provide a sample findAndModify command?

Thank you,
Thomas

Comment by Zachary Rollyson [ 30/Jun/16 ]

Thanks Thomas.

Comment by Kelsey Schubert [ 30/Jun/16 ]

Hi zach@izimobile.com,

Thanks for uploading the ss.log. It confirms my previous observations - every 18 seconds a findAndModify workload is executed.

MMAPv1 and WiredTiger have very different performance characteristics especially regarding cpu. I would recommend either allocating additional resources or modifying your application layer to distribute load if the cpu bottleneck is a problem for you. As I mentioned before, the best place for continued discussion would be the mongodb-users group .

Kind regards,
Thomas

Comment by Zachary Rollyson [ 30/Jun/16 ]

P.S.

Just remembered I had meant to mention that the two unaffected DBs are standalones and the spiking one is a single member replica set.

Comment by Zachary Rollyson [ 29/Jun/16 ]

HI Thomas,

The server with the spiking CPU is on 3.2.7 with wiredTiger. The two servers that are unaffected are running 3.0.12 with mmapv1.

I've attached to files you asked for as iostat3.log & ss.log.

// db.adminCommand({getCmdLineOpts: true});
{
	"argv" : [
		"/usr/bin/mongod",
		"--config",
		"/etc/mongod.conf"
	],
	"parsed" : {
		"config" : "/etc/mongod.conf",
		"net" : {
			"port" : 27017
		},
		"security" : {
			"authorization" : "enabled"
		},
		"storage" : {
			"dbPath" : "/mongo_data",
			"journal" : {
				"enabled" : true
			},
			"mmapv1" : {
				"smallFiles" : true
			}
		},
		"systemLog" : {
			"destination" : "file",
			"logAppend" : true,
			"path" : "/var/log/mongodb/mongod.log"
		}
	},
	"ok" : 1
}

// db.adminCommand({buildInfo: true});
{
	"version" : "3.0.12",
	"gitVersion" : "33934938e0e95d534cebbaff656cde916b9c3573",
	"OpenSSLVersion" : "OpenSSL 1.0.1f 6 Jan 2014",
	"sysInfo" : "Linux ip-10-229-66-2 3.13.0-24-generic #46-Ubuntu SMP Thu Apr 10 19:11:08 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49",
	"versionArray" : [
		3,
		0,
		12,
		0
	],
	"loaderFlags" : "",
	"compilerFlags" : "-Wnon-virtual-dtor -Woverloaded-virtual -std=c++11 -fno-omit-frame-pointer -fPIC -fno-strict-aliasing -ggdb -pthread -Wall -Wsign-compare -Wno-unknown-pragmas -Winvalid-pch -pipe -Werror -O3 -Wno-unused-local-typedefs -Wno-unused-function -Wno-deprecated-declarations -Wno-unused-but-set-variable -Wno-missing-braces -fno-builtin-memcmp -std=c99",
	"allocator" : "tcmalloc",
	"javascriptEngine" : "V8",
	"bits" : 64,
	"debug" : false,
	"maxBsonObjectSize" : 16777216,
	"ok" : 1
}

// db.adminCommand({hostInfo: true});
{
	"system" : {
		"currentTime" : ISODate("2016-06-29T15:16:35.647Z"),
		"hostname" : "ip-172-31-55-29",
		"cpuAddrSize" : 64,
		"memSizeMB" : 3952,
		"numCores" : 2,
		"cpuArch" : "x86_64",
		"numaEnabled" : false
	},
	"os" : {
		"type" : "Linux",
		"name" : "Ubuntu",
		"version" : "14.04"
	},
	"extra" : {
		"versionString" : "Linux version 3.13.0-68-generic (buildd@lgw01-46) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #111-Ubuntu SMP Fri Nov 6 18:17:06 UTC 2015",
		"libcVersion" : "2.19",
		"versionSignature" : "Ubuntu 3.13.0-68.111-generic 3.13.11-ckt27",
		"kernelVersion" : "3.13.0-68-generic",
		"cpuFrequencyMHz" : "2400.106",
		"cpuFeatures" : "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm xsaveopt fsgsbase bmi1 avx2 smep bmi2 erms invpcid",
		"pageSize" : NumberLong(4096),
		"numPages" : 1011719,
		"maxOpenFiles" : 64000
	},
	"ok" : 1
}

Thanks again,
Zach

Comment by Kelsey Schubert [ 29/Jun/16 ]

Hi zach@izimobile.com,

Thank you for the additional information. Are the two unaffected servers both running with same configuration (MongoDB version, storage engine)?

Since your unaffected node is running MongoDB 3.0.12 and does not have full time diagnostic capture, I have modified the previous shell script to manually collect the data relevant to this issue.

  1. Please execute the following script:

    delay=1
    mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc: true}))); sleep(1000*$delay)}" >ss.log &
    iostat -k -t -x $delay >iostat.log &
    

  2. When you have collected the data for 15 minutes, terminate the data collection processes, for example:

    killall mongo iostat
    

  3. Please also provide host and configuration details by running the following commands from the mongo shell:

    db.adminCommand({getCmdLineOpts: true});
    db.adminCommand({buildInfo: true});
    db.adminCommand({hostInfo: true});
    

Please attach ss.log and iostat.log as well as the output of the mongo commands listed in the final step to this ticket.

Thank you,
Thomas

Comment by Zachary Rollyson [ 28/Jun/16 ]

P.S.

I also just uploaded ~5min of mongostat, should that be useful.

Comment by Zachary Rollyson [ 28/Jun/16 ]

Hi Thomas,

I've just uploaded iostat2.log from an unaffected server. This one is using MongoDB v3.0.12, however, and has no diagnostic.data. Is there something else that could be useful?

Thanks,
Zach

Comment by Kelsey Schubert [ 23/Jun/16 ]

Hi zach@izimobile.com,

If the same application is running on other servers, would you please execute the following shell script on a server that isn't affected?

delay=1
iostat -k -t -x ${delay:?} >iostat.log &

This will collect iostat data each second, and will help us to correlate the CPU numbers to events recorded in the diagnostic.data, and help us compare against the behavior we are observing on the affected server.

After running this script for 15 minutes, please upload the following from the unaffected server:

  • iostat.log
  • diagnostic.data

Thank you,
Thomas

Comment by Zachary Rollyson [ 23/Jun/16 ]

I have this running on three servers and the only one with an issue is this one set up by CM.

Comment by Kelsey Schubert [ 23/Jun/16 ]

Hi zach@izimobile.com,

Sorry for the delay, I've had a chance to examine the diagnostic.data you have provided, and I do not see anything to indicate a bug in the MongoDB server. I would suggest investigating your application layer to determine whether the behavior you are observing is originating there. It appears that every 18 seconds a findAndModify workload is executed for 13 seconds:

For MongoDB-related support discussion please post on the mongodb-users group or Stack Overflow with the mongodb tag. A question like this involving more discussion would be best posted on the mongodb-users group. See also our Technical Support page for additional support resources.

Kind regards,
Thomas

Comment by Zachary Rollyson [ 23/Jun/16 ]

Status for this has been "waiting for user input" for nearly 24 hours, anything I'm missing here?

Comment by Zachary Rollyson [ 22/Jun/16 ]

Hello Ramón,

Just uploaded diagnostic.data.tgz. If gzip won't work for you let me know and I'll upload some other format.

Thanks,
Zach

Comment by Ramon Fernandez Marina [ 22/Jun/16 ]

Can you please upload the contents of the diagnostic.data directory in your dbpath?

Thanks,
Ramón.

Comment by Zachary Rollyson [ 22/Jun/16 ]

To also note, this deployment is running with automation, monitoring and backup agents.

Generated at Thu Feb 08 04:07:12 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.