[SERVER-26788] Running MongoDB on machines with multiple physical cpus Created: 26/Oct/16  Updated: 16/Nov/21  Resolved: 29/Oct/16

Status: Closed
Project: Core Server
Component/s: Concurrency
Affects Version/s: 3.2.10
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Tyler Brock Assignee: Geert Bosch
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File htop.png    
Participants:

 Description   

Hey everyone, we have a production replica set consisting of three nodes that has been running well for two years. It looks like this from a configuration standpoint:

  • 3 m4.2xlarge instances (8 cores)
  • xfs filesystem for data drives
  • ebs volumes (with 8000 provisioned iops – max for instance type)
  • wired tiger storage engine
  • ssl, auth, etc

The performance is great, but we want to scale up our nodes to handle a potential spike in usage over the next two weeks. As we are not particularly i/o bound given our usage of MongoDB, and appear to be largely cpu bound on these boxes (from what I can tell), we have transitioned these nodes from m4.2xlarge (8 cores) to m4.4xlarge (16 cores).
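
For context, the "largely cpu bound" read is mostly eyeballed; a rough sketch of the kind of mongo shell check that backs it up (assuming the 3.2 serverStatus field layout, and treating WiredTiger ticket exhaustion/queueing only as a loose proxy for i/o pressure) would be:

// Snapshot WiredTiger ticket availability and the global lock queue.
// Plenty of available tickets and an empty queue while the CPUs are busy
// points away from an i/o bottleneck. (3.2 serverStatus layout assumed.)
var s = db.serverStatus();
printjson({
    readTicketsAvailable:  s.wiredTiger.concurrentTransactions.read.available,
    writeTicketsAvailable: s.wiredTiger.concurrentTransactions.write.available,
    globalLockQueue:       s.globalLock.currentQueue
});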

To my surprise it appears as though mongod is only using the first 8 (0-7) of the 16 cores available on this machine. Now, I realize that:

  • In going from 8 to 16 cores we may now have two physical cpus backing our instance
  • taskset and cpuset can be used to set core/processor affinity and I do not believe they are in use (we are using the init script from the Amazon linux package)
  • numactl should specify that memory usage be interleaved instead of preferring a node or physical cpu (again, confirmed via the package init script; a shell check for this is sketched after this list)
  • Using `htop` as a view onto cpu usage on virtualized hardware is a potentially flawed metric for various reasons
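
As a cross-check on the numactl point (sketch only; it assumes the getLog command available on 3.2-era servers), mongod logs a startup warning when it detects NUMA without interleaved memory, and that warning can be pulled straight from the shell:

// Fetch mongod's startup warnings; a "You are running on a NUMA machine"
// entry here would mean mongod was not started with interleaved memory.
// (Assumes the getLog: "startupWarnings" command on 3.2-era servers.)
printjson(db.adminCommand({getLog: "startupWarnings"}).log);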

So my question is: why does mongod appear to use only 8 of the 16 cores available on these boxen?

It's possible that the Linux scheduler doesn't bother scheduling tasks onto the second physical cpu until there are more runnable threads, so as to take advantage of CPU caches. Right now there is not much load to speak of, so that is my current running theory.

Having never run mongod on a machine with multiple physical cpus in production before, I'm only guessing as to what the issue may be. Any clues as to what I might be seeing, 10geneers?



 Comments   
Comment by Geert Bosch [ 28/Oct/16 ]

Glad to know things are working for you, Tyler! I had looked up the m2.4xlarge on AWS, but it is only listed as a previous-generation option. The main page explicitly refers to the previous generation page for m2 instances. I hope your upgraded system performs according to expectations!

Comment by Tyler Brock [ 28/Oct/16 ]

Yup, it turns out that once things got cranking the second processor kicked into gear. Thanks again Geert.

Comment by Tyler Brock [ 26/Oct/16 ]

Thanks, Geert! I'll try that out when traffic dies down tonight, but m4.4xlarge definitely has 16 vCPUs and m4.2xlarge definitely has 8 vCPUs; check it out: http://www.ec2instances.info/?cost_duration=monthly&selected=m4.2xlarge,m4.4xlarge

Comment by Geert Bosch [ 26/Oct/16 ]

There should not be any special configuration required for MongoDB to use all available cores/hyperthreads, and performance should generally scale well to 16 cores. It looks as if the m2.4xlarge is listed as providing only 8 vCPUs, while the m2.2xlarge has 4 vCPUs. Not sure why it's showing up in top the way it does.
You can run:

// Seed 1000 documents so the update workload has something to hit.
for (var i = 0; i < 1000; i++)
    db.benchrun.update({_id: i}, {$set: {x: 0}}, {upsert: true});

// Run 16 parallel update threads for 20 seconds against random _ids.
printjson(benchRun({
    ops: [{
        ns: db.benchrun.getFullName(),
        op: "update",
        query: {_id: {"#RAND_INT": [0, 1000]}},
        update: {$inc: {x: 1}}
    }],
    parallel: 16,
    seconds: 20,
    host: db.getMongo().host
}));
db.benchrun.drop();

to provide an artificial load that should show all CPUs busy. You might want to check /proc/cpuinfo to see if the number of listed cpus matches expectations.
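
A mongo shell alternative to reading /proc/cpuinfo (sketch only; it assumes the 3.2-era hostInfo output layout) is to ask the server directly:

// Ask mongod how the host looks to it: core count, architecture, NUMA flag.
// (Assumes the 3.2-era db.hostInfo() output layout.)
var hi = db.hostInfo();
print("cores seen: " + hi.system.numCores + " (" + hi.system.cpuArch + ")" +
      ", NUMA enabled: " + hi.system.numaEnabled);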
