[SERVER-23798] Increased ns file IO in 3.0 Created: 19/Apr/16  Updated: 06/Dec/22  Resolved: 14/Sep/18

Status: Closed
Project: Core Server
Component/s: MMAPv1
Affects Version/s: 3.0.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Greg Murphy Assignee: Backlog - Storage Execution Team
Resolution: Won't Fix Votes: 0
Labels: mmapv1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-24824 Mongo 3.0.12 with MMAPv1 can't serve ... Closed
Assigned Teams:
Storage Execution
Operating System: ALL
Steps To Reproduce:

Create a MongoDB 2.6 instance using MMAPv1 with enough databases that the cumulative size of their ns files is greater than available physical memory on the server.

Monitor the filesystem cache usage and disk IO on the server.

Upgrade this server to MongoDB 3.0 (still using MMAPv1) and monitor the same metrics.
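
A minimal sketch of one way to reproduce this from the shell. The database count, names, and paths are illustrative and assume a mongod on the default port with vmtouch available in the current directory:

# Create several thousand databases; each new database gets a 16MB ns file by
# default, so ~6000 databases is roughly 94GB of ns files.
for i in $(seq 1 6000); do
    mongo --quiet --eval "db.getSiblingDB('repro_db_$i').c.insert({x: 1})"
done

# Check how much of the ns files is resident in the filesystem cache, and watch disk IO.
./vmtouch -v /var/lib/mongodb/*.ns | tail -5
iostat -x 1 xvdg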


 Description   

Following upgrades from 2.6.9 to 3.0.9 (still using MMAPv1) we noticed significantly higher disk IO against the volume hosting MongoDB's data files.

This has become particularly apparent on replica sets with large numbers of databases (multiple thousands).

From investigation, this appears to be caused by a change in MongoDB's behaviour when reading ns files.

To give a precise example, we have a replica set that is currently being upgraded: 3 x 2.6.9 nodes and 1 x 3.0.9 node (hidden, non-voting).

The replica set has 5570 databases and uses the 16MB default ns size. If MongoDB loaded all of these ns files into memory, it would require 87GB of memory.
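
The arithmetic behind that 87GB figure, as a quick shell sanity check:

# 5570 databases x 16MB ns file each
echo $((5570 * 16))                    # 89120 MB
echo "scale=1; 5570 * 16 / 1024" | bc  # ~87.0 GB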

The existing 2.6.9 nodes run comfortably on EC2 r3.large instances (14GB RAM), and running vmtouch shows that only a tiny percentage of the ns files' pages are loaded into the filesystem cache:

# ./vmtouch -v /var/lib/mongodb/*.ns | tail -5
 
           Files: 5570
     Directories: 0
  Resident Pages: 188549/22814720  736M/87G  0.826%
         Elapsed: 0.97846 seconds

However, running the 3.0.9 node as an r3.large makes it unusable, as the filesystem cache is constantly flooded with the ns files (and the server takes 1hr 26 mins to start):

# ./vmtouch -v /var/lib/mongodb/*.ns | tail -5
 
           Files: 5570
     Directories: 0
  Resident Pages: 2905047/22814720  11G/87G  12.7%
         Elapsed: 0.67599 seconds

The server then performs a constant, significant amount of read IO, presumably in an attempt to keep the entire contents of the ns files resident in memory:

# iostat -x 1 xvdg
Linux 3.13.0-77-generic (SERVER) 	04/19/2016 	_x86_64_	(2 CPU)
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.43    0.06    2.26   46.98    0.62   46.65
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdg              0.28     1.57 2185.88   21.08 33805.04   521.00    31.11     2.68    1.21    0.80   43.97   0.43  94.96
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          18.75    0.00    3.12   40.62    0.00   37.50
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdg              0.00     1.00 2430.00   73.00 37996.00   480.00    30.74     1.72    0.69    0.68    0.99   0.35  88.40
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.28    0.00    3.14   45.03    0.00   45.55
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdg              0.00     0.00 2285.00    0.00 35184.00     0.00    30.80     1.65    0.72    0.72    0.00   0.40  92.00
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.57    0.00    3.66   45.55    0.52   48.69
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdg              0.00    81.00 2525.00  136.00 40132.00 16740.00    42.74     9.04    3.40    0.64   54.56   0.36  95.60
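
A sketch of how this read IO can be attributed to the ns files rather than to the data files, assuming vmtouch is in the current directory and sysstat's pidstat is installed:

# Re-run vmtouch periodically to watch ns-file residency in the page cache grow
# as mongod reads the files back in.
watch -n 60 "./vmtouch -v /var/lib/mongodb/*.ns | tail -3"

# Confirm that the reads come from the mongod process itself.
pidstat -d -p $(pidof mongod) 5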

Changing the instance type to an r3.4xlarge (122GB) alleviates the problem, as there is now enough memory for all of the ns files to remain resident (and the server starts in 35 minutes, with the IO subsystem being the limiting factor):

# ./vmtouch -v /var/lib/mongodb/*.ns | tail -5
 
           Files: 5572
     Directories: 0
  Resident Pages: 22822912/22822912  87G/87G  100%
         Elapsed: 0.94295 seconds

This isn't a feasible option for us, though: an r3.4xlarge instance costs $1,102 for a 31-day month compared to $137 for an r3.large, and across a 3-node replica set that difference adds up quickly.



 Comments   
Comment by Abhishek Amberkar [ 22/Nov/17 ]

Thank you Kelsey,

Setting smaller --nssize fixed the issue for us.

Comment by Kelsey Schubert [ 19/May/17 ]

Hi abhishek.amberkar and gregmurphy,

The 'Backlog' fixVersion indicates that this issue is not currently scheduled for an upcoming release. We understand the impact of this behavior on your deployments, and have discussed this issue internally. Depending on your schema design, you may not need the default 16MB namespace file and would benefit from calculating a smaller --nssize to mitigate the impact of this issue.

Kind regards,
Thomas
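
A sketch of the mitigation described in the comment above, for illustration only. The 2MB value, the some_db database name, and the paths are placeholders, and, as far as I know, nsSize only affects databases created after the change, so existing members typically need to be rebuilt (e.g. via an initial sync or dump/restore) before their ns files shrink:

# How many namespaces (collections + indexes) does a given database actually use?
# MMAPv1 exposes them via the system.namespaces collection.
mongo --quiet some_db --eval "print(db.system.namespaces.count())"

# If the counts are small, a much smaller ns file suffices; e.g. start mongod with a 2MB ns size
mongod --dbpath /var/lib/mongodb --nssize 2

# ...or equivalently in the YAML configuration file:
#   storage:
#     mmapv1:
#       nsSize: 2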

Comment by Greg Murphy [ 02/May/17 ]

I'm afraid I'm just the user who reported it, so don't have any insight into when it will be resolved.

Comment by Abhishek Amberkar [ 02/May/17 ]

@Thomas, @Greg

Has there been any progress on this issue?

Comment by Abhishek Amberkar [ 06/Mar/17 ]

Hi Thomas,

Is this issue still being worked on?

Comment by Kelsey Schubert [ 03/Jan/17 ]

Hi abhishek.amberkar and raghusiddarth,

Unfortunately, we were not able to complete this work in time for it to be included in MongoDB 3.4. We're currently in the planning phase for our next major release and will update this ticket's fixVersion as part of that process.

Kind regards,
Thomas

Comment by Abhishek Amberkar [ 15/Dec/16 ]

Hi Thomas,

Is there any update on this issue?

Comment by Raghu Udiyar [ 09/Nov/16 ]

Hi Thomas, can you let us know the status of this? I see that MongoDB 3.4 has been released; does that release address this issue?

Comment by Kelsey Schubert [ 02/May/16 ]

Hi gregmurphy,

Sorry for the silence. I have reproduced this issue and observed that, with a large number of databases on MMAPv1, MongoDB 3.0 and 3.2 start up more slowly than 2.6. Please continue to watch this ticket for updates.

Kind regards,
Thomas

Comment by Greg Murphy [ 02/May/16 ]

Hopefully this ticket being put in the 3.3 backlog means the issue has been reproduced.

To reiterate (and hopefully increase the priority), the combination of this issue and the one I've raised in SERVER-23433 means that post-2.6, MongoDB can't realistically be run in production on instances that host large numbers of databases and collections/indexes.

I believe this to be a significant area of concern regarding MongoDB's scalability. Of course, if MongoDB isn't designed to support this kind of workload, there should be documentation making users aware that there is a limit to the number of collections/indexes that can be created when using WiredTiger, and that there is a significant memory impact when running a large number of databases when using MMAPv1.
