[SERVER-20336] O(N^2) perf regression in listCollections and similar code paths [BLOCKING Mongo 3.0 Adoption] Created: 09/Sep/15  Updated: 23/Oct/15  Resolved: 09/Sep/15

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 3.0.5
Fix Version/s: None

Type: Bug
Priority: Blocker - P1
Reporter: Michael Lehenbauer
Assignee: Ramon Fernandez Marina
Resolution: Duplicate
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File listCollectionsGrowth.png    
Issue Links:
Duplicate
duplicates SERVER-18624 listCollections command should not be... Closed
Operating System: ALL
Steps To Reproduce:

Run the following script using a mongo3 shell against a mongo3 mongod. I've reproduced against 3.0.4, 3.0.5, and 3.0.6.

conn = new Mongo();
db = conn.getDB("dummy");
var N = 16000;
 
print('Creating collections');
for(var i = 0; i < N; i++) {
  db.createCollection('collection' + i);
}
 
print('Listing collections.');
var start = Date.now();
db.getCollectionNames();
print('Time to list collections: ', Date.now() - start);

You should see a time of ~15-30 seconds for the getCollectionNames() call, and it gets much worse as N increases, since it's O(N^2). If you run the same script on mongo 2.6, it will complete in under a second, even for large values of N.

You'll also see a hang if you restart your mongod server or do a mongodump, and probably during many other operations.

While db.getCollectionNames() is in progress, any writes will be blocked.
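
To see how the time grows with N rather than measuring a single value, the script above can be extended to time the listing at several collection counts. The following is only a sketch: measureListTimes is a hypothetical helper (not part of the mongo shell), it assumes the sizes are given in ascending order, and it relies only on the createCollection() and getCollectionNames() calls already used above.

```javascript
// Hypothetical helper: times getCollectionNames() at each collection count.
// `db` is any object exposing createCollection(name) and getCollectionNames().
// `sizes` must be in ascending order, e.g. [1000, 2000, 4000].
function measureListTimes(db, sizes) {
  var results = [];
  var created = 0;
  sizes.forEach(function (n) {
    // Create only the collections still missing to reach n.
    for (; created < n; created++) {
      db.createCollection('collection' + created);
    }
    var start = Date.now();
    var names = db.getCollectionNames();
    results.push({
      collections: n,
      listed: names.length,       // may include system collections
      ms: Date.now() - start
    });
  });
  return results;
}

// Assumed usage from a mongo shell:
// measureListTimes(new Mongo().getDB('dummy'), [1000, 2000, 4000]).forEach(printjson);
```

Passing the database handle in as a parameter keeps the helper independent of the shell globals, so it can also be exercised against a stub object.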


 Description   

Mongo 3 has an O(N^2) perf issue where N is the number of collections in a database. This is a regression from 2.x. For our dataset this causes a ~15-minute hang, making mongo 3 completely unusable.

The hang can be hit in many ways, including:
1. When calling db.getCollectionNames().
2. When starting mongod.
3. When doing a mongodump.
4. When a secondary mongod server in a replica set transitions to be primary.

The O(N^2) nature can be clearly seen in the table below, which shows the measured time to perform a db.getCollectionNames() for a given number of collections. There's also an attached graph showing a quadratic best fit.

Collections (in 1000s)    List collections time (seconds)
1                         0.164
2                         0.464
4                         1.6
8                         6.1
16                        24
32                        95
64                        439
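
One quick way to check the quadratic claim against these numbers: if time grows as N^k, then each doubling of N multiplies the time by 2^k, so k can be estimated from consecutive rows. The sketch below (plain JavaScript, runnable in Node.js; growthExponents is a name introduced here for illustration) does that for the measurements in the table above. The estimates start around 1.5 for the smallest sizes, where fixed overhead still matters, and settle close to 2 as N grows, which is consistent with O(N^2).

```javascript
// Measurements copied from the table above: [collections in 1000s, seconds].
var samples = [
  [1, 0.164], [2, 0.464], [4, 1.6], [8, 6.1], [16, 24], [32, 95], [64, 439]
];

// For each consecutive pair of rows, estimate k in time ~ N^k:
//   k = log(t2 / t1) / log(n2 / n1)
function growthExponents(points) {
  var exponents = [];
  for (var i = 1; i < points.length; i++) {
    var nRatio = points[i][0] / points[i - 1][0];
    var tRatio = points[i][1] / points[i - 1][1];
    exponents.push(Math.log(tRatio) / Math.log(nRatio));
  }
  return exponents;
}

growthExponents(samples).forEach(function (k, i) {
  console.log(samples[i][0] + 'k -> ' + samples[i + 1][0] + 'k: exponent ~ ' + k.toFixed(2));
});
```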

Context:

  • We have a multi-tenant system, where each tenant is served by a new collection.
  • As a result, we have on the order of 100,000 collections in a database.
  • We started upgrading to mongo 3 in our production environment, but ran into this issue (fortunately before we upgraded the primary) and had to do an emergency rollback to 2.6.
  • We're now stuck on 2.6 for the moment, but extremely eager to get the benefits of mongo 3 to address pressing issues in production.

Can you please acknowledge this bug and provide an estimate for when it can be fixed and released?



 Comments   
Comment by Ramon Fernandez Marina [ 23/Oct/15 ]

Thanks for reporting back katfang, glad to hear you're no longer seeing listCollections performance issues on MMAPv1 after SERVER-18624 was fixed.

Regards,
Ramón.

Comment by Katherine Fang [ 21/Oct/15 ]

Hi Ramon,

Just following up. We've tested out list collections with 3.0.7 and it seems much faster.
Our cluster which used to have a list collections time of 25291ms is now 217ms.
The cluster that used to see minutes of downtime during a list collection now comes back in 330ms.

Thanks again for the fix.

Comment by Michael Lehenbauer [ 08/Oct/15 ]

Thanks Ramon! I haven't gotten a chance to take a look (bit swamped at the moment), but will let you know once I do. Thanks for getting the fix through!

-Michael

Comment by Ramon Fernandez Marina [ 06/Oct/15 ]

mikelehen@google.com, this is to let you know that we've released release candidate 3.0.7-rc0 today, which includes a fix for this issue (see SERVER-18624). Could you please try out 3.0.7-rc0 and confirm that the performance issues of listCollections have been addressed?

Thanks,
Ramón.

Comment by Michael Lehenbauer [ 11/Sep/15 ]

Thanks Ramon! That will do nicely for us.

Thanks for following up,
-Michael

Comment by Ramon Fernandez Marina [ 10/Sep/15 ]

mikelehen, after internal discussion we've scheduled SERVER-18624 for versions 3.1.9 and 3.0.7. You can find more information about the tentative release dates for MongoDB versions here. Please watch SERVER-18624 if you're interested in further updates.

If I understand correctly, this issue only affects the MMAPv1 storage engine, so one option you may consider is switching to the WiredTiger storage engine, which offers, among other features, data compression that may be of interest to multi-tenant users.

Regards,
Ramón.

Comment by Michael Lehenbauer [ 09/Sep/15 ]

Thanks ramon.fernandez. That bug seems to be 4 months old and currently unassigned, yet this is a blocking issue for us. Can you clarify the timeline for which we could expect to see this fixed in the 3.0 branch?

Comment by Ramon Fernandez Marina [ 09/Sep/15 ]

Thanks for the additional information mikelehen. We're aware of the behavior you describe and SERVER-18624 is open to fix it. We're aiming for a fix in the current development cycle; once a fix is ready we'll evaluate the impact of the backport to the v3.0 branch.

I'm going to mark this ticket as a duplicate of SERVER-18624; feel free to vote for SERVER-18624 and watch it for updates.

Regards,
Ramón.

Comment by Michael Lehenbauer [ 09/Sep/15 ]

We're using mmap. Sorry for the omission.

Comment by Ramon Fernandez Marina [ 09/Sep/15 ]

mikelehen, what storage engine are you using in 3.0?
