[SERVER-20716] Slow listDatabases with WT Created: 01/Oct/15  Updated: 12/Nov/15  Resolved: 12/Nov/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.0.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Anthony Brodard Assignee: Unassigned
Resolution: Duplicate Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File dbv.log     File mongod.log     File ss.log    
Issue Links:
Related
related to SERVER-20961 Large amounts of create and drop coll... Closed
related to SERVER-21066 Investigate listDatabases option to r... Closed
related to SERVER-21067 Investigate having the storage size o... Closed
Operating System: Linux
Participants:

 Description   

On instance with more than 2k databases, a listDatabase take more than 11s :

2015-10-01T15:01:24.173+0000 I COMMAND  [conn3164] command admin.$cmd command: listDatabases { listDatabases: 1.0 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:189925 locks:{ Global: { acquireCount: { r: 5908 } }, Database: { acquireCount: { r: 2954 } } } 11609ms

Tested on several instances :

2015-10-01T15:02:41.537+0000 I COMMAND  [conn4713] command admin.$cmd command: listDatabases { listDatabases: 1.0 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:213799 locks:{ Global: { acquireCount: { r: 6626 } }, Database: { acquireCount: { r: 3313 } } } 13618ms
2015-10-01T15:03:10.178+0000 I COMMAND  [conn3400] command admin.$cmd command: listDatabases { listDatabases: 1.0 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:187243 locks:{ Global: { acquireCount: { r: 5826 } }, Database: { acquireCount: { r: 2913 } } } 10917ms
2015-10-01T15:03:47.616+0000 I COMMAND  [conn316099] command admin.$cmd command: listDatabases { listDatabases: 1.0 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:192677 locks:{ Global: { acquireCount: { r: 5996 } }, Database: { acquireCount: { r: 2998 } } } 11979ms

On another instance, running in 2.6 with mmapv1, listing about 8x more databases take less than 1s :

2015-10-01T15:07:56.751+0000 [conn224132] command admin.$cmd command: listDatabases { listDatabases: 1.0 } keyUpdates:0 numYields:0 locks(micros) R:23663 r:128494 reslen:1300875 866ms

Seems related to SERVER-17078

Mongo version : 3.0.6
OS : Linux / Ubuntu 14.04
Storage configuration :
storageEngine = wiredTiger
wiredTigerCollectionBlockCompressor = snappy
wiredTigerCacheSizeGB = 3



 Comments   
Comment by Ian Whalen (Inactive) [ 12/Nov/15 ]

Hi Anthony, once again thanks a lot for the bug report. Because it looks like this is a particular symptom that we're going to be addressing through a few different fixes (SERVER-20961, SERVER-21067, SERVER-3181) I'm going to close this as a duplicate and link to those 3 issues.

We try to only keep issues open when we know that code will be committed and attached to it, and in this case the code that's relevant for you will show up on these other 3 issues; for updates as we continue to work on these please add yourself as a watcher there.

Comment by Anthony Brodard [ 27/Oct/15 ]

Hi David,

I think we can wait until the mentioned tickets are closed.
About disabling dbstat on MMS, it's already disabled since a very long time. The agent is up to date, and have been restarted many times since the modification.

Thank you for your help,
Anthony

Comment by David Hows [ 26/Oct/15 ]

Hi Anthony,

One last thing, you may be able to stop the MMS agent from running the listDatabases command by disabling dbstats collection.

You can find instructions on how to do this at: https://docs.cloud.mongodb.com/faq/#how-does-mms-gather-database-statistics

Regards,
David

Comment by David Hows [ 22/Oct/15 ]

Thanks Anthony,

Thats great background to this issue.

Thus far from investigating this issue we have raised 3 tickets, which suggest changes that could be made here to hopefully mitigate the problems you are seeing:

  1. SERVER-20961 - which aims to remove some of the overhead in the WiredTiger leg of each operation (currently targeted to be in 3.0.8 and in 3.2)
  2. SERVER-21067 - which looks to cache some of the data around database size, removing the need for a size lookup for each collection/index
  3. SERVER-21066/SERVER-3181 - option to return only the database names, removing some overhead

For now, I think there is little more we can do on this ticket as the work will be on the tickets listed above.

Let me know if there is more I can do to help here or if you are okay with my marking this ticket as being completed by the above tickets.

Regards,
David

Comment by Anthony Brodard [ 22/Oct/15 ]

Hi David,

Thank you for those useful explanation.

This problem occurs principally when :

  • The MMS Agent run a listDatabase on each mongod, located on the same server. In this case, a listDatabase can take more than 1 minute. Before 3.0.6, we had several issues because the agent closed the connection after few seconds (timeout), but the operation stayed active in the instance and was never released (even a killOp() doesn't solve the problem). We already had more than 10 listDatabases running on the same instance. This have been fixed since 3.0.6. I don't know why the agent need to run this command, and I don't know if it need the database size.
  • We already had a problem to do a full resync of a secondary member, because before starting replication, it run a listDatabases on the primary, and the query go in timeout after few seconds. SERVER-17078 have already considerably reduced the time to run this command, and we don't get any resync issue since the 3.0.5.

We list the databases manually rarely, and, even if it can be useful in some cases, the database size is not a essential information.

Regards,
Anthony

Comment by David Hows [ 22/Oct/15 ]

Hi Anthony,

Thanks for the extra logs, that has given me a much larger grasp on the situation.

I will speak in broad numbers here, but it appears that over the whole of your system you have only 4k databases. Within those databases there are 70k collections. And combined you have a total of 230k collections and indexes.

When you run the listDatabases command we do 2 things, first generate a list of all databases, and then we go and gather the size of each of those databases. Under MMAPv1 this is pretty straightforward as all the data for all the collections and indexes in a given databases are stored in that databases files, so we can just check on the size of those files.

Under WiredTiger we store each collection and each index in its own individual file. When we go to find the size of the database, we need to go out and gather the sizes from each of the 230K collections and indexes to report on the sizes. From the verbose outputs it shows that we do about 6 or 7 of these entities per microsecond on your system, which is where the time is spent.

I will be raising some tickets looking at ways we can improve how we go about this, but have a question for you. Do you use those database sizes when listing the databases? Or is this simply looking for all the names to iterate?

Thanks!
David

Comment by Anthony Brodard [ 21/Oct/15 ]

Thanks Ramon,

@David, I've uploaded a huge verbose log file.
I also updated the instance in 3.0.7.

Regards,
Anthony

Comment by Ramon Fernandez Marina [ 19/Oct/15 ]

anthony@brodard.me, I've created an upload portal David requested. Any data you upload will only be visible to MongoDB engineers.

Thanks,
Ramón.

Comment by Anthony Brodard [ 19/Oct/15 ]

Hi David,

I confirm we don't create or drop collection during normal operation.
Can you give me a secure way to send the verbose file, and confirm it will not be public ?

Regards,
Anthony

Comment by David Hows [ 18/Oct/15 ]

Hi Anthony,

Thanks for taking the time to gather that data for me and running those extra tests

I had hoped that your workload would mirror something akin to the workload seen in SERVER-20961, unfortunately, this does not appear to be the case (I see no drops or collection creation). Infact other than the list databases your system seems to perform well.

Would you be able to grab the verbose logs as you suggested? I'd like to review the operations you are running. If you ned a more private way to upload files so that they aren't attached to this ticket please let me know.

Thanks,
David

Comment by Anthony Brodard [ 16/Oct/15 ]

Hi David,

Sorry about the late reply.
I've attached an anonymized log file from the instance. I'm not in verbose mode, so I'm not sure you'll find interesting things...

Additionally, can you confirm how you are running the listDatabase command in these cases?
-> As you can see in the log file, a listDatabases is launched every 15 minutes... by the MMS Agent.

For information, we run 10 mongod instances on the same host (in docker containers). Sometimes, listDatabases command were stuck in one or many instances, locking all operations, and causing high load on the physical server. This problem seems to have been fixed since we have upgraded in 3.0.6, but slow listDatabases is still present.
To exclude the docker environment to the possible causes, I've moved and instance outside of Docker. All logs come from this instance.
I also shutting down all containers to keep only this mongod instance, but the problem was still here.
Restarting the instance don't have any impact.

If needed, I can activate verbose logs on the instance.

Regards,
Anthony

Comment by Anthony Brodard [ 16/Oct/15 ]

Add mongod logs

Comment by David Hows [ 15/Oct/15 ]

Hi Anthony,

I've been looking over this again, with context from another engineer.

One thing that has sprung up is that your system has 73K open data handles, which is a not insignificant amount, especially since your system was performing few operations at the time.

Would you be able to upload the log requested as some of the historical information may be able to inform why this value is where it is.

Additionally, would you be able to restart the instance and see if that has a positive impact?

Thanks,
David

Comment by David Hows [ 11/Oct/15 ]

Hi Anthony,

Sorry for the delay in getting back to you.

I've looked through the files you attached and have not yet been able to find any correlation.

Would you be able to grab the logfile as well? I added that to the initial request in a second edit so it may not have been emailed to you.

Additionally, can you confirm how you are running the listDatabase command in these cases?

Thanks,
David

Comment by Anthony Brodard [ 02/Oct/15 ]

Hi David,

I've added required files to the ticket.

Regards,
Anthony

Comment by David Hows [ 02/Oct/15 ]

Hi Anthony,

I've tried to reproduce this under 3.0.6 and listed off 5K databases in under 200ms, as below.

2015-10-02T13:56:24.523+1000 I COMMAND  [conn1] command admin.$cmd command: listDatabases { listDatabases: 1.0 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:287907 locks:{ Global: { acquireCount: { r: 10002 } }, Database: { acquireCount: { r: 5001 } } } 185ms

Would you be able to gather the run the following commands for me, then run the reproduction again from your system to see if we can shed further light on this issue:

mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:true}))); sleep(1000)}" >ss.log &
mongo --eval "db.version()" > dbv.log

Can you also attach the logfile from the instance which is showing the slowness.

Thanks,
David

Generated at Thu Feb 08 03:55:03 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.