[SERVER-18988] Terrible Query Performance Created: 16/Jun/15  Updated: 10/Mar/16  Resolved: 07/Mar/16

Status: Closed
Project: Core Server
Component/s: Performance, WiredTiger
Affects Version/s: 3.0.3
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: JinHui Assignee: Bruce Lucas (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File DimUserStatus.txt     Zip Archive MongoSlowQueryLog.zip     PNG File OpsManager_ss.png     PNG File Slow UserName.png     File dblog.tar     Text File dbstats.txt     File dignostic.tar     Zip Archive iSH_DB_Log.zip     Zip Archive iSH_DB_Log_L5.zip     PNG File incident.png     Text File profiler.txt     PNG File queries.png     Text File startuplog.txt     File status.gz.bz2    
Operating System: ALL
Participants:

 Description   

We are facing a terrible query performance issue. There are only 5 million rows in the collection, and we have added an index on username; the queries we run are always against username. Both nscanned and nscannedObjects are 1, yet the query can take more than 100 ms.
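
For reference, a minimal sketch of the reported setup in the mongo shell (the collection and field names here are assumed for illustration, not taken from the ticket). With the index in place, explain("executionStats") should report a single key and a single document examined, which corresponds to nscanned/nscannedObjects of 1 in the 3.0 profiler output:

    // Hypothetical collection/field names; create the username index
    db.DimUser.createIndex({ username: 1 })

    // Run the lookup and inspect the execution statistics
    var stats = db.DimUser.find({ username: "someUser" })
                          .explain("executionStats").executionStats

    // Expect one key and one document examined; executionTimeMillis is what the
    // ticket reports as occasionally exceeding 100 ms
    printjson({
        totalKeysExamined: stats.totalKeysExamined,
        totalDocsExamined: stats.totalDocsExamined,
        executionTimeMillis: stats.executionTimeMillis
    })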



 Comments   
Comment by carl dong [ 10/Mar/16 ]

Sure, here is the log for 3 days.

Comment by Bruce Lucas (Inactive) [ 09/Mar/16 ]

Hi Carl,

Would you be able to also upload the mongod log showing the slow queries, so that we can correlate them with events in the diagnostic data?

Thanks,
Bruce

Comment by carl dong [ 09/Mar/16 ]

Hi Ramon,

We tested version 3.2.3. This time I observed that the issue happens every time, 3-4 seconds after the database flush action. After modifying the flush interval from 60 to 20 seconds, the slow queries are reduced but still exist.
I have uploaded the diagnostic files as attachments; please let us know if we could tune some parameters (maybe hidden ones).

Thanks ,

Carl Dong
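
(A sketch of the tuning step described above, assuming the "flush interval" refers to mongod's syncdelay setting, storage.syncPeriodSecs in the config file, which defaults to 60 seconds. This is an assumption for illustration, not confirmed in the ticket.)

    // Check the current flush interval
    db.adminCommand({ getParameter: 1, syncdelay: 1 })

    // Lower it to 20 seconds at runtime, as described in the comment above
    db.adminCommand({ setParameter: 1, syncdelay: 20 })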

Comment by Ramon Fernandez Marina [ 07/Mar/16 ]

Unfortunately I have no more information to provide at this point, sorry about that. I was not able to find the root cause with the provided data, nor was I able to reproduce this behavior on my end. Without any additional information that shows evidence of a bug in the server I'm afraid I have to close this ticket for the time being.

As I mentioned above, I'd recommend testing MongoDB 3.2.3. If the behavior persists please post in this ticket and I'll be happy to reopen it.

Regards,
Ramón.

Comment by JinHui [ 02/Mar/16 ]

As you can see from Carl's screenshot, all of the slow queries happened in batches, and most of them are update or insert commands; there is no key update in these queries. Any thoughts?

Comment by carl dong [ 02/Feb/16 ]

In fact, the issue happens frequently, as you can see in the Ops Manager profiler page.

Comment by Ramon Fernandez Marina [ 01/Feb/16 ]

Carl, since I haven't found a clear root cause I can't say for sure whether upgrading will help. Note however that MongoDB 3.2 makes WiredTiger the default storage engine, and a lot of work was put into making things faster and more stable – so if this turns out to be an unfortunate interaction with the storage engine then 3.2.1 (our latest stable release right now) should definitely help.

MongoDB 3.2 also makes it easier to collect diagnostics data, as such data is recorded in the background and there's no need to wait until a given problem happens again.

Comment by carl dong [ 28/Jan/16 ]

Hi Ramon,

We are now running MongoDB Enterprise 3.0.7, but the issue still exists, and is even worse.

If the diagnostic data in version 3.2.0 would help, we will plan an upgrade for that.

Thanks,

Carl Dong

Comment by Ramon Fernandez Marina [ 27/Jan/16 ]

Hi cainiaoiswo, apologies for the long delay in getting back to you. We were not able to identify any conclusive evidence of what may be causing the slow queries reported. The data you sent us didn't allow us to identify what resource may have been temporarily held and slowing the queries down.

I'll take another look at the uploaded data to see if we missed something. Am I right to assume you're still experiencing this issue? Or has a recent upgrade helped speed queries up?

Thanks,
Ramón.

Comment by JinHui [ 21/Oct/15 ]

Hi Ramon, is there any good news for me?

Comment by Ramon Fernandez Marina [ 09/Sep/15 ]

Apologies for the radio silence cainiaoiswo - yes, this issue is still open and is being investigated. The information collected so far is not yet conclusive; we'll post updates to this ticket as they happen.

Regards,
Ramón.

Comment by JinHui [ 13/Aug/15 ]

Please let me know if you are still working on this issue.

Comment by JinHui [ 31/Jul/15 ]

Hi Bruce, we've upgraded our DB to 3.0.5 but the issue still exists. By the way, you should use multiple threads to reproduce the issue; the number of slow queries during peak time is hundreds of times higher than at other times.

Comment by JinHui [ 17/Jul/15 ]

Hi Bruce, thanks for your update. I've attached more information in the thread.
For question 1: most of the slow queries are writes. Sometimes we also find a few slow read queries, but only a small number.
For question 2: I've turned off the profiler.
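
(A minimal sketch of the step described above: disabling the profiler on the affected database and confirming that it is off.)

    // Turn the profiler off for the current database
    db.setProfilingLevel(0)

    // Confirm the profiler is disabled; expect something like { "was" : 0, "slowms" : 100 }
    db.getProfilingStatus()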

Comment by Bruce Lucas (Inactive) [ 16/Jul/15 ]

Hi JinHui,

Thanks for providing that information, and sorry for the delay in responding.

That data shows one brief incident, at 20:20:20, where a dozen write operations (inserts and updates) all completed within about 1ms of each other, the longest of them having taken 640ms, suggesting that they had been held up accessing some common resource and then all completed quickly when that resource became available. It affects a couple of different collections, and as you have noted the issue often affects more than one database at the same time, suggesting that the common resource is in some lower layer of the system, such as the storage engine or storage itself.

In the data we do find some correlation between that one incident and some internal mongod counters. (I attached a graph that we are using for internal discussions about the issue showing that correlation.) However, a single brief incident is too little for drawing conclusions. I hate to have to ask you for more information, but would you be able to run the same data collection (serverStatus, iostat, and mongod log) for a longer period of time, maybe a few hours or a day, sufficient to cover a few incidents, so that we can see if they all follow the same pattern?

Also let me ask a couple questions:

  • are the slow operations you are concerned with always writes (inserts or updates)?
  • can you confirm that you have turned profiling off? I just want to rule out the possibility that it is having an impact.

Thanks,
Bruce

Comment by JinHui [ 15/Jul/15 ]

The issue still exists.
/* 5 */

{ "op" : "insert", "millis" : 830, "ts" : ISODate("2015-07-15T07:53:44.188Z") }

/* 6 */

{ "op" : "insert", "millis" : 560, "ts" : ISODate("2015-07-15T07:53:44.188Z") }

/* 7 */

{ "op" : "insert", "millis" : 534, "ts" : ISODate("2015-07-15T07:53:44.188Z") }

/* 8 */

{ "op" : "insert", "millis" : 281, "ts" : ISODate("2015-07-15T07:53:44.188Z") }

/* 9 */

{ "op" : "insert", "millis" : 694, "ts" : ISODate("2015-07-15T07:53:44.188Z") }

/* 10 */

{ "op" : "insert", "millis" : 730, "ts" : ISODate("2015-07-15T07:53:44.188Z") }

/* 11 */

{ "op" : "insert", "millis" : 720, "ts" : ISODate("2015-07-15T07:53:44.187Z") }

/* 12 */

{ "op" : "insert", "millis" : 428, "ts" : ISODate("2015-07-15T07:53:44.187Z") }

/* 13 */

{ "op" : "insert", "millis" : 451, "ts" : ISODate("2015-07-15T07:53:44.187Z") }

/* 14 */

{ "op" : "insert", "millis" : 201, "ts" : ISODate("2015-07-15T07:53:44.187Z") }

/* 15 */

{ "op" : "insert", "millis" : 216, "ts" : ISODate("2015-07-15T07:53:44.187Z") }

/* 16 */

{ "op" : "insert", "millis" : 898, "ts" : ISODate("2015-07-15T07:53:44.187Z") }

/* 17 */

{ "op" : "insert", "millis" : 482, "ts" : ISODate("2015-07-15T07:53:44.186Z") }

/* 18 */

{ "op" : "insert", "millis" : 919, "ts" : ISODate("2015-07-15T07:53:44.186Z") }

/* 19 */

{ "op" : "insert", "millis" : 464, "ts" : ISODate("2015-07-15T07:53:44.186Z") }

/* 20 */

{ "op" : "insert", "millis" : 509, "ts" : ISODate("2015-07-15T07:53:44.186Z") }

/* 21 */

{ "op" : "insert", "millis" : 234, "ts" : ISODate("2015-07-15T07:53:44.186Z") }

/* 22 */

{ "op" : "insert", "millis" : 823, "ts" : ISODate("2015-07-15T07:53:44.186Z") }

/* 23 */

{ "op" : "insert", "millis" : 264, "ts" : ISODate("2015-07-15T07:53:44.186Z") }

/* 24 */

{ "op" : "insert", "millis" : 766, "ts" : ISODate("2015-07-15T07:53:44.186Z") }

/* 25 */

{ "op" : "insert", "millis" : 315, "ts" : ISODate("2015-07-15T07:53:44.186Z") }

/* 26 */

{ "op" : "insert", "millis" : 458, "ts" : ISODate("2015-07-15T07:53:44.186Z") }

/* 27 */

{ "op" : "insert", "millis" : 677, "ts" : ISODate("2015-07-15T07:53:44.185Z") }

/* 28 */

{ "op" : "insert", "millis" : 274, "ts" : ISODate("2015-07-15T07:53:44.185Z") }

/* 29 */

{ "op" : "insert", "millis" : 444, "ts" : ISODate("2015-07-15T07:53:44.185Z") }

/* 30 */

{ "op" : "insert", "millis" : 660, "ts" : ISODate("2015-07-15T07:53:44.185Z") }

/* 31 */

{ "op" : "insert", "millis" : 739, "ts" : ISODate("2015-07-15T07:53:44.185Z") }
Comment by JinHui [ 05/Jul/15 ]

Still can't reproduce the issue?

Comment by JinHui [ 30/Jun/15 ]

Thanks for your updates. Here is some other information that may be helpful. We have two databases on this server. One is for normal business requests and contains the DimUser collection we mention in this thread. The other is for log requests and records about 200 rows/second. We find that the slow queries always occur in both databases at the same time.
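
(For reproduction purposes, a minimal sketch in the mongo shell approximating the logging workload described above, roughly 200 inserts per second. The database and collection names are hypothetical, not taken from the ticket.)

    // Hypothetical log database and collection
    var logdb = db.getSiblingDB("logdb")
    while (true) {
        var batch = []
        for (var i = 0; i < 200; i++) {
            batch.push({ ts: new Date(), msg: "sample log entry", seq: i })
        }
        logdb.events.insert(batch)   // insert one second's worth of rows in a batch
        sleep(1000)                  // mongo shell helper: pause for 1000 ms
    }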

Comment by Ramon Fernandez Marina [ 29/Jun/15 ]

Hi cainiaoiswo, we're still investigating this issue. The last batch of information you sent does show slow queries in the range of hundreds of milliseconds, but I haven't been able to find the cause; I've also tried reproducing this on my end, unsuccessfully. I'll let you know if I need any further information – thanks for your patience and understanding.

Regards,
Ramón.

Comment by JinHui [ 29/Jun/15 ]

Do you need any other Log or information from my side?

Comment by JinHui [ 24/Jun/15 ]

Hi Ramon, here is the log we caught last night.

Comment by Ramon Fernandez Marina [ 22/Jun/15 ]

cainiaoiswo, thanks for uploading the requested information. While I do see the long queries in the logs, I haven't been able to find the "smoking gun", so we'll need more information to track down the root cause of the issue. I'd like to ask you to run the following commands on the node where you're seeing the slow queries (which I presume is your primary):

mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep(200)}" > ss.log &
iostat -k -t -x 1 > iostat.log &

These will monitor your mongod instance and log information to the files ss.log and iostat.log for as long as they run. The hope is to record system information at the same time you're experiencing slow queries, so please continue to run these commands until you see some slow queries, after which you can interrupt the monitoring commands and upload the ss.log and iostat.log files so we can analyze them.

Thanks,
Ramón.

Comment by JinHui [ 18/Jun/15 ]

Is there any update for my issue?

Comment by JinHui [ 17/Jun/15 ]

Hi Ramon, here is the level 5 log. The PNG file shows what seems to be a slow query plan: the query plan was decided at 2015-06-18T00:34:01.000 and it kept running for 285 ms.

Comment by JinHui [ 17/Jun/15 ]

Here is the server log. As you can see in the file, starting at line 1000, lots of queries take more than a thousand ms.

Comment by JinHui [ 17/Jun/15 ]

The problem is getting worse. Lots of queries take more than 1000 ms. It seems our production environment is about to crash.

Comment by JinHui [ 17/Jun/15 ]

Hi Ramon, here is some log info you need.

Comment by JinHui [ 17/Jun/15 ]

Most of the time, the query time is OK; it only takes a few ms.

Comment by Ramon Fernandez Marina [ 17/Jun/15 ]

cainiaoiswo, in addition to Dan's question we'll need some more information to troubleshoot this issue.

  1. Would it be possible for you to run the following commands and send us the resulting logs?

    // Increase log level, disable profiler but log all queries
    use admin
    var old_log = db.setLogLevel(2)
    var old_profiler = db.setProfilingLevel(0,0)
     
    // Run the slow query
    use DimUser
    <the slow username query originally reported>
     
    // restore log and profiler values
    use admin
    db.runCommand({setParameter:1, logComponentVerbosity:old_log.was})
    db.setProfilingLevel(old_profiler.was, old_profiler.slowms)
    

    Please note that this will temporarily disable the profiler altogether.

  2. The output of the following commands may be useful as well:

    db.stats()
    db.<collection>.stats()
    


  3. The startup logs for this server

Thanks,
Ramón.

Comment by Daniel Pasette (Inactive) [ 17/Jun/15 ]

Even without knowing what else is happening on this server, that is an unusually long time for the query and document size. If you run the same query repeatedly, does the time to execute stay the same?
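
(A minimal sketch of the repeated-run check suggested above, in the mongo shell; the collection, field, and value are assumed for illustration.)

    // Run the same lookup ten times and print the wall-clock time for each run
    for (var i = 0; i < 10; i++) {
        var start = new Date()
        db.DimUser.find({ username: "someUser" }).toArray()
        print("run " + i + ": " + (new Date() - start) + " ms")
    }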

Comment by JinHui [ 17/Jun/15 ]

avgObjSize is 256 bytes.

Comment by Daniel Pasette (Inactive) [ 17/Jun/15 ]

Can you tell me how large the document you are querying for is?

Comment by JinHui [ 17/Jun/15 ]

Our platform is CentOS 6.5; MongoDB 3.0.3 with WiredTiger.
CPU: 2 x 2670 v2, 2.5 GHz
Memory: 64 GB
Hard drive: 2 TB, 7.2K RPM

Please let me know if you need any other information.
