[SERVER-30710] Several seconds for indexed query - global lock? Created: 17/Aug/17  Updated: 09/Oct/17  Resolved: 07/Sep/17

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Vlad Zloteanu Assignee: Dmitry Agranat
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-30707 server crash: invariant() failure Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

I find, in the slow query logs, queries that make no sense. An excerpt is below:

2017-08-17T09:32:03.964+0200 I COMMAND  [conn8550] query XX.statistic query: { date: new Date(1502917200000), monitor: [ "monitor|general" ] } planSummary: IXSCAN { monitor: 1, date: 1 } ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 writeConflicts:0 numYields:2 nreturned:1 reslen:155 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, Collection: { acquireCount: { r: 3 } } } 23606ms
2017-08-17T09:32:03.964+0200 I COMMAND  [conn8519] query XX.trigger query: { token: "live/dynamic_subscription/live", status: true, retired: false, ref: 17 } planSummary: IXSCAN { token: 1 } ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 writeConflicts:0 numYields:2 nreturned:1 reslen:13278 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, Collection: { acquireCount: { r: 3 } } } 33632ms
2017-08-17T09:32:03.964+0200 I COMMAND  [conn8493] query XX.trigger query: { token: "live/product_recommender/live", status: true, retired: false, ref: 20 } planSummary: IXSCAN { token: 1 } ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 writeConflicts:0 numYields:2 nreturned:1 reslen:22196 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, Collection: { acquireCount: { r: 3 } } } 29860ms
2017-08-17T09:32:03.970+0200 I COMMAND  [conn8518] query XXX.statistic query: { date: new Date(1502917200000), monitor: [ "monitor|general" ] } planSummary: IXSCAN { monitor: 1, date: 1 } ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 writeConflicts:0 numYields:2 nreturned:1 reslen:155 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, Collection: { acquireCount: { r: 3 } } } 20648ms

20 seconds for a query that hits an index?
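
For reference, I can also capture these with the database profiler instead of the text log if that helps; a rough sketch with an arbitrary 5-second threshold, run against one of the tenant databases (XX stands for the redacted database name above):

# Capture operations slower than 5000 ms in XX.system.profile
mongo XX --eval 'db.setProfilingLevel(1, 5000)'
# List the most recent captured slow operations
mongo XX --eval 'db.system.profile.find({ millis: { $gt: 5000 } }).sort({ ts: -1 }).limit(10).forEach(printjson)'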

Sometimes I also get queries on an ID:

2017-08-17T10:00:19.909+0200 I COMMAND  [conn10311] query XX.dlv_cs_similar query: { _id: ObjectId('59954d0fcaf95bd83e8b51f6') } planSummary: IDHACK ntoskip:0 nscanned:0 nscannedObjects:0 idhack:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } 4701ms

It happens suddenly, every few minutes: 20-30 queries are affected, then nothing. It looks as if, from time to time, a global lock kicks in.
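
The next time it happens, I plan to capture the in-progress operations and the global lock state with something like this (a rough sketch; the 10-second threshold is arbitrary):

# Operations that have been running for more than 10 seconds
mongo --eval 'db.currentOp({ secs_running: { $gte: 10 } }).inprog.forEach(printjson)'
# Global lock queue lengths and active clients
mongo --eval 'printjson(db.serverStatus().globalLock)'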

More details about my setup:

  • System: Debian 9
  • MongoDB version: 3.0.15
  • 2 SSD disks in RAID 0
  • MongoDB RAM usage: 13.4 GB / 128 GB
  • CPU usage: ~20%

mongostat does not show anything suspicious:

insert  query update delete getmore command % dirty % used flushes vsize   res qr|qw ar|aw netIn netOut conn      set repl     time
    12  3781    333     96     174   525|0     1.6   18.5       0 16.7G 15.5G   0|1  21|9  641k    12m  155 rplset04  PRI 09:54:14
    12  4344    380    105     188   295|0     1.6   18.5       0 16.7G 15.5G   0|0 23|12  708k    11m  137 rplset04  PRI 09:54:15
    10  3221    319    133     163   241|0     1.7   18.5       0 16.7G 15.5G   0|0  23|7  543k    14m  128 rplset04  PRI 09:54:16
    18  2915    319    114     172    90|0     1.7   18.5       0 16.7G 15.5G   0|0 22|10  511k    17m  128 rplset04  PRI 09:54:17

One primary, one replica, one voter.

For multitenancy, I am using one database per client (600 databases). They take up 185 GB on disk.
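
(The count and size can be double-checked from the shell with listDatabases; a rough sketch, sizes are reported in bytes:)

mongo --eval 'var res = db.adminCommand({ listDatabases: 1 }); print("databases: " + res.databases.length); print("total size: " + (res.totalSize / 1024 / 1024 / 1024).toFixed(1) + " GB");'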



 Comments   
Comment by Dmitry Agranat [ 07/Sep/17 ]

Hi Vlad,

Thank you for providing the requested information.

I have reviewed the uploaded data. During the reported time frame, the operation rate is between ~18K and ~24K, and for 99.9% of the time the operation response time is ~240ms. There are 8 occurrences where a few queries take ~20sec. Unfortunately, the data you uploaded is not sufficient to determine the cause of these occurrences. This does not mean that there is no problem; it just means that, in 3.0.15, the diagnostic data currently available does not expose the potential cause.

As mentioned earlier by Thomas, I recommend upgrading the cluster to one of our latest versions, preferably 3.4.7. Apart from many significant improvements since 3.0.x versions, we have also significantly improved our diagnostics capabilities, which would help us to better understand the cause of these occurrences.

If this issue recurs after upgrading your cluster, please open a new case and attach an archive of the diagnostic.data directory in your $dbpath along with the complete mongod log files.
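
For example (a rough sketch, assuming the default dbPath /var/lib/mongodb and log path /var/log/mongodb; adjust both to your configuration):

# Archive the diagnostic data and the mongod log files for upload
tar czf diagnostic-data.tar.gz -C /var/lib/mongodb diagnostic.data
tar czf mongod-logs.tar.gz /var/log/mongodb/mongod.log*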

Thank you,
Dima

Comment by Vlad Zloteanu [ 29/Aug/17 ]

Hello @Thomas Schubert,

Did you get the chance to look into the logs I uploaded?

Thanks,
Vlad

Comment by Vlad Zloteanu [ 22/Aug/17 ]

Thank you for your reply, Thomas.
I have attached the requested logs for 2 servers:

  • a single-instance mongod, handling 1-5k updates per second;
  • the primary of a replica set, handling 200-500 updates per second.

Comment by Kelsey Schubert [ 22/Aug/17 ]

Hi vladzloteanu,

Sorry, I missed that you were running MongoDB 3.0, which does not capture diagnostic.data. As Ramón mentioned in SERVER-30707, there have been significant improvements to WiredTiger since the release of MongoDB 3.0, and, after reviewing your logs, I believe that it is likely that upgrading will resolve this issue as well. To confirm my understanding of this issue, I'd like some additional information.

Would you please run the following script for an hour?

delay=1
# Dump serverStatus (including tcmalloc stats) to ss.log once per interval
mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:true}))); sleep(1000*${delay:?})}" >ss.log &
# Record per-device disk I/O statistics at the same interval
iostat -k -t -x ${delay:?} >iostat.log &
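
Each line of ss.log will be one serverStatus document, so you can also skim it yourself, for example with jq (assuming jq is installed; the fields below are standard serverStatus fields):

# Timestamp plus global lock queue lengths, one line per sample
jq -r '[.localTime, .globalLock.currentQueue.readers, .globalLock.currentQueue.writers] | @tsv' ss.log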

Afterwards, please provide both generated files, as well as the mongod.log files covering the time period.

Thank you,
Thomas

Comment by Vlad Zloteanu [ 17/Aug/17 ]

I have uploaded the logs.
Please disregard everything that happened before 2017-08-17T19:00:00, as some mongorestores were running.

Here is an example of a line:

2017-08-17T21:19:49.837+0200 I COMMAND  [conn38102] query db_a***o.buffer query: { _id: ObjectId('5995ec49d5bc6d9b2c8b4dfe') } planSummary: IDHACK ntoskip:0 nscanned:0 nscannedObjects:0 idhack:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } 12321ms

Here's another one:

2017-08-17T21:24:10.617+0200 I COMMAND  [conn39247] query db_c***o.push_dlv_item query: { trigger.keyword: "cart_saver", receiver.osuserid: "null", delivery.created_at: { $gte: new Date(1502393033000) } } planSummary: IXSCAN { trigger.keyword: 1, receiver.osuserid: 1, delivery.created_at: 1 } ntoskip:0 nscanned:0 nscannedObjects:0 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } 16736ms

I can also give you access to the datadog monitoring account.

Please let me know if you need anything else.

Comment by Vlad Zloteanu [ 17/Aug/17 ]

Hello Thomas,
I have no diagnostic.data folder. My current mongo version is 3.0.15.
Should I have such a folder?

Comment by Kelsey Schubert [ 17/Aug/17 ]

Hi vladzloteanu,

Thanks for the report. So we can investigate, would you please provide an archive of the diagnostic.data directory in your $dbpath and the complete mongod log files for the affected node?

I've created a secure upload portal for you to use. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time.

Thank you,
Thomas

Comment by Vlad Zloteanu [ 17/Aug/17 ]

Also, as there are 600+ databases, the mongod process has around 62k open file descriptors.
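
This can be verified with something like the following (assuming a single mongod process on the host):

# Count open file descriptors of the mongod process and show its limit
ls /proc/$(pidof mongod)/fd | wc -l
grep "open files" /proc/$(pidof mongod)/limits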

Comment by Vlad Zloteanu [ 17/Aug/17 ]

I am available for any other information necessary.
I can provide the newrelic and datadog monitoring.

These are the same servers as in SERVER-30707.
