[SERVER-32890] Background index creation sometimes blocks the whole server for dozens of seconds (even with empty collections) Created: 25/Jan/18  Updated: 30/Oct/23  Resolved: 04/Dec/18

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.4.3
Fix Version/s: 3.4.6

Type: Bug Priority: Major - P3
Reporter: Adrien Jarthon Assignee: Eric Milkie
Resolution: Fixed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File ci.png     PNG File cpu.png     PNG File disk-io.png     PNG File ftdc.png     PNG File image-2018-01-25-14-25-38-568.png     PNG File image-2018-01-25-14-26-10-841.png     PNG File load.png     PNG File mem.png     Text File mongo_drop_log.txt     PNG File screenshot-1.png     PNG File screenshot-2.png     PNG File screenshot-3.png     PNG File screenshot-4.png     PNG File swap.png    
Issue Links:
Related
related to SERVER-39355 Collection drops can block the server... Closed
is related to SERVER-31704 Periodic drops in throughput during c... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Storage NYC 2018-12-17
Participants:
Case:

 Description   

OK, this is an issue we've been having regularly for months without being able to pinpoint the cause, so now that we have managed to collect more data I'm opening a ticket.

The problem is that, basically, some operation (we recently reproduced it with index creation, but we also got the same impact with a collection drop) simply blocks the whole server for 10-30 seconds for no apparent reason. CPU/RAM/IO are fine, and the query is not even reported as slow in the slow query log. It seems like the server is doing something internally that takes time (maybe eviction, checkpoint, or something else), and when this coincides with a query that needs a global lock (even for a short time), that lock can end up being held way longer than expected.

I captured everything I could, here is what I have:
Last night (when the load is low) I created about 200 indexes (in the background) on existing collections, one after the other. Some of them (about 10) blocked the DB for 10-30 seconds, and the rest didn't have any impact. It is not related to the index size: the big ones did not necessarily have a higher chance of locking, and the ones I caught were mostly small indexes.
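
Each index was created with the background option, along these lines (a minimal shell sketch; the collection and field names match the createIndexes commands visible in the log below):

// Build the index in the background so reads and writes can continue during the build
db.collection1.createIndex({ updated_at: 1 }, { background: true, name: "updated_at_1" })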

Here is the impact on the main application (database1):

And on another application using a different database (database2):

Here is an example of mongo log:

2018-01-24T23:16:25.005+0100 I -        [conn44444622]   Index Build (background): 3644200/3696664 98%
2018-01-24T23:16:25.701+0100 I INDEX    [conn44444622] build index done.  scanned 3696664 total records. 69 secs
2018-01-24T23:16:25.712+0100 I COMMAND  [conn44444622] command database1.$cmd command: createIndexes { createIndexes: "collection1", indexes: [ { key: { updated_at: 1 }, background: true, name: "updated_at_1" } ] } numYields:29161 reslen:113 locks:{ Global: { acquireCount: { r: 29163, w: 29163 } }, Database: { acquireCount: { w: 29163, W: 2 }, acquireWaitCount: { w: 2, W: 2 }, timeAcquiringMicros: { w: 548708, W: 11723 } }, Collection: { acquireCount: { w: 29162 } }, Metadata: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_query 69394ms
 
2018-01-24T23:16:25.774+0100 I COMMAND  [conn44444051] command database1.collection3 command: find { find: "collection3", filter: { source_id: { $in: [ ObjectId('50e160a0a90ffbe65700b488') ] } }, limit: 50, skip: 317700, sort: { created_at: 1 } } planSummary: IXSCAN { source_id: 1, created_at: -1 } keysExamined:317712 docsExamined:317712 cursorExhausted:1 numYields:2483 nreturned:12 reslen:30521 locks:{ Global: { acquireCount: { r: 4968 } }, Database: { acquireCount: { r: 2484 }, acquireWaitCount: { r: 41 }, timeAcquiringMicros: { r: 17054 } }, Collection: { acquireCount: { r: 2484 } } } protocol:op_query 537ms
 
2018-01-24T23:16:39.329+0100 I INDEX    [conn44444622] build index on: database1.collection2 properties: { v: 1, key: { updated_at: 1 }, name: "updated_at_1", ns: "database1.collection2", background: true }
 
2018-01-24T23:16:39.331+0100 I COMMAND  [conn44449316] command database2.collection1 command: find { find: "collection1", filter: { api_key: "xxxxxxxxxxxxxxxx" } } planSummary: IXSCAN { api_key: 1 } keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:4910 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_query 3359ms
 
/// A lot of queries which were blocked finished here
 
2018-01-24T23:16:42.004+0100 I -        [conn44444622]   Index Build (background): 213300/219823 97%
2018-01-24T23:16:42.068+0100 I INDEX    [conn44444622] build index done.  scanned 219823 total records. 2 secs
2018-01-24T23:16:42.069+0100 I COMMAND  [conn44444622] command database1.$cmd command: createIndexes { createIndexes: "collection2", indexes: [ { key: { updated_at: 1 }, background: true, name: "updated_at_1" } ] } numYields:1728 reslen:113 locks:{ Global: { acquireCount: { r: 1730, w: 1730 } }, Database: { acquireCount: { w: 1730, W: 2 }, acquireWaitCount: { w: 1, W: 1 }, timeAcquiringMicros: { w: 284059, W: 1071 } }, Collection: { acquireCount: { w: 1729 } }, Metadata: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_query 16005ms

Here we can see:

  1. An index build finished on database1.collection1
  2. An unrelated query
  3. 14 seconds without a single line (that's when the DB is locked)
  4. The next index build starts on database1.collection2
  5. A very simple query which took 3.3s because it was blocked, even though it's not even on the same database!
  6. Then I removed hundreds of log lines for queries in the same situation (basically all of them)
  7. Then we can see the end of the index creation on database1.collection2 which took 2 seconds

Here is how it looked in mongostat:

insert query update delete getmore command dirty  used flushes vsize  res   qrw   arw net_in net_out conn    set repl                time
     2  1421     48     *0      65   572|0  0.7% 80.0%       0  184G 139G   0|0   2|1   386k   7.82m 1147 dimelo  PRI Jan 24 23:16:00.984
    *0  1201     16     *0      25   358|0  0.7% 80.2%       0  184G 139G   0|0   1|1   277k   5.63m 1158 dimelo  PRI Jan 24 23:16:05.984
     4  1171     36     *0      49   384|0  0.7% 80.1%       0  184G 139G   0|0   1|1   296k   5.35m 1164 dimelo  PRI Jan 24 23:16:10.984
     7  1578     38     *0      56   492|0  0.8% 80.6%       1  184G 139G   0|0   0|1   401k   7.36m 1167 dimelo  PRI Jan 24 23:16:15.985
     9  2407     56      2      72   474|0  0.9% 81.4%       0  184G 139G   0|0   0|2   819k   11.4m 1167 dimelo  PRI Jan 24 23:16:20.984
    18  2132     76      7     108   602|0  0.9% 82.2%       0  184G 139G   0|0   0|0  1.08m   12.2m 1167 dimelo  PRI Jan 24 23:16:25.984
     1   343      9     *0      14   212|0  1.0% 82.2%       0  184G 139G   0|0  85|6   101k    941k 1220 dimelo  PRI Jan 24 23:16:30.984
    *0   330      4     *0       9   182|0  1.0% 82.2%       0  184G 139G  10|0 256|6  86.5k    766k 1321 dimelo  PRI Jan 24 23:16:35.984
    12  1184     36     *0      44   746|0  0.8% 80.3%       0  185G 139G   0|0   3|1   350k   9.67m 2015 dimelo  PRI Jan 24 23:16:40.984
    24  1712    114      2     133   792|0  0.5% 80.0%       0  185G 139G   0|0   3|1   696k   15.6m 2016 dimelo  PRI Jan 24 23:16:45.983

Here we can see the number of active reads rising to 256 (at which point further reads start queuing) and the number of queries dropping to 0 during the blocking time.

I also ran db.currentOp() every 5 seconds and saved the output to a file, to make sure I captured the exact state the requests were in during the lock. I can upload the complete output to your private portal if you need it, but here is the interesting part:

{
  "desc" : "conn44444622",
  "threadId" : "140407966885632",
  "connectionId" : 44444622,
  "client" : "10.10.11.31:39368",
  "active" : true,
  "opid" : -935082558,
  "secs_running" : 9,
  "microsecs_running" : NumberLong(9571622),
  "op" : "command",
  "ns" : "database1.$cmd",
  "query" : {
    "createIndexes" : "collection2",
    "indexes" : [
      {
        "key" : {
          "updated_at" : 1
        },
        "background" : true,
        "name" : "updated_at_1"
      }
    ]
  },
  "numYields" : 0,
  "locks" : {
    "Global" : "w",
    "Database" : "W"
  },
  "waitingForLock" : false,
  "lockStats" : {
    "Global" : {
      "acquireCount" : {
        "r" : NumberLong(1),
        "w" : NumberLong(1)
      }
    },
    "Database" : {
      "acquireCount" : {
        "W" : NumberLong(1)
      }
    }
  }
},

At about 23:16:36+0100 we can see the query is still holding an exclusive database write lock, for more than 9 seconds now, without a single yield. It seems the global lock is only an intent write lock, so I'm not sure whether that should allow concurrency, but queries on other databases seemed blocked too, as you can see in the mongo log.
Of course, all the other queries visible in the currentOp output show "waitingForLock" : true, for between 0 and 9 seconds.
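
(For reference, snapshots like the one above can be captured with a small shell loop along these lines; this is a sketch of the approach, not the exact script used, and the file names are illustrative:)

// Run with: mongo --quiet host:27017/admin capture_currentop.js > currentop.log
while (true) {
  print("=== " + new Date().toISOString() + " ===");
  printjson(db.currentOp());   // dump all in-progress operations, including lock info
  sleep(5000);                 // wait 5 seconds between snapshots
}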

Five seconds later (about 23:16:41) we can see that the index build has now actually started in the background; it is no longer holding an exclusive write lock and is yielding this time:

{
  "desc" : "conn44444622",
  "threadId" : "140407966885632",
  "connectionId" : 44444622,
  "client" : "10.10.11.31:39368",
  "active" : true,
  "opid" : -935082558,
  "secs_running" : 14,
  "microsecs_running" : NumberLong(14732154),
  "op" : "command",
  "ns" : "database1.$cmd",
  "query" : {
    "createIndexes" : "collection2",
    "indexes" : [
      {
        "key" : {
          "updated_at" : 1
        },
        "background" : true,
        "name" : "updated_at_1"
      }
    ]
  },
  "msg" : "Index Build (background) Index Build (background): 107726/219823 49%",
  "progress" : {
    "done" : 107727,
    "total" : 219823
  },
  "numYields" : 846,
  "locks" : {
    "Global" : "w",
    "Database" : "w",
    "Collection" : "w"
  },
  "waitingForLock" : false,
  "lockStats" : {
    "Global" : {
      "acquireCount" : {
        "r" : NumberLong(847),
        "w" : NumberLong(847)
      }
    },
    "Database" : {
      "acquireCount" : {
        "w" : NumberLong(847),
        "W" : NumberLong(1)
      },
      "acquireWaitCount" : {
        "w" : NumberLong(1)
      },
      "timeAcquiringMicros" : {
        "w" : NumberLong(284059)
      }
    },
    "Collection" : {
      "acquireCount" : {
        "w" : NumberLong(847)
      }
    }
  }
},

Of course this 15s delay (sometimes 30s) can't be explained by slow IO, as we're running on a healthy RAID 10 of 4 × 600GB SSDs. The IO usage on the machine is between 0 and 1%. We do have a lot of collections though, so it may be related to the work WiredTiger has to do when creating/removing namespaces, as we seem to encounter the issue on index/collection creation/deletion.

I have extracted the diagnostic.data dir and can upload it to your portal if you want, of course.
For the record, we had the same issue again today at 12:30 (+0100), where a new index created in the background on an empty collection caused a 16s global lock:

2018-01-25T12:30:11.582+0100 I COMMAND  [conn44523028] command database1 ...(last working query)... protocol:op_query 111ms
 
2018-01-25T12:30:27.255+0100 I INDEX    [conn44531041] build index on: database1.collection4 properties: { v: 1, key: { installation_id: 1, created_at: 1 }, name: "installation_id_1_created_at_1", ns: "database1.collection4" }
2018-01-25T12:30:27.256+0100 I INDEX    [conn44531041]   building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2018-01-25T12:30:27.265+0100 I INDEX    [conn44531041] build index done.  scanned 0 total records. 0 secs



 Comments   
Comment by Eric Milkie [ 01/Feb/19 ]

Hi Adrien,

I filed SERVER-39355 to track this issue – I'd like to move the new discussion there.  Sorry I didn't do this sooner.

Comment by Adrien Jarthon [ 01/Feb/19 ]

Hi again, any chance of getting this looked at? I still have the diagnostic.data ready to be uploaded.

Comment by Adrien Jarthon [ 17/Dec/18 ]

Hi, sorry, but we've just had another occurrence today (still running 3.4.13), so there's still an issue here. We've modified our collection-dropping code to sleep 10 seconds between each deletion (to give mongo some time to recover after the "short" global lock and not kill the platform), but unfortunately this wasn't enough and it still killed the overall performance:

After investigation I found that this was caused by some collection deletions. I tried to upload the diagnostic.data, but the portal specified earlier doesn't accept files any more. I can upload it if you provide another portal.

Here is the log from the drop queries: mongo_drop_log.txt. We can see that they are spaced by 10s (plus the drop duration) and that the drops take A LOT of time (all these collections were empty or had 5 records at most). They did have some indexes though, which are not shown here but probably had to be destroyed at the same time. I don't know if it's a checkpoint-related global lock issue again, but it's definitely still not possible to drop collections on a big 3.4.13 mongod without killing it. For the record we have ~40k namespaces; this has not changed much since the db.stats output I reported above.
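
(For context, the deletion logic is essentially the following; this is a mongo shell sketch of what our application code does, with placeholder collection names:)

// Drop each obsolete collection, pausing 10 seconds between drops to give the
// server time to recover from the global lock each drop can trigger.
["obsolete_collection_a", "obsolete_collection_b"].forEach(function (name) {
  db.getCollection(name).drop();
  sleep(10 * 1000);
});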

And before you say this is probably fixed in a more recent version, we'll need better proof than last time considering the high risk of upgrading...

Comment by Eric Milkie [ 04/Dec/18 ]

bigbourin@gmail.com, after some investigation, I believe the tickets that improved your issue were WT-3207 in 3.4.4 and WT-3362 in 3.4.6. Both tickets improved behavior regarding the WiredTiger schema lock and long running checkpoints, and one of the tickets specifically addressed table drops during checkpoints.

I agree with you that our release notes for minor versions have in the past mixed performance enhancements, bug fixes, and cosmetic changes together without regard to severity. We are actively trying to improve our release notes in this regard, and strive to do better in the future.

Comment by Adrien Jarthon [ 06/Apr/18 ]

Hello, for the record: due to the lack of response and the urgency of the issue for us, we tried upgrading to MongoDB 3.4.13 last month, even though we couldn't find a clear changelog entry about this issue, and it seems to be much better now. We've created/deleted a couple of indexes without any long locks since then. The checkpoint time was reduced, but not by much, so the improvement is not only there; it seems something did improve in the way the lock is acquired or held. Though I must say we both could have saved some time if the changelog had been clearer about the important bug fixes and performance improvements. Even you were not able to tell me that this was fixed (or greatly improved) in the latest 3.4.x version.

Comment by Adrien Jarthon [ 21/Feb/18 ]

Ah yes, thanks, I was able to compute the average checkpoint time using "transaction checkpoint total time (msecs)" / "transaction checkpoints".
It's between 25 and 35s for our 4 servers, though I see durations going up to 130s during the day for ~37000 tables.
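
In the shell, that computation looks like this (a minimal sketch using the two counters above):

var t = db.serverStatus().wiredTiger.transaction;
var avgCheckpointMs = t["transaction checkpoint total time (msecs)"] / t["transaction checkpoints"];
print("average checkpoint time: " + avgCheckpointMs + " ms");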

On our staging server (almost no traffic) I see an average checkpoint time of 500ms for ~5500 tables, which means it doesn't scale linearly with the table count, or at least it scales with something else too; does it depend on ops?

On another server for my personal project (running 3.4.10) I see an average checkpoint time of 1.5s for ~50 tables, which is an interesting number; although it's a smaller machine, it's still powered by SSDs and neither CPU nor IO is saturated. I hope checkpoint performance did not degrade that much between 3.4.3 and 3.4.10 ☺

Comment by Bruce Lucas (Inactive) [ 21/Feb/18 ]

Hi Adrien,

You can monitor the following serverStatus metric to get an idea of how long checkpoints are taking:

db.serverStatus()['wiredTiger']['transaction']['transaction checkpoint most recent time (msecs)']
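
For example, to sample it periodically from the shell (a minimal sketch):

while (true) {
  var ms = db.serverStatus().wiredTiger.transaction["transaction checkpoint most recent time (msecs)"];
  print(new Date().toISOString() + "  most recent checkpoint: " + ms + " ms");
  sleep(10 * 1000);  // sample every 10 seconds
}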

Bruce

Comment by Adrien Jarthon [ 21/Feb/18 ]

Ok great, glad it helped

Is there any way to measure or estimate the checkpoint duration without risking blocking the production DB?
That way I could measure the impact of a restart or of deleting a couple of collections.

Comment by Bruce Lucas (Inactive) [ 17/Feb/18 ]

Hi Adrien,

Yes, your hypothesis about the cause was good, and I should have mentioned that it helped in pinpointing the issue. We'll pass your suggestions on to the development team.

Regarding the impact of the issue in your case: yes, I suspect it could have to do with the relatively large number of tables (collections and indexes) you have, as one phase of the checkpoint reviews this list while holding the schema lock. If you are able to reasonably reduce this number you may see shorter stalls; please let us know if that is the case.
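
(One rough way to track the total table count across databases from the shell, as a sketch; each collection and each index corresponds approximately to one WiredTiger table:)

var total = 0;
db.adminCommand({ listDatabases: 1 }).databases.forEach(function (d) {
  var s = db.getSiblingDB(d.name).stats();
  total += s.collections + s.indexes;   // approximate WT table count for this database
});
print("approximate total WiredTiger tables: " + total);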

Thanks,
Bruce

Comment by Adrien Jarthon [ 16/Feb/18 ]

Thanks for this great explanation, Bruce; it confirms what I had only guessed and reassures me that there are things that can be done to improve this (since it's passive wait time). First of all, are you aware of any change after 3.4.3 that already improves this? I couldn't find any in the changelog, but maybe you know better.

I guess this is more complicated than that, of course, but just in case, here are my suggestions. This is a classic case of two different locks where the bigger one is held while waiting for the smaller one; I can think of a couple of options:

  • Release the db lock while waiting for the WT schema lock; of course this is only possible if the operations done up to that point are safe to leave unlocked.
  • Acquire the WT schema lock first? If that lock doesn't block normal DB operations, it seems better to wait for it first and only then acquire the DB lock, so we're sure to hold the DB lock for a shorter time.

It doesn't seem like a lot of people are complaining about this issue, even though it doesn't look very specific to our use case. Any idea what makes us more prone to hitting it? Maybe the checkpoint process is particularly long for us? (I don't know what a normal value is; it seems to be running ~40% of the time for us.) Is it because we have a lot of namespaces? Would reducing that number help? Any other tips to reduce the checkpoint time?

Comment by Bruce Lucas (Inactive) [ 16/Feb/18 ]

Hi Adrien,

Thanks for the report. I wanted to update you with some progress on analyzing this issue. During an index build we

  • take the db W (MODE_X) lock (whether foreground or background index build)
  • among other things, create the WT table for the index (in MultiIndexBlock::init)
    • this in turn requires taking the WT schema lock
  • then, if it is a background index build, downgrade to the db w (MODE_IX) lock for the actual index data creation

It appears we are getting hung up creating the WT table while holding the db W lock during checkpoints because the WT schema lock is held for an extended period:

  • the createIndex chart shows the individual index builds
  • problematic index builds C-D, E-F, G-H occur during a checkpoint, and the "schema lock application thread wait time" shows we are getting held up waiting for the WT schema lock
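
(The "schema lock application thread wait time" statistic mentioned above can also be read directly from serverStatus; a quick check, assuming the field names used by this WiredTiger version:)

var l = db.serverStatus().wiredTiger.lock;
print("schema lock application thread wait time (usecs): " +
      l["schema lock application thread wait time (usecs)"]);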

We're still considering next steps on this issue and will let you know when we have an update.

Bruce

Comment by Adrien Jarthon [ 13/Feb/18 ]

Hello Mark, hmm, this ticket seems to say it blocks the secondary when replicating the drop, but in our case it also happens on the primary and not only when a collection/index is dropped. The first and most detailed report I put in this ticket only talks about index creation (no drop) and it blocked the PRIMARY, so it's not the same thing IMO.

Comment by Mark Agarunov [ 13/Feb/18 ]

Hello bigbourin@gmail.com,

Thank you for providing the additional data and information. After digging into this further, I believe this is an instance of SERVER-21307, where any operation that affects collection or index metadata will block if a drop is issued at the same time, so I've closed this ticket as a duplicate. Unfortunately this does not appear to be marked as fixed yet, so please follow SERVER-21307 for updates on this issue.

Thanks,
Mark

Comment by Adrien Jarthon [ 12/Feb/18 ]

And finally, to illustrate my point about this issue also being present in other cases: in my last diagnostic.data, a couple of hours before the collection drop (at 15:35:30 UTC+01), you can find a short 8-second lock blocking the entire server again, this time for no apparent reason at all. There is no query holding a global lock visible in the slow query log, and the first queries to finish after the lock are benign single-record lookups with a perfect index; they don't even show a lock wait time:

2018-02-09T15:35:30.995+0100 I COMMAND  [conn46899743] command database1.$cmd command: update { update: "content_sources", updates: [ { q: { _id: ObjectId('53737c95a90ffb85bd00000d'), _type: "Answers::ContentSource" }, u: { $set: { document_locked: null } }, multi: false, upsert: false } ], ordered: true } numYields:0 reslen:119 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 }, acquireWaitCount: { w: 1 }, timeAcquiringMicros: { w: 282669 } }, Collection: { acquireCount: { w: 1 } }, Metadata: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_query 287ms
2018-02-09T15:35:30.995+0100 I COMMAND  [conn46899121] command database1.activities command: insert { insert: "red#activities", documents: [ { _id: ObjectId('5a7db1b2920026144d5ec9bd'), start: new Date(1518186670600), end: new Date(1518187055186), user_id: ObjectId('57bad0a714bf8aa377243407'), updated_at: new Date(1518186930671), created_at: new Date(1518186930671) } ], ordered: true } ninserted:1 keysInserted:4 numYields:0 reslen:104 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 }, acquireWaitCount: { w: 1 }, timeAcquiringMicros: { w: 321338 } }, Collection: { acquireCount: { w: 1 } }, Metadata: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_query 321ms
 
// 8 seconds gap
 
2018-02-09T15:35:38.820+0100 I COMMAND  [conn46899095] command database2.applications command: find { find: "applications", filter: { api_key: "33exxxxxxxxxx724" } } planSummary: IXSCAN { api_key: 1 } keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:4674 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 7659ms
2018-02-09T15:35:38.820+0100 I COMMAND  [conn46896849] command database2.applications command: find { find: "applications", filter: { api_key: "d64xxxxxxxxxx2b6" } } planSummary: IXSCAN { api_key: 1 } keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:8302 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 7461ms
2018-02-09T15:35:38.820+0100 I COMMAND  [conn46896799] command database2.applications command: find { find: "applications", filter: { api_key: "d28xxxxxxxxxx6b4" } } planSummary: IXSCAN { api_key: 1 } keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:8433 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 7235ms
2018-02-09T15:35:38.820+0100 I COMMAND  [conn46898883] command database1.contents command: count { count: "contents", query: { thread_id: ObjectId('5a799a2b14bf8a4444d87ef6'), hidden: { $ne: true } } } planSummary: IXSCAN { thread_id: 1, created_at: 1, _id: 1 } keysExamined:15 docsExamined:15 numYields:1 reslen:44 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 7520ms

We have these kinds of locks about 10 times a day.

For the record, we tried restarting the secondaries last week to see if it would do any good with regard to this issue, and so far it looks like the secondaries have not experienced it again. Since the restart, the number of replication updates on the secondary is much more stable, whereas the number of updates on the primary didn't change:

Not sure if that's of any help, but it at least confirms it's not a machine capacity issue, and the mongo server seems to take some time (or specific actions?) before entering this state. We do not have a precise point at which this problem started, of course; the locks just grew longer and more frequent over time.

We'll probably try stepping down the primary to restart it too, if it can at least buy a couple of days/months without slowdowns for our clients. Would you like us to do or check anything specific when we do?

Comment by Adrien Jarthon [ 12/Feb/18 ]

It's worth noting that we have a lot of namespaces (collections + indexes); I heard at MongoDB Europe 2017 that some improvements have been made / are in the works in this area ☺

dimelo:PRIMARY> db.stats(1024*1024*1024)
{
	"db" : "database1",
	"collections" : 6864,
	"views" : 0,
	"objects" : 807513628,
	"avgObjSize" : 722.1624915313504,
	"dataSize" : 543.1063971873373,
	"storageSize" : 222.99949645996094,
	"numExtents" : 0,
	"indexes" : 28997,
	"indexSize" : 111.41072845458984,
	"ok" : 1
}

Comment by Adrien Jarthon [ 12/Feb/18 ]

Sure, I've uploaded another diagnostic.data. In this one you can find multiple occurrences; for example, last Friday (09 Feb) I deleted 4 EMPTY collections: 2 of them went smoothly as expected, and 2 of them blocked not only the DB but the whole mongo server, for 14 sec and 35 sec!!

Here are the two slow query log lines:

/home/log/mongo/2018/02/09$ zcat mongodb.log.gz | fgrep "2018-02-09T16" | fgrep drop
2018-02-09T16:36:13.696+0100 I COMMAND  [conn46911758] command database.fs.chunks appName: "MongoDB Shell" command: drop { drop: "fs.chunks" } numYields:0 reslen:87 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 1, W: 1 }, acquireWaitCount: { W: 1 }, timeAcquiringMicros: { W: 847 } }, Metadata: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_command 14367ms
2018-02-09T16:38:52.681+0100 I COMMAND  [conn46911758] command database.multi_templates appName: "MongoDB Shell" command: drop { drop: "multi_templates" } numYields:0 reslen:93 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 1, W: 1 }, acquireWaitCount: { W: 1 }, timeAcquiringMicros: { W: 1659 } }, Metadata: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_command 35316ms

I uploaded the slow query log for the whole hour on the portal.

Here is how it looks in our web requests queuing graph (time is UTC+1):

Here is the impact on another app in a different DB:

Here is how it looked in my mongo shell:

> db.fs.chunks.count()
0
> db.fs.chunks.drop()
// 14 sec wait
true
> db.multi_templates.count()
0
> db.multi_templates.drop()
// 35 sec wait
true

Comment by Mark Agarunov [ 08/Feb/18 ]

Hello bigbourin@gmail.com,

Thank you for the additional information. We are still investigating this, but if you could provide the logs and/or diagnostic data from the other instances of this happening outside of index builds, it may help correlate what is common between them.

Thanks,
Mark

Comment by Adrien Jarthon [ 08/Feb/18 ]

For the record, we're seeing locks like this not only during index creation but also during collection creation, deletion, and probably other operations. I guess that once the conditions are present, any operation requiring a global or db lock can trigger this issue.

Comment by Adrien Jarthon [ 29/Jan/18 ]

Ok thanks, what's the D marker?
Indeed, at E and F only the lag increases; I guess this means the secondary was blocked this time but the primary was OK?

Comment by Mark Agarunov [ 26/Jan/18 ]

Hello bigbourin@gmail.com,

Thank you for providing this data. While we are still investigating this, I've found some correlations in the diagnostic data. At C the createIndex command is issued; soon after, latency and queued reads spike, and we stop getting FTDC for a bit while the server is blocked. E to F doesn't appear to be the same, as only the lag increases while the other metrics stay flat. I will provide further updates as the investigation continues.

Thanks,
Mark

Comment by Adrien Jarthon [ 25/Jan/18 ]

Done, I've uploaded the diagnostic.data dir, the mongo log for the day (24/01) and my file containing the raw output of db.currentOp every 5 seconds.

Comment by Kelsey Schubert [ 25/Jan/18 ]

Hi bigbourin@gmail.com,

I've created a secure portal for you to attach files related to this ticket. Would you please upload the diagnostic.data as well as the complete logs covering the same period?

Thank you,
Kelsey

Comment by Adrien Jarthon [ 25/Jan/18 ]

Here are a couple more charts of the machine resources yesterday evening:



