[SERVER-28172] Renaming collection causes permanent performance issue on secondary Created: 02/Mar/17  Updated: 31/May/17  Resolved: 05/May/17

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.4.2
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Marc Tinkler Assignee: Mark Agarunov
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2017-03-02 at 11.32.19 AM.png     PNG File Screen Shot 2017-03-02 at 11.34.55 AM.png     PNG File Screen Shot 2017-03-02 at 9.34.42 PM.png     File bad-mongod.log.gz    
Issue Links:
Duplicate
duplicates WT-3115 Change the dhandle lock to a read/wri... Closed
Related
is related to WT-1228 Improve performance of WT_SESSION::op... Closed
Operating System: ALL
Steps To Reproduce:

Here's the rough pseudocode to reproduce:

import java.util.List;
import com.mongodb.BulkWriteOperation;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;

List<DBObject> rows = doBigAggregation(); // this aggregation uses disk, is in cursor mode, and returns roughly 35,000 rows
DBCollection temp = db.getCollection("schoolusage_temp");
temp.drop(); // just in case this process failed last night
BulkWriteOperation bulkwrite = temp.initializeOrderedBulkOperation();
for (DBObject row : rows) bulkwrite.insert(row); // in reality we insert in batches of 250
bulkwrite.execute();
temp.rename("schoolusage", true); // rename with dropTarget

Participants:

 Description   

Every night, we do an aggregation, followed by a bulk insert, followed by a rename with the "drop target option".

Without fail, roughly one second after the rename completes (the time it takes for the rename to propagate to the secondary), our secondary goes haywire and performance collapses. Performance does not recover unless we restart the secondary.

This behavior did not occur until we upgraded from 3.2.10 to 3.4.2.

Once the performance drops, every couple of seconds we see hundreds of very simple queries pile up; they all block and then complete simultaneously. It looks like some sort of concurrency issue.

Included are some graphs and logs that illustrate the problem.

Server restarted at 22:00 and aggregation happened at 00:00:

Same period in MMS:

Here are some very simple queries taking over 6 seconds to complete:

2017-03-02T16:02:32.034+0000 I COMMAND  [conn1328] command thinkmap.gamequestion command: find { find: "gamequestion", filter: { _id: 163258 }, projection: { workflow-writing: 0, votes: 0, ag: 0 }, limit: 1, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:740 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 2 }, timeAcquiringMicros: { r: 6288267 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 6356ms
2017-03-02T16:02:32.034+0000 I COMMAND  [conn1289] command thinkmap.gamequestion command: find { find: "gamequestion", filter: { _id: 300683 }, projection: { workflow-writing: 0, votes: 0, ag: 0 }, limit: 1, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:642 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 2 }, timeAcquiringMicros: { r: 6378585 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 6443ms
2017-03-02T16:02:32.034+0000 I COMMAND  [conn1166] command thinkmap.gamequestion command: find { find: "gamequestion", filter: { _id: 131553 }, projection: { workflow-writing: 0, votes: 0, ag: 0 }, limit: 1, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:526 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 2 }, timeAcquiringMicros: { r: 6555928 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 6623ms
2017-03-02T16:02:32.034+0000 I COMMAND  [conn1369] command thinkmap.gamequestion command: find { find: "gamequestion", filter: { _id: 212580 }, projection: { workflow-writing: 0, votes: 0, ag: 0 }, limit: 1, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:683 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 2 }, timeAcquiringMicros: { r: 6193103 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 6257ms



 Comments   
Comment by Marc Tinkler [ 05/May/17 ]

Thomas,

We've upgraded and everything is working again. Thanks!

Comment by Kelsey Schubert [ 04/May/17 ]

Hi tinkler@vocabulary.com,

We've also backported WT-3115 to MongoDB 3.2.13, which was released earlier this week. Would you please upgrade to the latest version of MongoDB 3.2 or 3.4 and let us know if the issue persists?

Thank you,
Thomas

Comment by Mark Agarunov [ 29/Mar/17 ]

Hello tinkler@vocabulary.com,

Thank you for providing the files. Looking over the contents, I believe the behavior you're seeing may be related to the issue described in WT-3115, which has been fixed in the latest release. If possible, please upgrade to the latest version (3.4.3) of MongoDB to see if this fixes the performance issues you are seeing.

Thanks,
Mark

Comment by Marc Tinkler [ 03/Mar/17 ]

Thanks for the clarification, Mark. I uploaded those logs.

Comment by Mark Agarunov [ 03/Mar/17 ]

Hello tinkler@vocabulary.com,

Thank you for providing the log. I've created a secure upload portal for you to use. However, please note that the diagnostic.data directory does not contain any of your data. It periodically collects the output of the following commands, which you are welcome to run yourself to examine the output:

  • serverStatus: db.serverStatus({tcmalloc: true})
  • replSetGetStatus: rs.status()
  • collStats for local.oplog.rs: db.getSiblingDB('local').oplog.rs.stats()
  • getCmdLineOpts: db.adminCommand({getCmdLineOpts: true})
  • buildInfo: db.adminCommand({buildInfo: true})
  • hostInfo: db.adminCommand({hostInfo: true})

Thanks,
Mark

Comment by Marc Tinkler [ 03/Mar/17 ]

Here is the portion of the log from right before the problem happened until reboot.

I can also provide the diagnostic.data, but it's over 150mb gzipped and I'd prefer to provide it to you securely. Please let me know how you'd like me to do that.

Comment by Marc Tinkler [ 03/Mar/17 ]

Here's another screenshot of MMS after it happened again. I will get you logs for this period and attach them to this issue.

This shows connections and operation execution times.

Comment by Marc Tinkler [ 03/Mar/17 ]

Hey Mark,

That's not exactly it. I don't know whether you can reproduce these conditions in-house, but under the circumstances below we can reproduce it very reliably. We are happy to host a screen share with you so you can see it happen in real time. Here's what you need to do (a rough driver-level sketch follows this list):

  • create a 3 member replica set with 1 primary, 1 secondary and 1 hidden secondary
  • ensure that thousands of concurrent queries (and aggregations) are hitting both the primary and the secondary, with clients connecting and disconnecting during the test
  • create a collection test1, insert 35,000 documents
  • create a collection test2, insert 35,000 documents
  • run db.test2.renameCollection('test1', true)
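
For concreteness, a rough sketch of the data side of these steps using the legacy Java driver (host, port, database name, and document shape are assumptions; the replica set topology and the concurrent query load from the first two bullets have to be arranged separately):

import com.mongodb.BasicDBObject;
import com.mongodb.BulkWriteOperation;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

public class RenameRepro {
    public static void main(String[] args) {
        // Sketch only: host, port, database name, and document shape are assumptions.
        MongoClient mongo = new MongoClient("localhost", 27017);
        DB db = mongo.getDB("test");

        // Seed test1 and test2 with 35,000 documents each.
        for (String name : new String[] { "test1", "test2" }) {
            DBCollection coll = db.getCollection(name);
            coll.drop();
            BulkWriteOperation bulk = coll.initializeOrderedBulkOperation();
            for (int i = 0; i < 35000; i++) {
                bulk.insert(new BasicDBObject("_id", i).append("payload", "x"));
            }
            bulk.execute();
        }

        // Rename test2 over test1 with dropTarget=true
        // (equivalent to db.test2.renameCollection('test1', true) in the shell).
        db.getCollection("test2").rename("test1", true);

        mongo.close();
    }
}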

I will get you the logs and diagnostics ASAP.

Comment by Mark Agarunov [ 02/Mar/17 ]

Hello tinkler@vocabulary.com,

Thank you for the report. Unfortunately, we're having difficulty reproducing the behavior you are describing. To attempt to reproduce it, I followed the steps below; please let me know if anything was missed:

  • Create a new collection test1, insert 35,000 documents.
  • Create a second collection, test2
  • Drop test1 and rename test2 to test1
  • Insert 35,000 documents into the renamed test1
  • Continue inserting documents into the collection on the primary
  • Repeatedly run find queries on the secondary against this collection.

Additionally, to get a better idea of what may be causing the issue, please provide the following:

  • The complete mongod logs from both the primary and the secondary while this issue is present.
  • Please archive (tar or zip) the $dbpath/diagnostic.data directory on the secondary and attach it to this ticket.

Thanks,
Mark

Comment by Marc Tinkler [ 02/Mar/17 ]

Here's another tidbit of information: it seems we are able to reproduce the same problem by simply dropping a collection. But it's hit-or-miss: sometimes it happens, and sometimes it doesn't.

Also, it does not seem to be related to the size of the collection we are dropping. We were able to reproduce it by dropping a collection with a single document.
