Core Server / SERVER-28172

Renaming collection causes permanent performance issue on secondary

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Duplicate
    • Affects Version/s: 3.4.2
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Operating System:
      ALL
    • Steps To Reproduce:

      Here's the rough pseudocode to reproduce:

      // this aggregation uses disk, is in cursor mode, and returns roughly 35,000 rows
      List<DBObject> rows = doBigAggregation();
      DBCollection temp = db.getCollection("schoolusage_temp");
      temp.drop(); // just in case this process failed last night
      BulkWriteOperation bulkwrite = temp.initializeOrderedBulkOperation();
      for (DBObject row : rows) bulkwrite.insert(row); // in reality we insert in batches of 250
      bulkwrite.execute();
      temp.rename("schoolusage", true); // true = drop the target collection if it exists
      


      Description

      Every night, we do an aggregation, followed by a bulk insert, followed by a rename with the "drop target" option.

      Without fail, roughly one second (the time it takes for the rename to propagate to the secondary) after the rename completes, our secondary goes completely haywire and performance goes out the window. Performance does not recover unless we restart the secondary.

      This behavior did not occur until we upgraded from 3.2.10 to 3.4.2.

      Once performance drops, every couple of seconds we see hundreds of very simple queries pile up. They all block and then complete simultaneously. It looks as though it's some sort of concurrency issue.
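That "block, then complete simultaneously" pattern can be confirmed directly from the mongod log: if the stall is real, many slow-query lines will share the exact same completion timestamp. A minimal sketch (not part of this ticket; the class and method names are hypothetical) that groups slow-query log lines by their completion instant:

```java
import java.util.*;
import java.util.regex.*;

// Hypothetical helper: count how many slow-query log lines finish at the
// same instant. mongod 3.4 log lines begin with an ISO-8601 timestamp,
// e.g. "2017-03-02T16:02:32.034+0000", which is the op's completion time.
public class StallDetector {

    private static final Pattern TS = Pattern.compile(
            "^(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\.\\d{3})");

    // Map each completion timestamp (to the millisecond) to the number of
    // slow-query lines that finished at that instant.
    public static Map<String, Integer> completionsByInstant(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = TS.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "2017-03-02T16:02:32.034+0000 I COMMAND [conn1328] ... 6356ms",
                "2017-03-02T16:02:32.034+0000 I COMMAND [conn1289] ... 6443ms",
                "2017-03-02T16:02:31.900+0000 I COMMAND [conn9999] ... 12ms");
        // Two of the three sample lines complete at the same millisecond.
        System.out.println(completionsByInstant(sample));
    }
}
```

A spike of dozens or hundreds of lines sharing one timestamp, repeating every few seconds, is consistent with queries queuing behind a lock and being released together.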

      Included are some graphs and logs that illustrate the problem.

      Server restarted at 22:00 and aggregation happened at 00:00:

      Same period in MMS:

      Here are some very simple queries taking over 6 seconds to complete:

      2017-03-02T16:02:32.034+0000 I COMMAND  [conn1328] command thinkmap.gamequestion command: find { find: "gamequestion", filter: { _id: 163258 }, projection: { workflow-writing: 0, votes: 0, ag: 0 }, limit: 1, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:740 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 2 }, timeAcquiringMicros: { r: 6288267 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 6356ms
      2017-03-02T16:02:32.034+0000 I COMMAND  [conn1289] command thinkmap.gamequestion command: find { find: "gamequestion", filter: { _id: 300683 }, projection: { workflow-writing: 0, votes: 0, ag: 0 }, limit: 1, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:642 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 2 }, timeAcquiringMicros: { r: 6378585 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 6443ms
      2017-03-02T16:02:32.034+0000 I COMMAND  [conn1166] command thinkmap.gamequestion command: find { find: "gamequestion", filter: { _id: 131553 }, projection: { workflow-writing: 0, votes: 0, ag: 0 }, limit: 1, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:526 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 2 }, timeAcquiringMicros: { r: 6555928 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 6623ms
      2017-03-02T16:02:32.034+0000 I COMMAND  [conn1369] command thinkmap.gamequestion command: find { find: "gamequestion", filter: { _id: 212580 }, projection: { workflow-writing: 0, votes: 0, ag: 0 }, limit: 1, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:683 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 2 }, timeAcquiringMicros: { r: 6193103 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 6257ms
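      Note that in each of these lines the `timeAcquiringMicros` figure accounts for nearly the entire reported duration, i.e. the op spent almost all of its time waiting on the global read lock rather than doing work. A small sketch (not from the ticket; names are hypothetical) that extracts both numbers from a log line for comparison:

```java
import java.util.regex.*;

// Hypothetical helper: pull the global-lock wait time and the total
// duration out of a mongod 3.4 slow-query log line of the form shown
// above, so the two can be compared.
public class LockWait {

    private static final Pattern WAIT =
            Pattern.compile("timeAcquiringMicros: \\{ r: (\\d+) \\}");
    private static final Pattern TOTAL = Pattern.compile("(\\d+)ms$");

    // Microseconds spent waiting to acquire the global read lock.
    public static long acquireWaitMicros(String line) {
        Matcher m = WAIT.matcher(line);
        return m.find() ? Long.parseLong(m.group(1)) : 0L;
    }

    // Total reported duration of the operation, in milliseconds.
    public static long totalMillis(String line) {
        Matcher m = TOTAL.matcher(line.trim());
        return m.find() ? Long.parseLong(m.group(1)) : 0L;
    }

    public static void main(String[] args) {
        String line = "... timeAcquiringMicros: { r: 6288267 } ... 6356ms";
        long waitMs = acquireWaitMicros(line) / 1000; // lock wait, in ms
        long total = totalMillis(line);               // whole op, in ms
        System.out.println("waited " + waitMs + "ms of " + total + "ms total");
    }
}
```

For the first log line above this yields roughly 6288 ms of lock wait out of a 6356 ms operation: the IDHACK lookup itself is trivial, and essentially all of the latency is lock acquisition.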
      

        Attachments

        1. bad-mongod.log.gz (5.64 MB)
        2. Screen Shot 2017-03-02 at 11.32.19 AM.png (94 kB)
        3. Screen Shot 2017-03-02 at 11.34.55 AM.png (137 kB)
        4. Screen Shot 2017-03-02 at 9.34.42 PM.png (505 kB)

          Issue Links

            Activity

              People

              Assignee:
              mark.agarunov Mark Agarunov
              Reporter:
              tinkler@vocabulary.com Marc Tinkler
              Participants:
              Votes:
              0
              Watchers:
              13

                Dates

                Created:
                Updated:
                Resolved: