Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-32156

Linear field lookup in update subsystem can lead to an individual update having quadratic time complexity

    • Type: Icon: Bug Bug
    • Resolution: Unresolved
    • Priority: Icon: Minor - P4 Minor - P4
    • None
    • Affects Version/s: 3.2.16, 3.4.6, 3.6.0
    • Component/s: Performance
    • Query Execution
    • ALL
    • Hide

      If required, we can provide a mongodump of the collection before it was removed, and some example queries to run against it which seem to hang the server with high CPU and cannot be killed in a timely manner. The dump is about 400MB compressed though.

      Simply doing:

      db.Location_Daily.find(

      {"keys.floorId": "1000000409", "timestamp": 2017120200}

      );

      would cause it to hang with high CPU for example. Again there is a unique index on those two attributes.

      Show
      If required, we can provide a mongodump of the collection before it was removed, and some example queries to run against it which seem to hang the server with high CPU and cannot be killed in a timely manner. The dump is about 400MB compressed though. Simply doing: db.Location_Daily.find( {"keys.floorId": "1000000409", "timestamp": 2017120200} ); would cause it to hang with high CPU for example. Again there is a unique index on those two attributes.

      We had some erroneous data enter our system and get written to a mongo collection. It resulted in a document containing over 600,000 attributes in it. When a find or update was attempted on this document the mongo server master went to 200% CPU and all other work ground to a halt. We attempted to kill the operation but it would not stop. It seemed that the operation was holding locks and was not yielding. Here is a sample of what the currentOp() looked like for the operation:

      {
                  "desc" : "conn17",
                  "threadId" : "140604867651328",
                  "connectionId" : 17,
                  "client" : "172.31.26.193:55908",
                  "active" : true,
                  "opid" : 147314,
                  "killPending" : true,
                  "secs_running" : 532,
                  "microsecs_running" : NumberLong(532401420),
                  "op" : "update",
                  "ns" : "skyfii-brazil-reporting.Location_Daily",
                  "query" : {
                      "keys.floorId" : "1000000409",
                      "timestamp" : NumberLong(2017120200)
                  },
                  "numYields" : 0,
                  "locks" : {
                      "Global" : "w",
                      "Database" : "w",
                      "Collection" : "w"
                  },
                  "waitingForLock" : false,
                  "lockStats" : {
                      "Global" : {
                          "acquireCount" : {
                              "r" : NumberLong(1),
                              "w" : NumberLong(1)
                          }
                      },
                      "Database" : {
                          "acquireCount" : {
                              "w" : NumberLong(1)
                          }
                      },
                      "Collection" : {
                          "acquireCount" : {
                              "w" : NumberLong(1)
                          }
                      }
                  }
              }

      Note that we have a unique index on the two attributes mentioned in the query so normally these operations take virtually no time.

      In the end we had to reboot the server several times, delete the bad data, and filter the bad data before it was written. We are not so concerned that a large document could not be updated/read. Rather the real concern is that:

      • other collections also could not be updated at the same time which meant we lost some important data
      • we could not terminate the operation without a reboot or kill -9, and
      • at one stage we could not connect a client to the server as it was timing out

      A bit of large (but not humongous) data in one collection should not bring down a whole database.

      Here's is an idea of what one of the large documents looks like:

      {
              "_id": {
                      "$oid": "5a2219d4892f00b521c545a5"
              },
              "keys": {
                      "floorId": "1000000409"
              },
              "timestamp": {
                      "$numberLong": "2017120200"
              },
              "counters": {
                      "-5037105:2680680": 1,
                      "-28830090:15338265": 1,
                      "1050:330": 93,
                      "-15932715:8477040": 1,
                      "1155:300": 2421,
                      "-16772700:8923710": 1,
                      "-27971520:14881500": 1,
                      "-23240144:12364485": 1,
                      "-15001215:7981305": 1,
                      "-29534580:15713025": 1,
                      "-5454135:2902620": 1,
                      "-11691045:6220425": 1,
                      "-1019970:543465": 1,
                      "-4189770:2229870": 1,
                      "-5274525:2806665": 1,
                      "360:405": 1124,
                      "-1271880:677610": 1,
                      "-15419295:8203785": 1,
                      "-12914220:6871020": 1,
      .... 600,000-odd entries later ....
         }
      }
      

      Background: It's community edition mongo 3.2.16. We briefly tried on a more recent 3.4.9 locally and the same problem happens on that version too. We are running a 3 data node replica set in production managed by mongo cloud.

        1. diag.tgz
          119.77 MB
        2. issue-scripts.tgz
          5.99 MB

            Assignee:
            Unassigned Unassigned
            Reporter:
            craig.leyshan@skyfii.com Craig Leyshan
            Votes:
            1 Vote for this issue
            Watchers:
            22 Start watching this issue

              Created:
              Updated: