- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: 3.2.16, 3.4.6, 3.6.0
- Component/s: Performance, Query Optimization

(copied to CRM)
Some erroneous data entered our system and was written to a MongoDB collection, resulting in a document containing over 600,000 attributes. When a find or update was attempted on this document, the primary went to 200% CPU and all other work ground to a halt. We attempted to kill the operation, but it would not stop; it appeared to be holding locks and not yielding. Here is a sample of what currentOp() showed for the operation:
{ "desc" : "conn17", "threadId" : "140604867651328", "connectionId" : 17, "client" : "172.31.26.193:55908", "active" : true, "opid" : 147314, "killPending" : true, "secs_running" : 532, "microsecs_running" : NumberLong(532401420), "op" : "update", "ns" : "skyfii-brazil-reporting.Location_Daily", "query" : { "keys.floorId" : "1000000409", "timestamp" : NumberLong(2017120200) }, "numYields" : 0, "locks" : { "Global" : "w", "Database" : "w", "Collection" : "w" }, "waitingForLock" : false, "lockStats" : { "Global" : { "acquireCount" : { "r" : NumberLong(1), "w" : NumberLong(1) } }, "Database" : { "acquireCount" : { "w" : NumberLong(1) } }, "Collection" : { "acquireCount" : { "w" : NumberLong(1) } } } }
Note that we have a unique index on the two attributes referenced in the query, so these operations normally take virtually no time.
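For illustration, a minimal sketch (pymongo) of the index and the kind of targeted update involved. The database, collection, and query fields are taken from the currentOp() output above; the connection string and the $inc on a counters sub-field are assumptions, not our actual application code:

    # Sketch only: the unique index and a targeted update of the kind shown
    # in currentOp(). The $inc on a "counters" sub-field is an assumption
    # about the workload, not copied from the application.
    from pymongo import ASCENDING, MongoClient

    client = MongoClient("mongodb://localhost:27017")
    coll = client["skyfii-brazil-reporting"]["Location_Daily"]

    # Unique index on the two attributes used in the query.
    coll.create_index(
        [("keys.floorId", ASCENDING), ("timestamp", ASCENDING)],
        unique=True,
    )

    # A targeted update matched via the unique index; on a normally sized
    # document this completes almost instantly.
    coll.update_one(
        {"keys.floorId": "1000000409", "timestamp": 2017120200},
        {"$inc": {"counters.1050:330": 1}},
        upsert=True,
    )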
In the end we had to reboot the server several times, delete the bad data, and add filtering to stop the bad data from being written again. We are not so concerned that a large document could not be updated or read; the real concerns are that:
- other collections could not be updated at the same time either, which meant we lost some important data
- we could not terminate the operation without a reboot or kill -9, and
- at one stage we could not connect a client to the server because connections were timing out
A bit of large (but not humongous) data in one collection should not bring down a whole database.
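As an illustration of the kill attempt, a minimal sketch using the killOp command (pymongo here; the shell helpers db.currentOp() and db.killOp() are the equivalent path we actually used, and the currentOp admin command assumed below is only available on newer servers, 3.6+):

    # Sketch: locate the runaway update and ask the server to kill it.
    # Assumes the currentOp admin command is available (3.6+); on older
    # versions the shell helper db.currentOp() returns the same information.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")

    ops = client.admin.command("currentOp")
    for op in ops.get("inprog", []):
        if op.get("ns") == "skyfii-brazil-reporting.Location_Daily" and op.get("op") == "update":
            # This only marks the operation (killPending: true); because the
            # update never yields, it keeps running anyway.
            client.admin.command("killOp", op=op["opid"])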
Here is an idea of what one of the large documents looks like:
{ "_id": { "$oid": "5a2219d4892f00b521c545a5" }, "keys": { "floorId": "1000000409" }, "timestamp": { "$numberLong": "2017120200" }, "counters": { "-5037105:2680680": 1, "-28830090:15338265": 1, "1050:330": 93, "-15932715:8477040": 1, "1155:300": 2421, "-16772700:8923710": 1, "-27971520:14881500": 1, "-23240144:12364485": 1, "-15001215:7981305": 1, "-29534580:15713025": 1, "-5454135:2902620": 1, "-11691045:6220425": 1, "-1019970:543465": 1, "-4189770:2229870": 1, "-5274525:2806665": 1, "360:405": 1124, "-1271880:677610": 1, "-15419295:8203785": 1, "-12914220:6871020": 1, .... 600,000-odd entries later .... } }
Background: we are running MongoDB Community Edition 3.2.16. We briefly tried a more recent 3.4.9 locally and the same problem occurs on that version too. In production we run a three-data-node replica set managed by MongoDB Cloud.
- is duplicated by:
  - SERVER-40043 $set takes quadratic time while holding locks (Closed)
- is related to:
  - SERVER-36791 $addToSet with $each takes quadratic time while holding locks (Backlog)
  - SERVER-40043 $set takes quadratic time while holding locks (Closed)