[SERVER-41066] All cluster PRIMARY mongod was killed by oomkiller in a few seconds Created: 09/May/19  Updated: 24/Jun/19  Resolved: 24/Jun/19

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: 4.0.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Artem Assignee: Eric Sedor
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

We had several incidents in which the PRIMARY mongod of every heavily loaded cluster was killed by the OOM killer within a few seconds.

On the graph, it looked like the mongod process doubled its memory usage from 64GB within a few seconds.

During the investigation, we found log lines like:

2019-04-23T20:39:33.160+0000 E -        [conn1790578] Assertion: BSONObjectTooLarge: BSONObj size: 66053215 (0x3EFE45F) is invalid. Size must be between 0 and 16793600(16MB) First element: stage: "OR" src/mongo/bson/bsonobj.cpp 102

and:

2019-04-23T20:39:33.283+0000 I COMMAND  [conn1790578] warning: log line attempted (11390kB) over max size (10kB), printing beginning and end

We limited the batch size in one of our service tasks, which reduced the BSON size, and the problem was solved.
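The service task itself is not shown in this ticket, so the following is only a minimal sketch of what such a batch-size limit could look like. The helper name `chunked`, the batch size of 4, and the assumption that the task processes a list of ids are all hypothetical and chosen for illustration.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical input: ids that the task previously put into a single query.
ids = list(range(10))
batches = list(chunked(ids, 4))
print(batches)
```

Each batch would then be sent as its own, smaller query (for example one filter per batch instead of one filter covering every id), so that no single command or the plan built for it grows with the total number of items.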

We also found in the logs a few spikes in memory consumption with similar symptoms that did not trigger the OOM killer. This does not look like a memory leak, since memory consumption returned to its normal level afterwards.

In terms of mongod behavior, I am concerned about the following points:

  • Assertion: BSONObjectTooLarge - it seems to me that the size check occurs only after the object has been loaded into memory;
  • log line attempted (...) over max size - it looks like the whole line is built in memory, although only 10kB of it is printed.
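To make the two concerns concrete, here is a minimal sketch of the behavior being asked about. It is not mongod's implementation. It relies only on the fact that a BSON document starts with a little-endian int32 holding its total length, and on the limits quoted in the log lines above. The function names are hypothetical, and the log helper keeps only the beginning of the line, whereas the real message says that mongod prints the beginning and the end.

```python
import io
import struct

MAX_BSON_SIZE = 16793600  # upper bound quoted in the assertion message
MIN_BSON_SIZE = 5         # int32 length prefix plus the trailing NUL byte


def read_bson_checked(stream, max_size=MAX_BSON_SIZE):
    """Read one BSON document, rejecting an oversized one before buffering it."""
    prefix = stream.read(4)
    if len(prefix) != 4:
        raise ValueError("truncated BSON length prefix")
    (size,) = struct.unpack("<i", prefix)
    # The declared size is validated before the body is read into memory.
    if size < MIN_BSON_SIZE or size > max_size:
        raise ValueError(f"BSONObj size: {size} is invalid")
    body = stream.read(size - 4)
    if len(body) != size - 4:
        raise ValueError("truncated BSON document")
    return prefix + body


def truncated_log_line(parts, max_size=10 * 1024):
    """Join string parts for logging without building more than max_size characters."""
    kept = []
    used = 0
    for part in parts:
        room = max_size - used
        if room <= 0:
            break  # stop consuming input once the limit is reached
        piece = part[:room]
        kept.append(piece)
        used += len(piece)
    return "".join(kept)
```

With this shape, a declared size of 66053215 bytes is rejected after reading four bytes, and a log line assembled from a lazy sequence of parts never occupies more than the configured limit.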


 Comments   
Comment by Eric Sedor [ 24/Jun/19 ]

Hi,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Sincerely,
Eric

Comment by Eric Sedor [ 03/Jun/19 ]

Hi bozaro,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please upload the log files and the $dbpath/diagnostic.data directory covering one or more of these incidents to this secure private portal, and provide a timeline for the incident(s)?

Thanks in advance.

Comment by Bruce Lucas (Inactive) [ 09/May/19 ]

bozaro, so that we can investigate these issues further, would you be able to upload the log files and the $dbpath/diagnostic.data directory covering one or more of these incidents? You can upload them to this secure private portal. Also please tell us the timeline of the incident(s) covered by the data.
