[SERVER-42493] Replica set crashes Created: 30/Jul/19  Updated: 27/Oct/23  Resolved: 31/Jul/19

Status: Closed
Project: Core Server
Component/s: Replication, Stability
Affects Version/s: 3.4.16
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Stefan Adamov Assignee: Dmitry Agranat
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File rs1-1-diagnostics.tar     HTML File rs1-1-mongo-log     File rs1-2-diagnostics.tar     HTML File rs1-2-mongo-log    
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

Hey guys, we observed some unexpected behaviour with the following setup:

All times are UTC

  • 3-member replica set
    • two bigger instances for failover - rs1-1 and rs1-2
    • one smaller instance for backups
  1. Around 00:31 the primary rs1-1 had a major spike in memory usage.
    • this is inferred from "Cannot allocate memory" messages in the syslog of the instance
    • based on the mongo logs: there were no heavy queries running at the time
  2. After rs1-1 became unresponsive, rs1-2 became the new primary and had a similar memory usage spike around 00:37
    • again inferred from the syslog
    • again no big queries can be seen in the mongo log
  3. Both instances were unresponsive (not able to SSH, not reporting metrics) for a few hours until we restarted them
  4. Upon restart, rs1-1 crashed one more time around 06:44
  5. After the second crash I scaled up the machines and they have been running OK since then

You can see attached:

  • mongo logs from both servers
  • diagnostics.data from both servers

Let me know if you need any more information.



 Comments   
Comment by Dmitry Agranat [ 31/Jul/19 ]

Hi adamof,

After reviewing the provided data, I was not able to find any bottleneck on the MongoDB side. The fact that the issue has not recurred since the servers were scaled up from 7.4GB RAM to 15GB RAM indicates that it may simply have been caused by the amount of memory the queries used while running. I would also recommend reviewing the queries for tuning, as I've noticed the workload sometimes scans ~20 million documents just to return ~700.
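
To put the scan figures in perspective, here is a minimal sketch of how the examined-to-returned ratio could be computed. It assumes the numbers correspond to the `totalDocsExamined` and `nReturned` fields of an explain `executionStats` document; the helper name `scan_ratio` and the sample values are illustrative only and were not taken from the attached diagnostics.

```python
def scan_ratio(execution_stats):
    """Return documents examined per document returned."""
    examined = execution_stats["totalDocsExamined"]
    returned = execution_stats["nReturned"]
    # Guard against division by zero when a query returns nothing.
    return examined / max(returned, 1)


# Illustrative figures from the comment above, not real explain output.
stats = {"totalDocsExamined": 20_000_000, "nReturned": 700}
ratio = scan_ratio(stats)
print(f"{ratio:,.0f} documents examined per document returned")
```

A ratio close to 1 usually means an index matches the query predicate well; a ratio in the tens of thousands, as here, usually points to a missing or poorly matching index.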

As I was not able to find an issue on the MongoDB side, I will go ahead and close this ticket.

Thank you,
Dima

Generated at Thu Feb 08 05:00:38 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.