[SERVER-45315] When a replica-set member goes offline, other members' CPUs spike to 100% Created: 29/Dec/19  Updated: 27/Oct/23  Resolved: 13/Jan/20

Status: Closed
Project: Core Server
Component/s: Performance, Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Arik Nano Assignee: Dmitry Agranat
Resolution: Community Answered Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

TL;DR: 

Powering off one member of a MongoDB shard causes the CPU usage of the other members to rise to 100%.

Background: 

I want to deploy a MongoDB cluster across several ESXi hosts. The cluster has to survive the shutdown of two components.

Cluster architecture (MongoDB 4.2):

  • 5 config servers
  • 3 query servers
  • shard01:
    • primary
    • 2 secondaries
    • 2 arbiters
  • shard02:
    • primary
    • 2 secondaries
    • 2 arbiters

The problem:

Whenever I test HA by removing one of the members, I notice that after several minutes the CPU usage of the remaining members spikes to 100% and stays there until I bring the missing member back.

Tests I have conducted:

  1. Shut down 1 replica -> remaining members' CPU rises to 100%
  2. Shut down 1 replica and 1 arbiter -> remaining members' CPU rises to 100%
  3. Shut down 1 arbiter -> remaining members are OK

Things I have already checked:

  • On the affected VMs, mongod is the process consuming most of the CPU (99%).
  • I checked mongod for long-running queries with db.currentOp(). Everything looks fine.
  • mongod.log does not contain any suspicious entries.
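
For reference, the long-running-operation check in the list above could be scripted roughly as follows. This is a minimal sketch, not the exact command used in this report: `longRunningOps` and its `minSecs` threshold are hypothetical names, and the filter relies on the `active` and `secs_running` fields that db.currentOp() reports for each operation.

```javascript
// Sketch: list active operations that have been running for at least minSecs.
// `db` is the shell's database handle; `longRunningOps` and `minSecs` are
// hypothetical names introduced for illustration.
function longRunningOps(db, minSecs) {
  // The filter document is passed to the server by db.currentOp().
  const result = db.currentOp({
    active: true,
    secs_running: { $gte: minSecs },
  });
  // Keep only the fields that help identify the operation.
  return result.inprog.map((op) => ({
    opid: op.opid,
    secs: op.secs_running,
    ns: op.ns,
    op: op.op,
  }));
}
```

In the mongo shell, calling longRunningOps(db, 5) on each remaining member would list operations that have been active for five seconds or more. An empty array is consistent with the observation that no long-running queries were found.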

Bottom line:

I posted the problem on [Stack Overflow|https://stackoverflow.com/questions/59491006/why-when-one-of-mongodb-replica-set-shard-members-goes-offline-the-others-cpus] and was advised to report it as a bug.

Regards,

Aric



 Comments   
Comment by Dmitry Agranat [ 13/Jan/20 ]

Hi naheim.lavon@opka.org,

I will go ahead and close this case. Do not hesitate to reach out if you still face the same issue after implementing the recommendations from my last comment.

Regards,
Dima

Comment by Dmitry Agranat [ 30/Dec/19 ]

Hi naheim.lavon@opka.org, your first two tests basically turn your replica set into a deployment we do not recommend. To address this, you should either disable read concern majority or avoid using arbiters in your deployment.
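
The arithmetic behind this can be sketched as follows, assuming each shard has the five voting members listed in the description (1 primary, 2 secondaries, 2 arbiters). Arbiters vote in elections but hold no data, so they cannot acknowledge majority writes. The function and field names are hypothetical, and the check is a simplification of how the server computes its majority.

```javascript
// Sketch: compare the members still up against the majority of voting members.
// Roles mirror shard01/shard02 from the description; all names are hypothetical.
const ROLES = ["primary", "secondary", "secondary", "arbiter", "arbiter"];

// Build a member list with the given indexes marked as powered off.
function shardWithDown(downIndexes) {
  return ROLES.map((role, i) => ({ role, up: !downIndexes.includes(i) }));
}

function majorityStatus(members) {
  // Assumption: every member votes, so the majority of 5 is 3.
  const majority = Math.floor(members.length / 2) + 1;
  const up = members.filter((m) => m.up);
  // Only data-bearing members can acknowledge a majority write.
  const dataBearingUp = up.filter((m) => m.role !== "arbiter").length;
  return {
    majority,
    canElectPrimary: up.length >= majority,
    canCommitMajority: dataBearingUp >= majority,
  };
}

const test1 = majorityStatus(shardWithDown([1]));    // one data-bearing member down
const test2 = majorityStatus(shardWithDown([1, 3])); // one data-bearing member and one arbiter down
const test3 = majorityStatus(shardWithDown([3]));    // one arbiter down
console.log(test1, test2, test3);
```

Under these assumptions, the first two tests leave only two data-bearing members against a required majority of three, so a primary can still be elected but majority-committed writes cannot advance. The third test keeps all three data-bearing members and is unaffected. This matches the pattern reported in the description, though it is not a confirmed diagnosis.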

If after the above suggestion you still see the same issue, please let us know and we'll send you a link to a secure portal to collect some data.

Thanks,
Dima
