[SERVER-63984] Primary replica member becomes unavailable during normal operation Created: 25/Feb/22  Updated: 27/Oct/23  Resolved: 23/Jun/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Vladimir Beliakov Assignee: Dmitry Agranat
Resolution: Community Answered Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 16.04
XFS
Kernel - 4.4.0-1128-aws #142-Ubuntu SMP Fri Apr 16 12:42:33 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Transparent Huge Pages (THP) disabled
AWS m5.large (2 vCPU / 8 GB RAM)
450 GB GP3 SSD
mongodb-org-server 4.2.17


Attachments: Zip Archive data.diagnostics.zip     Zip Archive diagnostic.data-1.zip     Zip Archive diagnostic.data-2.zip     Zip Archive diagnostic.data.zip     HTML File gdb.html     Text File gdb_2022-04-12_14-15-43.txt     Zip Archive gdb_2022-05-12_09-59-24.zip     Zip Archive metrics.zip    
Operating System: ALL
Participants:

 Description   

For no apparent reason, the primary member of one of our shards' replica sets became unresponsive until we restarted it.

The incident lasted about 35 minutes. During that time we saw nearly 100% CPU consumption on the primary, and its load average rose to as much as 60 times its normal value.

From the logs (from the beginning of the incident) we could determine only the following:
1) The number of open connections started increasing (see the serverStatus sketch after this list).
2) Some open cursors timed out.
3) The pooled connections to other members were dropped (reportedly due to shutdown, although we did not try to shut down the primary at that time):
```
I CONNPOOL [TaskExecutorPool-0] Dropping all pooled connections to some-secondary:27017 due to ShutdownInProgress: Pool for some-secondary:27017 has expired.
```
4) After some time, no log entries appeared at all (for about 25 minutes) until we restarted the primary.
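
For reference, a minimal mongo shell sketch of how we check those counters via `db.serverStatus()` (the field names are the standard serverStatus fields on 4.2; the interpretation is ours):

```
// Run against the affected mongod from the mongo shell.
var s = db.serverStatus();

// Open vs. available connections (symptom 1).
printjson(s.connections);     // { current: ..., available: ..., totalCreated: ..., active: ... }

// Cursor metrics, including how many cursors have timed out (symptom 2).
printjson(s.metrics.cursor);  // { timedOut: ..., open: { noTimeout: ..., pinned: ..., total: ... } }
```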

Our cluster configuration:

  • sharded cluster with 10 shards
  • four replica set members in each shard
  • about 400 GB of storage size per shard

Replica server configuration:

  • Ubuntu 16.04
  • XFS
  • Kernel - 4.4.0-1128-aws #142-Ubuntu SMP Fri Apr 16 12:42:33 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Transparent Huge Pages (THP) disabled
  • AWS m5.large (2 vCPU / 8 GB RAM)
  • 450 GB GP3 SSD
  • mongodb-org-server 4.2.17

`diagnostic.data` of the primary and one of the secondaries is attached to this ticket.



 Comments   
Comment by Vladimir Beliakov [ 14/Jul/22 ]

I suspect this has something to do with balancing chunks. We're using `sh.addTagRange` to set custom chunk ranges. Yesterday we changed the ranges and the balancer started moving chunks. The incident happened 15-20 minutes afterwards. 
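
For context, a minimal sketch of the kind of commands involved (the namespace `mydb.mycoll`, the `userId` bounds, and the zone/tag name `zoneA` are placeholders, not our real values):

```
// Run against a mongos. Namespace, bounds and tag below are placeholders.
sh.removeTagRange("mydb.mycoll", { userId: 0 }, { userId: 1000 }, "zoneA");  // drop the old range
sh.addTagRange("mydb.mycoll", { userId: 0 }, { userId: 2000 }, "zoneA");     // define the new range

// The balancer then migrates chunks to satisfy the new ranges.
sh.isBalancerRunning();  // true while migrations are in progress
```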

Comment by Vladimir Beliakov [ 13/Jul/22 ]

I found this message in the logs:

```
STORAGE  [FlowControlRefresher] Flow control is engaged and the sustainer point is not moving. Please check the health of all secondaries.
```

which led me to this issue. Might be related.
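
In case it helps, a quick sketch of how the flow control state can be inspected from the 4.2 mongo shell (field names per `db.serverStatus().flowControl`):

```
// On the primary: flow control metrics (exposed by serverStatus since 4.2).
var fc = db.serverStatus().flowControl;
printjson(fc);  // enabled, isLagged, targetRateLimit, timeAcquiringMicros, sustainerRate, ...

// If flow control is engaged, check how far the secondaries are lagging.
rs.printSlaveReplicationInfo();  // shell helper name in the 4.2 shell
```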

Comment by Vladimir Beliakov [ 13/Jul/22 ]

dmitry.agranat@mongodb.com
We ran into a somewhat similar issue, but the behavior of the primary was a bit different, and transactions may not be to blame. So I kindly ask you to analyze the diagnostic data to help us understand what the culprit was.
diagnostic.data.zip

Comment by Dmitry Agranat [ 09/Jun/22 ]

Thank you vladimirred456@gmail.com, after looking at the latest stack traces from May 12th, this is indeed what we suspected from the start: when the dirty portion of the WiredTiger (WT) cache reaches 20%, we are unable to evict any of this dirty data and, under certain conditions, transactions can get stuck.

There are a few steps you can take to address this issue:

  • The most effective way to avoid this issue is to diagnose why a transaction is pinning so much dirty data and fix it on the application side. In other words, make sure your transactions (that change data) are short and do not need to "dirty" more than 20% of the WT cache (in your case, more than ~600 MB); see the sketch after this list for a rough way to monitor this.
  • Increase your memory so that a single transaction (that changes data) does not fill 20% of the WT cache.
  • Upgrade to MongoDB 5.0, where significant improvements were made in this area.
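
As a rough way to keep an eye on that threshold, here is a minimal mongo shell sketch based on the standard WiredTiger cache statistics exposed by `db.serverStatus()` (the 20% check simply mirrors the threshold described above and is illustrative only):

```
// WiredTiger cache statistics from serverStatus.
var c = db.serverStatus().wiredTiger.cache;

var maxBytes   = c["maximum bytes configured"];          // configured WT cache size
var dirtyBytes = c["tracked dirty bytes in the cache"];  // currently dirty data

var dirtyPct = 100 * dirtyBytes / maxBytes;
print("WT cache dirty: " + dirtyPct.toFixed(1) + "% of " +
      (maxBytes / 1024 / 1024).toFixed(0) + " MB configured cache");

// Transactions start to stall once dirty data approaches ~20% of the cache.
if (dirtyPct > 20) {
    print("Dirty cache exceeds the 20% threshold - look for long-running write transactions.");
}
```
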
Comment by Vladimir Beliakov [ 03/Jun/22 ]

Hi dmitry.agranat@mongodb.com,

We had approximately the same issue once more. However, this time the machine became unresponsive, so we couldn't collect the stack traces, only the diagnostic data (attached).
The incident started at about 06/03/2022 05:29:00 UTC.
diagnostic.data.zip

Comment by Vladimir Beliakov [ 12/May/22 ]

Hi dmitry.agranat@mongodb.com,

We ran into the same issue and were able to collect the stack traces during the incident. You can find them and the diagnostic.data attached to this message.
data.diagnostics.zip
gdb_2022-05-12_09-59-24.zip

Comment by Vladimir Beliakov [ 18/Apr/22 ]

dmitry.agranat@mongodb.com thanks for your answer. I'll keep that in mind.

Comment by Dmitry Agranat [ 18/Apr/22 ]

Thanks for the clarification vladimirred456@gmail.com. Since the stacks were collected only after the node was stepped down, we can no longer see why the member was unavailable before the step-down. In order for us to get to the bottom of this issue, we need stack traces collected on the relevant node while the issue is still in progress.

Comment by Vladimir Beliakov [ 18/Apr/22 ]

dmitry.agranat@mongodb.com, we had to make that member step down during the incident and we collected the stack traces right after that. Here's diagnostic.data
metrics.zip
glibc version:
ldd (Ubuntu GLIBC 2.23-0ubuntu11.3) 2.23
Copyright (C) 2016 Free Software Foundation, Inc.
 

Comment by Dmitry Agranat [ 17/Apr/22 ]

Thanks vladimirred456@gmail.com, do you also have diagnostic.data covering the time of the latest event that happened on April 12th? I am asking because, based on the stack traces provided, the member does not appear to have been in the PRIMARY state. Can you please confirm this?

In addition, could you please post your current glibc version?

Comment by Vladimir Beliakov [ 12/Apr/22 ]

Hi, Edwin Zhou!

We had a somewhat similar problem. The required stack traces file is attached. I hope that will be helpful.

Cheers!
gdb_2022-04-12_14-15-43.txt

Comment by Edwin Zhou [ 14/Mar/22 ]

Thank you for your follow-up, vladimirred456@gmail.com. I will leave this in "Waiting for User Input" while we await the stack traces from a repeat occurrence.

Best,
Edwin

Comment by Vladimir Beliakov [ 11/Mar/22 ]

Hi, edwin.zhou!

Yes, I've already instructed our DevOps team to collect the stack traces if we run into the same problem again, but no similar incidents have happened yet. When we have the stack traces, I'll attach them to the ticket ASAP.

Thank you for your help!

Comment by Edwin Zhou [ 10/Mar/22 ]

Hi vladimirred456@gmail.com,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please collect the gdb stack traces and upload them to this ticket?

Best,
Edwin

Comment by Dmitry Agranat [ 28/Feb/22 ]

Hi vladimirred456@gmail.com, after looking at the data you've provided (thank you for that), I suspect this might be related to a particular situation where a process can get stuck with long-running transactions not aborting. This usually happens when the dirty portion of the WT cache reaches 20%: we are unable to evict any of this dirty data and, under certain conditions, transactions can get stuck.

In order to confirm or rule this out, we'll need to collect some additional information if this issue occurs again. After the process gets stuck and before restarting it, please execute this command:

```
gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' >gdb_`date +"%Y-%m-%d_%H-%M-%S"`.txt
```

This will collect stack traces showing where threads are stuck. Please test the gdb command in advance to make sure it works as expected if/when the issue occurs again.

Regards,
Dima
