[SERVER-63984] Primary replica member becomes unavailable during normal operation Created: 25/Feb/22 Updated: 27/Oct/23 Resolved: 23/Jun/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Vladimir Beliakov | Assignee: | Dmitry Agranat |
| Resolution: | Community Answered | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Ubuntu 16.04 |
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
For no apparent reason, the primary replica set member of one of our shards became unresponsive until we restarted it. The incident lasted about 35 minutes. During that time we saw close to 100% utilization on the primary, and its load average rose to as much as 60 times its normal value. From the logs (from the beginning of the incident) we could only glean the following:
Our cluster configuration:
Replica server configuration:
`diagnostic.data` of the primary and of one of the secondaries is attached to the post. |
| Comments |
| Comment by Vladimir Beliakov [ 14/Jul/22 ] | |
|
I suspect this has something to do with chunk balancing. We're using `sh.addTagRange` to set custom chunk ranges. Yesterday we changed the ranges and the balancer started moving chunks; the incident happened 15-20 minutes afterwards.
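For illustration, the kind of call involved looks roughly like this (a sketch run against a mongos; the namespace, shard-key field, bounds, and tag name below are placeholders, not our actual ranges):

```
# Hypothetical sketch of re-tagging a range; every name and bound below is a placeholder.
# Assumes the legacy mongo shell and a reachable mongos (substitute mongosh if that is what is installed).
mongo --host my-mongos.example.net:27017 --eval '
  sh.addShardTag("shard0000", "recent");                    // associate a shard with a tag (zone)
  sh.addTagRange(
    "mydb.events",                                          // sharded collection (namespace)
    { createdAt: ISODate("2022-01-01T00:00:00Z") },         // inclusive lower bound
    { createdAt: MaxKey },                                   // exclusive upper bound
    "recent"                                                 // tag name
  );
  // Changing tag ranges prompts the balancer to migrate chunks so that each
  // range ends up on shards carrying the matching tag.
'
```
| |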
| Comment by Vladimir Beliakov [ 13/Jul/22 ] | |
|
I found this message in the logs
which led me to this issue. Might be related. | |
| Comment by Vladimir Beliakov [ 13/Jul/22 ] | |
|
dmitry.agranat@mongodb.com | |
| Comment by Dmitry Agranat [ 09/Jun/22 ] | |
|
Thank you vladimirred456@gmail.com. After looking at the latest stack traces from May 12th, this is indeed what we suspected from the start: when the dirty portion of the WiredTiger (WT) cache reaches 20%, the server is unable to clean any of this dirty data and, under certain conditions, transactions can get stuck. There are a few steps you can take to address this issue:
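Independent of those steps, a quick way to check how close a node is to that 20% threshold is to read the cache statistics reported by `db.serverStatus()` (a minimal sketch; the local, unauthenticated `mongo --eval` invocation is an assumption about this deployment, and the 20% figure matches WiredTiger's default `eviction_dirty_trigger`):

```
# Sketch: report the dirty portion of the WiredTiger cache on this node.
# Assumes the legacy mongo shell and a local, unauthenticated connection;
# adjust host/auth options for your deployment.
mongo --quiet --eval '
  var c = db.serverStatus().wiredTiger.cache;
  var dirty = c["tracked dirty bytes in the cache"];
  var max = c["maximum bytes configured"];
  print((100 * dirty / max).toFixed(1) + "% of the configured WT cache is dirty");
'
```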
| |
| Comment by Vladimir Beliakov [ 03/Jun/22 ] | |
|
Hi dmitry.agranat@mongodb.com, we ran into approximately the same issue one more time. However, this time the machine became unresponsive, so we couldn't collect the stack traces, only the diagnostic data (attached). | |
| Comment by Vladimir Beliakov [ 12/May/22 ] | |
|
Hi dmitry.agranat@mongodb.com, We ran into the same issue and were able to collect the stack traces during the incident. You can find them and the diagnostic.data attached to this message. | |
| Comment by Vladimir Beliakov [ 18/Apr/22 ] | |
|
dmitry.agranat@mongodb.com thanks for your answer. I'll keep that in mind. | |
| Comment by Dmitry Agranat [ 18/Apr/22 ] | |
|
Thanks for the clarification, vladimirred456@gmail.com. Since the stacks were collected only after the node was stepped down, we can no longer see why the member was unavailable before the step-down. In order for us to get to the bottom of this issue, we need stack traces collected on the relevant node while the issue is still in progress. | |
| Comment by Vladimir Beliakov [ 18/Apr/22 ] | |
|
dmitry.agranat@mongodb.com, we had to make that member step down during the incident and we collected the stack traces right after that. Here's diagnostic.data | |
| Comment by Dmitry Agranat [ 17/Apr/22 ] | |
|
Thanks vladimirred456@gmail.com, do you also have diagnostic.data covering the time of the latest event that happened on April 12th? I am asking this because based on the stack traces provided, the state of the member does not look to be Primary. Can you please confirm this? In addition, could you please post your current glibc version? | |
| Comment by Vladimir Beliakov [ 12/Apr/22 ] | |
|
Hi, Edwin Zhou! We had a somewhat similar problem. The required stack traces file is attached. I hope that will be helpful. Cheers! | |
| Comment by Edwin Zhou [ 14/Mar/22 ] | |
|
Thank you for your follow-up, vladimirred456@gmail.com. I will leave this ticket in Waiting for User Input while we await the stack traces from a repeat occurrence. Best, | |
| Comment by Vladimir Beliakov [ 11/Mar/22 ] | |
|
Hi, edwin.zhou! Yes, I have already asked our devops team to collect the stack traces in case we hit the same problem again, but no similar incidents have happened yet. When we have the stack traces I'll attach them to the ticket ASAP.
Thank you for your help! | |
| Comment by Edwin Zhou [ 10/Mar/22 ] | |
|
We still need additional information to diagnose the problem. If this is still an issue for you, would you please collect the gdb stack traces and upload them to this ticket? Best, | |
| Comment by Dmitry Agranat [ 28/Feb/22 ] | |
|
Hi vladimirred456@gmail.com, after looking at the data you've provided (thank you for that), I suspect this might be related to a particular situation where a process can get stuck with long-running transactions not aborting. This usually happens when the dirty portion of the WiredTiger (WT) cache reaches 20%: the server is then unable to clean any of this dirty data and, under certain conditions, transactions can get stuck. In order to confirm or rule this out, we'll need to collect some additional information if this issue occurs again. After the process gets stuck and before rebooting it, please collect stack traces from the stuck mongod process with a command along the lines of the sketch below:
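A minimal sketch, assuming gdb is installed on the host and exactly one mongod process is running there (the `pgrep` lookup and the output file name are assumptions to adjust as needed):

```
# Sketch: dump stack traces of all mongod threads while the process is stuck.
# Assumes gdb is installed and exactly one mongod process runs on this host.
gdb --batch -p $(pgrep -x mongod) \
    -ex "set pagination off" \
    -ex "thread apply all bt" > mongod-stacks-$(date +%Y%m%dT%H%M%S).txt 2>&1
```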
This will collect stack traces showing where the threads are stuck. Please test the gdb command ahead of time to make sure it works as expected if/when the issue occurs again. Regards, |