[SERVER-68124] Primary replica member hangs during chunk migrating Created: 19/Jul/22  Updated: 11/Oct/22  Resolved: 11/Oct/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.2.17
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Vladimir Beliakov Assignee: Chris Kelly
Resolution: Done Votes: 0
Labels: balancing
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 16.04
XFS
Kernel - 4.4.0-1128-aws #142-Ubuntu SMP Fri Apr 16 12:42:33 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Transparent Huge Pages disabled
AWS m5.large (2 CPU / 8 GB)
SSD GP3 450 GB
mongodb-org-server - 4.2.17


Attachments: Zip Archive diagnostic.data.zip     PNG File image-2022-09-09-05-44-22-482.png    
Operating System: ALL
Participants:

 Description   

We're using `sh.addTagRange` to set custom chunk ranges. Usually the ranges are fixed and chunks are not moved between shards. However, the other day we needed to change the ranges, which made the balancer move chunks. Within 15-20 minutes, the primaries of some of the shards became unresponsive and the whole sharded cluster hung.
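For illustration, assigning a chunk range to a zone with `sh.addTagRange` looks roughly like the sketch below; the namespace, shard key, bounds, and zone name are placeholders, not our actual values. Changing such ranges afterwards is what caused the balancer to start moving chunks.

    // Hypothetical example only: associate a shard with a zone and
    // assign a shard-key range of "mydb.mycoll" to that zone.
    sh.addShardTag("shard0000", "zoneA");
    sh.addTagRange(
        "mydb.mycoll",
        { userId: MinKey },
        { userId: 5000 },
        "zoneA"
    );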

diagnostic.data.zip from the hung primary is attached.

Also, I found this message in the logs:

STORAGE  [FlowControlRefresher] Flow control is engaged and the sustainer point is not moving. Please check the health of all secondaries.

 

which led me to this issue. Might be related.
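For reference, the flow control state at the time can be inspected on the primary via serverStatus; a minimal sketch (the flowControl section assumes MongoDB 4.2+):

    // Inspect flow control on the primary (serverStatus section added in 4.2).
    // "isLagged" and a low "targetRateLimit" suggest flow control is
    // throttling writes because the majority commit point is not advancing.
    var fc = db.serverStatus().flowControl;
    printjson({
        enabled: fc.enabled,
        isLagged: fc.isLagged,
        targetRateLimit: fc.targetRateLimit,
        sustainerRate: fc.sustainerRate
    });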



 Comments   
Comment by Chris Kelly [ 11/Oct/22 ]

We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Comment by Chris Kelly [ 27/Sep/22 ]

Hi Vladimir,

Just a reminder, in order to investigate this further we need additional information. Specifically, we are missing:

  • A specific timeline of what is happening on your end (with timestamps)
  • The mongod logs associated with the event (if they still exist)

We are missing indicators that relate your submitted diagnostic data to SERVER-53738. Specifically, chunk moves do not appear at all in the time range covered by your diagnostic data (2022-02-23T00:30:04.584Z to 2022-02-23T00:30:04.584Z).
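In case it helps on your side, chunk migrations are recorded in the config changelog; a minimal query sketch for checking that window (run against a mongos; the time bounds below are placeholders):

    // List chunk-migration entries recorded by the config servers
    // around the incident (adjust the time bounds as needed).
    db.getSiblingDB("config").changelog.find({
        what: { $in: ["moveChunk.start", "moveChunk.commit",
                      "moveChunk.from", "moveChunk.to"] },
        time: { $gte: ISODate("2022-02-23T00:00:00Z"),
                $lte: ISODate("2022-02-24T00:00:00Z") }
    }).sort({ time: 1 });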

If you have more data from the first incident, or the issue happens again, please submit the additional information so we can investigate further.

 

Regards,

Christopher

Comment by Chris Kelly [ 09/Sep/22 ]

Vladimir,

Thank you for your patience on this issue.

If you still have the mongod logs for this issue (or are still experiencing the problem), can you please submit them in conjunction with the diagnostic data for more context?

As for your initial submission, I do not see the same indicators that are present in SERVER-53738. Specifically, there is no recorded chunk-move activity on the primary or secondary here. FTDC is lost on your primary node at 2022-02-23T12:11:05.000Z.

At point A (2022-02-23T12:10:11.053Z) there is a jump in the number of insert ops recorded, which correlates with high disk utilization. In the middle of this, a checkpoint begins and the number of active and queued writers starts rising quickly over the next few seconds.

At point B (2022-02-23T12:10:40.291Z) we see a substantial amount of application-thread time spent evicting (denoted by ss wt thread-yield application thread time evicting). This can occur when the wt cache dirty fill ratio exceeds 20%, which is also observed.
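For context, that dirty fill ratio can be approximated from serverStatus; a rough sketch using the WiredTiger cache statistics:

    // Approximate the WiredTiger dirty cache fill ratio. Application
    // threads start evicting once this exceeds roughly 20%.
    var cache = db.serverStatus().wiredTiger.cache;
    var dirty = cache["tracked dirty bytes in the cache"];
    var max = cache["maximum bytes configured"];
    print("dirty fill ratio: " + (100 * dirty / max).toFixed(1) + "%");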

 

I'm not immediately sure what to make of this and would be interested in additional logs, and a more specific timeline of events on your side if at all possible. I'd like to know when you changed your chunk ranges, whether chunk migration is even occurring, or whether something unrelated is causing this. In the ticket you linked, we see indicators of moveChunk being initiated, but I do not see that in either your primary or secondary data.

 

Regards,

Christopher 

Comment by Vladimir Beliakov [ 20/Jul/22 ]

which led me to this issue. Might be related.

Sorry, forgot to add the link. Here it is.
