[SERVER-68124] Primary replica member hangs during chunk migration Created: 19/Jul/22 Updated: 11/Oct/22 Resolved: 11/Oct/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.2.17 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Vladimir Beliakov | Assignee: | Chris Kelly |
| Resolution: | Done | Votes: | 0 |
| Labels: | balancing | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Ubuntu 16.04 |
| Attachments: | diagnostic.data.zip |
| Operating System: | ALL |
| Participants: | Vladimir Beliakov, Chris Kelly |
| Description |
|
We're using `sh.addTagRange` to set custom chunk ranges. Usually the ranges are fixed and the chunks are not moved between shards. However, the other day we needed to change the ranges, which made the balancer move chunks. Within 15-20 minutes afterwards the primary of some of the shards became unresponsive and the whole sharded cluster hung. Diagnostic data is attached (diagnostic.data.zip). Also, I found a message in the logs
which led me to an existing issue. Might be related. |
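For context, a minimal mongo shell sketch of the kind of tag range change described above (the namespace, shard key, and tag names are hypothetical, not taken from this ticket):

```javascript
// Hypothetical namespace, shard key, and tag names -- illustration only.
// Associate a shard with a zone/tag.
sh.addShardTag("shard0000", "zoneA");

// Pin the original range to the zone.
sh.addTagRange("mydb.mycoll", { userId: MinKey }, { userId: 5000 }, "zoneA");

// Changing the range: remove the old definition and add the new one.
// Once the ranges change, the balancer may start migrating chunks so that
// data placement matches the new zone boundaries.
sh.removeTagRange("mydb.mycoll", { userId: MinKey }, { userId: 5000 }, "zoneA");
sh.addTagRange("mydb.mycoll", { userId: MinKey }, { userId: 10000 }, "zoneA");
```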
| Comments |
| Comment by Chris Kelly [ 11/Oct/22 ] |
|
We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket. |
| Comment by Chris Kelly [ 27/Sep/22 ] |
|
Hi Vladimir, just a reminder: in order to investigate this further we need additional information. Specifically, we are missing:
- the mongod logs covering the incident, and
- indicators in your submitted diagnostic data that relate it to the ticket you linked (e.g. evidence that a moveChunk was initiated).

If you have more data from the first incident, or the issue happens again, please submit the additional information so we can investigate further.
Regards, Christopher |
| Comment by Chris Kelly [ 09/Sep/22 ] |
|
Vladimir, Thank you for your patience on this issue. If you still have the mongod logs for this issue (or are still experiencing the problem), can you please submit them in conjunction with the diagnostic data for more context?

As for what I saw from your initial submission, I do not see the same indicators as are present in the ticket you linked.

At point A (2022-02-23T12:10:11.053Z) there is a jump in the amount of insert ops recorded, which correlates with high disk utilization. In the middle of this, a checkpoint begins and we start to see the number of active and queued writers rise quickly over the next few seconds.

At point B (2022-02-23T12:10:40.291Z) we see a substantial amount of application thread time spent evicting (denoted by `ss wt thread-yield application thread time evicting`). This can occur when the WiredTiger cache dirty fill ratio exceeds 20%, which is also observed.
I'm not immediately sure what to make of this and would be interested in getting additional logs, and a more specific timeline of events on your side if at all possible. I'd be interested in knowing when you're changing your chunk ranges, whether chunk migration is even occurring, or whether something unrelated is causing this. In the ticket you linked, we see indicators of moveChunk being initiated, but I do not see that in either your primary or secondary data.
Regards, Christopher |
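For reference, a minimal mongo shell sketch of how the two things discussed above can be checked: the dirty cache fill ratio (assuming the standard 4.2-era WiredTiger serverStatus field names) and whether chunk migrations actually ran (via the sharding changelog):

```javascript
// Approximate the WiredTiger dirty cache fill ratio from serverStatus.
// Application threads tend to be pulled into eviction as this ratio
// climbs past roughly 20%.
var cache = db.serverStatus().wiredTiger.cache;
var dirtyRatio = cache["tracked dirty bytes in the cache"] /
                 cache["maximum bytes configured"];
print("dirty cache fill ratio: " + (dirtyRatio * 100).toFixed(1) + "%");

// Check whether chunk migrations actually occurred around the incident
// by inspecting recent moveChunk entries in the sharding changelog.
db.getSiblingDB("config").changelog
  .find({ what: /moveChunk/ })
  .sort({ time: -1 })
  .limit(10)
  .forEach(function (e) { printjson({ time: e.time, what: e.what, ns: e.ns }); });
```

Run the cache check against a shard primary and the changelog query through a mongos (or against the config servers), since `config.changelog` lives in the config database.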
| Comment by Vladimir Beliakov [ 20/Jul/22 ] |
Sorry, forgot to add the link. Here it is. |