[SERVER-57846] Balancer hanging Created: 18/Jun/21 Updated: 29/Jun/21 Resolved: 29/Jun/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.4.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Wernfried Domscheit | Assignee: | Eric Sedor |
| Resolution: | Done | Votes: | 0 |
| Labels: | sharding |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Operating System: | ALL |
| Participants: |
| Description |
|
I have a Sharded Cluster and the Balancer seems to hang. Several collections are unbalanced:
Sharding status is like this. Apparently MongoDB hangs while balancing collection "mip.statistics"
As a quick solution I tried to drop the culprit collection, but without success:
I can insert or delete data in this collection and drop and create indexes, but dropping the collection itself is not possible. I also stopped and restarted the Balancer, without success. I even restarted the entire Sharded Cluster, also without success.
|
| Comments |
| Comment by Eric Sedor [ 29/Jun/21 ] |
|
Understood wernfried.domscheit@sunrise.net; I'll close this ticket but if this or a similar issue occurs again we'll be happy to take another look. Sincerely, |
| Comment by Wernfried Domscheit [ 29/Jun/21 ] |
|
The new deployment is running fine, and the old one had been running fine for months. One possible cause is our backup. The hosts are all virtual, and once a night they go offline for 1-2 seconds for a filesystem snapshot. I set the balancer activeWindow accordingly, so if this issue was caused by our backup, it should not occur again.
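For reference, a minimal sketch of how such a setting could be checked before applying it. It builds the update for the `balancer` document in `config.settings` and tests whether a snapshot time falls inside the window. The times shown are hypothetical (the actual values used in this deployment are not given in the ticket), and treating the stop time as exclusive is an assumption of this sketch.

```python
from datetime import time


def parse_hhmm(value):
    """Parse an 'HH:MM' string into a datetime.time."""
    hours, minutes = value.split(":")
    return time(int(hours), int(minutes))


def in_active_window(moment, start, stop):
    """Return True if `moment` lies in [start, stop), allowing wrap past midnight."""
    t, s, e = parse_hhmm(moment), parse_hhmm(start), parse_hhmm(stop)
    if s <= e:
        return s <= t < e
    # Window wraps past midnight, e.g. 22:00 to 04:00.
    return t >= s or t < e


def active_window_update(start, stop):
    """Build the filter and update documents for config.settings."""
    filter_doc = {"_id": "balancer"}
    update_doc = {"$set": {"activeWindow": {"start": start, "stop": stop}}}
    return filter_doc, update_doc


# Hypothetical values, not taken from the ticket.
SNAPSHOT_AT = "01:30"
WINDOW = ("03:00", "23:00")

filter_doc, update_doc = active_window_update(*WINDOW)
# With pymongo connected to a mongos, the documents could be applied as:
#   client["config"]["settings"].update_one(filter_doc, update_doc, upsert=True)
print("snapshot inside balancer window:", in_active_window(SNAPSHOT_AT, *WINDOW))
```

If the check reports that the snapshot time is inside the window, the window would need to be moved or shortened so the balancer is idle during the snapshot.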
|
| Comment by Eric Sedor [ 28/Jun/21 ] |
|
Hi wernfried.domscheit@sunrise.net, From the logs for this config server it looks like one balancer failure was due to an issue with shard_01:
Two were caused by a failover on the config server replica set (note the InterruptedDueToReplStateChange error):
I do see a window of time in these logs where I'm not sure why we don't see balancer activity (from 2021-06-18T13:19:01.345Z to 2021-06-18T21:59:01.410Z), but it sounds like we won't be able to get more information to investigate further why this might have been. Has this issue recurred on the new deployment? Sincerely, |
| Comment by Wernfried Domscheit [ 24/Jun/21 ] |
|
Hi Eric, I uploaded the log file.
Sorry, the diagnostic file is not available anymore. This mongo deployment is still a "proof of concept", so I wiped it and deployed a new one. Next time I will make a copy of the diagnostic path. Best Regards
|
| Comment by Eric Sedor [ 23/Jun/21 ] |
|
Hi wernfried.domscheit@sunrise.net, We'd like to understand what the balancer is doing. Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) for the primary node of the config server replica set, then upload that archive to this portal? Files uploaded here are only visible to MongoDB employees and are routinely deleted after some time. Thank you, |
| Comment by Wernfried Domscheit [ 21/Jun/21 ] |
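The archiving step requested above could be scripted roughly as follows. This is a sketch only: the function name, the `logs/` prefix inside the archive, and the example paths are assumptions, and it presumes the log files are named `mongod.log` plus rotated variants in a single directory.

```python
import tarfile
from pathlib import Path


def archive_diagnostics(log_dir, dbpath, out_file):
    """Bundle mongod.log* files and dbpath/diagnostic.data into a .tar.gz.

    Returns the list of top-level archive entries that were added.
    """
    log_dir, dbpath = Path(log_dir), Path(dbpath)
    added = []
    with tarfile.open(str(out_file), "w:gz") as tar:
        # Current and rotated log files, e.g. mongod.log, mongod.log.2021-06-18
        for log in sorted(log_dir.glob("mongod.log*")):
            arcname = "logs/" + log.name
            tar.add(str(log), arcname=arcname)
            added.append(arcname)
        # The diagnostic.data directory is added recursively.
        diag = dbpath / "diagnostic.data"
        if diag.is_dir():
            tar.add(str(diag), arcname="diagnostic.data")
            added.append("diagnostic.data")
    return added


# Example call with hypothetical paths (run on the config server primary):
#   archive_diagnostics("/var/log/mongodb", "/var/lib/mongo",
#                       "config-primary-diagnostics.tar.gz")
```

Taking the copy before wiping or redeploying a node keeps the diagnostic data available for later analysis.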
|
By using
I managed to drop the collection. However, the Balancer then moves on to the next collection, attempts to balance it, and hangs again. Best Regards |
| Comment by Wernfried Domscheit [ 18/Jun/21 ] |
|
Some more information:
|