[SERVER-45163] Balancer fails with moveChunk.error 'waiting for replication timed out' despite _secondaryThrottle unset Created: 16/Dec/19 Updated: 20/Dec/19 Resolved: 19/Dec/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Sharding |
| Affects Version/s: | 4.2.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jascha Brinkmann | Assignee: | Dmitry Agranat |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
|
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Steps To Reproduce: | Deploy a sharded cluster with at least 2 shards, each deployed as a PSA replica set. Begin filling a sharded collection with enough data that the balancer starts running. After a short while, checking the sharding status via sh.status() shows errors similar to: {{Failed with error 'aborted', from shard3rs to shard2rs}} |
| Participants: | |
| Description |
|
I have a MongoDB Deployment with 3 shards, all deployed as a PSA (Primary, Secondary, Arbiter) Replica Set.
On shard2rs, the shard the chunk was being moved to, I see the same:
So why does the migration still fail with a "waiting for replication timed out" error? If necessary I can supply logs of the whole cluster via a secure upload. (I'm unsure whether a Jira file attachment would make them publicly accessible.) |
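For reference, here is a minimal shell sketch (run against a mongos) of the checks I use to confirm whether a {{_secondaryThrottle}} override is stored for the balancer and what the recent migration outcomes look like. The namespace {{test.mycoll}}, the {{find}} key and the destination shard in the manual {{moveChunk}} are placeholders rather than values from this cluster, and the changelog field names are only how I understand the 4.2 schema:

{code:javascript}
// Is a _secondaryThrottle override stored in the balancer settings document?
db.getSiblingDB("config").settings.find({ _id: "balancer" }).pretty()

// Recent migration outcomes recorded in the config changelog
// (the same data sh.status() summarises under "Migration Results"):
db.getSiblingDB("config").changelog.find(
    { what: /^moveChunk/ },
    { time: 1, what: 1, "details.errmsg": 1, "details.note": 1 }
).sort({ time: -1 }).limit(10).pretty()

// Manual migration with the throttle explicitly disabled, for comparison.
// "test.mycoll", the find key and the destination shard are placeholders.
db.adminCommand({
    moveChunk: "test.mycoll",
    find: { _id: 0 },
    to: "shard2rs",
    _secondaryThrottle: false
})
{code}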
| Comments |
| Comment by Jascha Brinkmann [ 20/Dec/19 ] |
|
I have been running this for a couple more hours. The result is the same:
Moving chunks works for a short while, but then all successive balancing operations fail after the first moveChunk.error is encountered. The only remedy is to completely restart the whole cluster, after which it works again for a short while before entering the same vicious loop. The only difference from before is that I now see hundreds of these errors in the logs:
CPU is once again at 100% and the cluster comes to a halt. |
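To quantify how the failures accumulate over time I have been using a small aggregation against the config changelog on a mongos. This is only a sketch; the {{details.errmsg}} field name reflects how I understand the changelog documents, so adjust as needed:

{code:javascript}
// Count migration-related changelog entries per hour, split by type and error message.
db.getSiblingDB("config").changelog.aggregate([
    { $match: { what: { $in: ["moveChunk.from", "moveChunk.error"] } } },
    { $group: {
        _id: {
            hour: { $dateToString: { format: "%Y-%m-%dT%H:00", date: "$time" } },
            what: "$what",
            errmsg: "$details.errmsg"
        },
        count: { $sum: 1 }
    } },
    { $sort: { "_id.hour": -1, count: -1 } }
])
{code}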
| Comment by Jascha Brinkmann [ 20/Dec/19 ] |
|
Hey Dmitry, thanks for your input. We are still in the testing phase, so as I wrote in the community channel I've tried different node sizes and still encountered the same issue. After your suggestion I have tried once more, upgrading all cluster nodes to 4 CPUs and 8 GB RAM. The cache size is now 14 times larger. While the initial import into the sharded collection works fine and without any major problems, the balancer still has trouble moving chunks afterwards. To test this I removed the other shards from the cluster via db.adminCommand( { removeShard: "shard2rs" }) in order to get the balancer to rebalance all chunks back to the first shard. After a short while I once again see a lot of moveChunk.errors (please note that these stats are for a brand-new cluster):
This time I do see an error message on the first shard in addition to the moveChunk.errors in the other shards:
Only about 40% of the moveChunk commands seem to be successful. Is this the expected behaviour? |
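For completeness, this is how I have been watching the draining progress; as far as I understand, re-issuing {{removeShard}} only reports the current status and does not start anything new:

{code:javascript}
// Run against a mongos; "shard2rs" is the shard being drained above.
var status = db.adminCommand({ removeShard: "shard2rs" });
printjson(status);  // watch status.state ("ongoing"/"completed") and status.remaining.chunks
{code}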
| Comment by Dmitry Agranat [ 19/Dec/19 ] |
|
Hi jascha.brinkmann+mongodb@gmail.com, With 1 core and 250 MB of cache per node, this cluster is under-provisioned. Please refer to our Production Notes to make sure your cluster is configured to sustain your workload. As you can see below, chunk migrations are being aborted (marked A, B, C and D) when your single core is pegged. This in turn creates a snowball effect of lagging secondaries, which adds more pressure on the cache, and so on. I will go ahead and close this ticket as I do not see a MongoDB bug in this case. I also see you have already started a conversation in our community channel to discuss troubleshooting and tuning options; a quick way to compare the configured and used cache on each node is sketched below. Regards, |
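A minimal sketch of the cache check mentioned above; the statistic names are the WiredTiger fields exposed by serverStatus():

{code:javascript}
// Run on each mongod to compare the configured vs. currently used WiredTiger cache.
var cache = db.serverStatus().wiredTiger.cache;
print("configured bytes: " + cache["maximum bytes configured"]);
print("in-cache bytes  : " + cache["bytes currently in the cache"]);
print("dirty bytes     : " + cache["tracked dirty bytes in the cache"]);
{code}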
| Comment by Jascha Brinkmann [ 18/Dec/19 ] |
|
Hey Dima, I have uploaded all the requested files. As far as I am aware there is no diagnostic.data or db directory for mongoS nodes, or at least I wasn't able to locate one. So I have only uploaded the log files for the mongoS nodes. Please note that the cluster currently has only two mongoS nodes instead of three. These are all the nodes:
Please note as well that I have since wiped the data directory of one of the secondary instances of the replica sets at least once, in order to force an initial sync and get the replication lag back down to zero. So the diagnostic.data for some of the secondary nodes might be incomplete. It looks like I was able to recover the previous contents, but I just wanted you to be aware. Please let me know if you need anything else. Thanks and best regards |
| Comment by Dmitry Agranat [ 18/Dec/19 ] |
|
Hi jascha.brinkmann+mongodb@gmail.com, Thank you for the information provided so far. In order to investigate this issue further, we'll need to collect some more data. Please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location. We will need this information from all shard members, i.e. 9 replica set nodes (PSS * 3), and the 3 mongoS nodes. For each node we'll need both mongod.log and diagnostic.data. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Thanks, |
| Comment by Jascha Brinkmann [ 17/Dec/19 ] |
|
Shard 1:
Shard 2:
Shard 3:
|
| Comment by Dmitry Agranat [ 17/Dec/19 ] |
|
Hi jascha.brinkmann+mongodb@gmail.com, thank you for the clarification and for providing more context. One last thing I'd like to check before collecting all the information: could you post the output of rs.conf() from both shards? A sketch of what would be sufficient is below. Thanks, |
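For example, something along these lines run against each shard's primary would be sufficient (just a sketch that trims rs.conf() down to the fields of interest):

{code:javascript}
// Run while connected to each shard's primary.
var conf = rs.conf();
printjson({
    _id: conf._id,
    members: conf.members.map(function (m) {
        return { host: m.host, arbiterOnly: m.arbiterOnly, priority: m.priority, votes: m.votes };
    }),
    settings: conf.settings
});
{code}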
| Comment by Jascha Brinkmann [ 17/Dec/19 ] |
|
Hey Dmitry, thanks for your response. Sorry that I forgot to mention it, but I did have {{ readConcern: majority }} disabled as advised in the documentation. It seems that I was able to resolve the specific timeout error by restarting the whole cluster. As I was still getting lots of {{ moveChunk.errors }} I removed the whole cluster and set it up once more, this time as a PSS (Primary, Secondary, Secondary) deployment, as I suspected that taking advantage of flow control might resolve the balancer issues. But I still encounter the same problem: hundreds of {{ moveChunk.errors }}. This time the errors don't seem to be due to timed-out replication, as there is always at least one secondary up to date (even though one of the secondaries of a replica set always seems to have trouble keeping up), but due to something else.
Unfortunately in the logs on shard3rs I only see a single entry for each failed {{ moveChunk }} command:
Only very occasionally (maybe once for every 100 failed moveChunk commands) do I see this:
But the second {{ Chunk move failed }} error actually seems to be unrelated to the {{ moveChunk.error }} right before it, as it already belongs to the next {{ moveChunk }} command that follows immediately afterwards. So the two errors are only related in that the first failed {{ moveChunk }} has not been completely aborted yet while the next one is already being started, as you can see from the matching {{ { subscriberId: -527165117499500881 }, { subscriberId: -513290454848528462 } }}. That is also why I only see it occasionally: most of the time it just continues trying to move the next chunk without problems until it fails again at the end. There are no errors in the {{ shard2rs }} and {{ shard1rs }} logs. The constant aborting of chunk migrations also seems to be the reason why my replica sets can't keep up with the constant application and rolling back of changes:
Happy to share the logs, just let me know if it is safe to upload them here. |
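In case it helps, this is roughly how I have been checking flow control and the secondary lag on each shard primary. It is only a sketch: the flowControl section of serverStatus() exists from 4.2, and the lag figure is derived from the rs.status() optimes rather than any official metric:

{code:javascript}
// Flow control state on the primary.
var fc = db.serverStatus().flowControl;
printjson({ enabled: fc.enabled, isLagged: fc.isLagged, targetRateLimit: fc.targetRateLimit });

// Rough replication lag per secondary, derived from rs.status() optimes.
var status = rs.status();
var primary = status.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
status.members.forEach(function (m) {
    if (m.stateStr === "SECONDARY") {
        print(m.name + " lag (s): " + (primary.optimeDate - m.optimeDate) / 1000);
    }
});
{code}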
| Comment by Dmitry Agranat [ 17/Dec/19 ] |
|
Hi jascha.brinkmann+mongodb@gmail.com, In your comment you've mentioned that this replica set uses arbiters, and (I assume) readConcern:majority is enabled. This is not a recommended configuration. Please see our documentation on how to address this issue; a quick way to verify the effective setting is sketched below. If you still see the same issue once this is addressed, please let us know and we'll be happy to take a look. Thanks, |
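A rough sketch of the check mentioned above; to my understanding, when a node runs with --enableMajorityReadConcern=false the storage engine reports that committed reads are not supported, but please treat this as an assumption rather than an authoritative test:

{code:javascript}
// Run on each data-bearing member.
var se = db.serverStatus().storageEngine;
print(se.name + " supportsCommittedReads: " + se.supportsCommittedReads);
{code}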