[SERVER-29190] moveChunk fails if the secondary member of the donor is down Created: 13/May/17 Updated: 27/Oct/23 Resolved: 16/May/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Sharding |
| Affects Version/s: | 3.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | VictorGP | Assignee: | Kaloian Manassiev |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
|
| Issue Links: |
|
| Operating System: | ALL |
| Steps To Reproduce: | A setup of 2 shards (rs2 and rs1); each shard is a replica set with a primary, a secondary, and an arbiter. Let's move a chunk from rs2 to rs1. In replica set rs2, take the secondary member down. This is the balancer configuration in the config db:
Run the moveChunk command:
The logs on the primary member of rs2 show tons of _transferMods commands until the migration is aborted with _recvChunkAbort. Attached you can find the logs of each primary. More info:
Important: we had a small outage caused by this, probably because the migration entered the critical section described here: https://github.com/mongodb/mongo/wiki/Sharding-Internals#migration-protocol-summary |
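An illustrative sketch of the balancer-settings check and the manual moveChunk call referenced above, assuming a hypothetical namespace "mydb.mycoll" and shard-key value (placeholders, not the reporter's actual values):

```js
// Check the balancer settings in the config database (run via mongos).
// _secondaryThrottle: false means the donor does not wait for each copied
// document to replicate to the recipient's secondaries during the clone phase.
db.getSiblingDB("config").settings.find({ _id: "balancer" })

// Manually move one chunk from shard rs2 to shard rs1.
db.getSiblingDB("admin").runCommand({
  moveChunk: "mydb.mycoll",   // hypothetical namespace
  find: { x: 42 },            // hypothetical shard-key value
  to: "rs1",
  _secondaryThrottle: false
})
```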
| Participants: | |
| Description |
|
If a member is down in a replica set of a sharded cluster, chunk migration from the replica set with the node down to another replica set fails with:
This happens even if _secondaryThrottle is false and there is no write concern, or the write concern is 1 (both in the config.settings collection and in the moveChunk command when executed manually). It happens with the regular balancer, and after stopping the balancer and running moveChunk manually I get the same output. |
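For reference, a minimal sketch of how the secondary throttle and its write concern can be set balancer-wide, assuming the config.settings layout where _secondaryThrottle and writeConcern live in the balancer document (the w: 1 value illustrates the "write concern 1" case mentioned above):

```js
// Balancer-wide setting in the config database: enable secondary throttle
// and set the write concern applied to documents copied during migration.
// The same _secondaryThrottle / writeConcern options can also be passed
// directly to the moveChunk command, as in the earlier sketch.
db.getSiblingDB("config").settings.update(
  { _id: "balancer" },
  { $set: { _secondaryThrottle: true, writeConcern: { w: 1 } } },
  { upsert: true }
)
```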
| Comments |
| Comment by Kaloian Manassiev [ 16/May/17 ] | |
|
Thanks, Victor.
We take great precautions not to block the donor shard for too long in the critical section, so this is not expected. Neither is blocking on the recipient's side, for that matter, but that could possibly be explained by a replication primary stepdown due to loss of quorum. However, without the logs we won't be able to get more insight into what happened.
I am going to close this ticket as 'Works as Designed', but if you hit the same stall again, please try to preserve the logs if possible and open a new ticket. We can provide a secure upload portal if they contain sensitive information.
Thank you in advance for your help. Best regards, |
| Comment by VictorGP [ 16/May/17 ] | |
|
I might have missed that part in the documentation, thanks for the clarification. The operations were stuck on the donor shard as well; I guess because it is a sharded collection, both shards need to be running properly. Unfortunately, I don't have more logs from when the issue happened. If this is something that you are aware of, and with SERVER-22876 already open for it, I am fine with closing this ticket.
Thank you for your support, Kal.
EDIT: I see I can't close it, so if you are OK with that, you can go ahead. |
| Comment by Kaloian Manassiev [ 16/May/17 ] | |
|
Hi victorgp,
The majority write requirements are described in the sharding balancer documentation: regardless of the secondaryThrottle setting, the final step of a chunk migration waits for a w: majority write on the recipient.
The _transferMods commands can be large in number if there are a lot of writes happening to the chunk being migrated. They won't cause performance degradation by themselves; each one is just like a normal query (much like what replication does), except for those which happen after the shard enters the critical section (because that's when reads and writes to the collection stop), and those should be much fewer in number. The reason you are seeing another moveChunk following the failed one is that the balancer sees the failure and simply retries it on the next balancing round.
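A minimal sketch of how those repeated balancer attempts can be inspected from mongos, assuming the migration events appear in config.changelog under "what" values starting with "moveChunk" (the exact event names and the namespace below are assumptions):

```js
// List the most recent migration-related changelog entries for one
// (hypothetical) namespace, newest first.
db.getSiblingDB("config").changelog.find(
  { what: /^moveChunk/, ns: "mydb.mycoll" }
).sort({ time: -1 }).limit(20).pretty()
```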
With this fix, the chunk migration won't start at all if the shard can't perform majority writes. You mention that the queries were stuck.
Were the operations stuck only on the shard which had a member down, or on the donor shard as well? Migration commit has a deadline of at most 30 seconds, so that the donor doesn't spend too much time in the critical section; that way, even though the recipient could be stuck trying to commit, the donor will give up much earlier. Would it be possible to attach the complete logs of the entire event from both the donor's and the recipient's primaries so we can have a look?
Best regards, |
| Comment by VictorGP [ 16/May/17 ] | |
|
Hi Kaloian,
OK, I understand. I think you should mention in the chunk migration documentation that if a majority of members is not available in that shard, the chunk migration fails because of this majority write concern on the recipient. Otherwise people will see these errors and won't understand why.
As for the outage we had, I'm not sure exactly why it happened, but the shard that had a member down wasn't accepting reads or writes, so for any sharded collection spanning both shards, the queries were stuck completely. This lasted for around 5 minutes; I stopped the mongod process (SIGTERM) and after another 4 minutes of shutting down, it finally stopped. My guess is that, for whatever reason, it stayed in the critical section for more time than expected, because according to the protocol:
The only thing I could see in the logs was tons of _transferMods commands, so I think it could be related to this, because these _transferMods commands might be causing a self-inflicted DoS: for a single chunk migration that failed, I counted 5871 of them in ~20 seconds. Don't you think those are too many and can cause issues? They are triggered constantly as long as the secondary member is down: MongoDB retries the moveChunk and, again, tons of _transferMods.
Glad you have this issue open: https://jira.mongodb.org/browse/SERVER-22876, but what would happen after that fix? Will the chunk migration not start at all, or will the write concern move to 1 instead of majority? |
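To put that count in perspective, a quick back-of-the-envelope calculation using the figures reported above:

```js
// ~5871 _transferMods commands observed over roughly 20 seconds on the
// donor primary (numbers taken from the comment above).
var commandsObserved = 5871;
var windowSeconds = 20;
print((commandsObserved / windowSeconds) + " commands/second"); // ~293.6
```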
| Comment by Kaloian Manassiev [ 15/May/17 ] | |
|
Hi victorgp,
Regardless of the secondary throttle setting, the recipient shard must always do a majority write at the end of the migration process (during the critical section). Otherwise you may end up in a situation where the migration has been committed on the config server, but documents or updates get lost because of a rollback on the recipient. The other way around is not a problem, because the orphaned-data cleanup process on the donor is asynchronous, and if it fails, it does not cause a correctness problem.
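To illustrate why that final step stalls in this topology, a minimal sketch, assuming a primary-secondary-arbiter shard whose data-bearing secondary is down (the namespace is a hypothetical placeholder): a w: "majority" write on that shard cannot be acknowledged, which is the same condition the migration commit waits on.

```js
// Run against the shard's primary while its data-bearing secondary is down.
// With only a primary and an arbiter reachable, this write cannot reach a
// majority of data-bearing members; it blocks and then fails at wtimeout.
db.getSiblingDB("mydb").mycoll.insert(
  { probe: 1 },
  { writeConcern: { w: "majority", wtimeout: 5000 } }
)
```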
What was the nature of the outage and how did you recover from it? Was it that the critical section took too much time waiting for the majority write concern to be satisfied?
Best regards,
PS: We have a ticket (SERVER-22876) to address this; with that fix, the chunk migration won't start at all if the shard can't perform majority writes. |