[SERVER-29190] moveChunk fails if the secondary member of the donor is down Created: 13/May/17  Updated: 27/Oct/23  Resolved: 16/May/17

Status: Closed
Project: Core Server
Component/s: Replication, Sharding
Affects Version/s: 3.4.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: VictorGP Assignee: Kaloian Manassiev
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File rs1primary.log     File rs2primary.log    
Issue Links:
Related
is related to SERVER-22876 Ensure 'majority' writes are possible... Closed
Operating System: ALL
Steps To Reproduce:

A setup of 2 shards (rs2 and rs1); each shard is a replica set with two data-bearing members (primary and secondary) plus an arbiter.

Move a chunk from rs2 to rs1 after taking the secondary member of rs2 down.

This is the balancer configuration in the config db:

mongos> db.settings.find({_id:"balancer"})
{ "_id" : "balancer", "stopped" : true, "_secondaryThrottle" : false, "mode" : "off" }

Run the moveChunk command:

mongos> db.runCommand({moveChunk:"agent.pageloadHarEntries", bounds:[{ _id: "a23036:108:1483038000:1483038060" },{ _id: "a23036:108:1483801200:1483801260" }], to: "rs1", _secondaryThrottle:false})
{
    "code" : 96,
    "ok" : 0,
    "errmsg" : "moveChunk command failed on source shard. :: caused by :: WriteConcernFailed: waiting for replication timed out"
}

The logs on the primary member of rs2 show a huge number of _transferMods commands until the migration is aborted with _recvChunkAbort.

Attached you can find the logs of each primary.

More info:

  • I tried changing _secondaryThrottle to true with writeConcern 1 so that it only waits for the primary, but the same thing happens (see the sketch after this list).
  • If I start the secondary member and repeat the moveChunk, it works.
  • With the secondary member down, if I move the chunk the other way around, from rs1 to rs2, it works.
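A sketch of the manual moveChunk with _secondaryThrottle set to true and write concern 1, as mentioned in the first bullet (bounds copied from the command above; the exact invocation used may have differed slightly):

mongos> db.runCommand({
    moveChunk: "agent.pageloadHarEntries",
    bounds: [ { _id: "a23036:108:1483038000:1483038060" },
              { _id: "a23036:108:1483801200:1483801260" } ],
    to: "rs1",
    _secondaryThrottle: true,
    writeConcern: { w: 1 }
})

While the secondary of rs2 is down, this fails with the same WriteConcernFailed error.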

Important: we had a small outage caused by this, probably because it entered the critical section described here: https://github.com/mongodb/mongo/wiki/Sharding-Internals#migration-protocol-summary

Participants:

 Description   

If a member is down in a replica set of a sharded cluster, the chunk migration from the replica set with the node down to another replica set fails with:

moveChunk command failed on source shard. :: caused by :: WriteConcernFailed: waiting for replication timed out

This happens even if _secondaryThrottle is false and there is no write concern, or there is write concern 1 (set both in the config.settings collection and in the moveChunk command when executed manually).

This is happening with the regular balancer, and after stopping it and running moveChunk manually I get the same output.
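For reference, a sketch of how the balancer-level _secondaryThrottle is set in config.settings (the standard update against the config database; the exact command we used may have differed slightly):

mongos> db.getSiblingDB("config").settings.update(
    { "_id" : "balancer" },
    { $set : { "_secondaryThrottle" : false } },
    { upsert : true }
)

The per-command setting is passed directly in the moveChunk invocation (the _secondaryThrottle and optional writeConcern fields), as shown in the Steps To Reproduce.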



 Comments   
Comment by Kaloian Manassiev [ 16/May/17 ]

Thanks, Victor.

The operations were stuck also in the donor shard.

We take great precautions not to block the donor shard for too long in the critical section, so this is not expected. Neither is blocking on the recipient's side, for that matter, though that could possibly be explained by a replication primary stepdown due to a loss of quorum. However, without the logs we won't be able to get more insight into what happened.

I am going to close this ticket as 'Works as designed', but if you hit the same stall again, please try to preserve the logs if possible and open a new ticket. We can provide a secure upload portal if they contain sensitive information.

Thank you in advance for your help.

Best regards,
-Kal.

Comment by VictorGP [ 16/May/17 ]

I might have missed that part in the documentation, thanks for the clarification.

The operations were stuck also in the donor shard; I guess that because it is a sharded collection, both shards need to be running properly. Unfortunately, I don't have more logs from when the issue happened.

If this is something you are aware of, and with SERVER-22876 fixed the migration won't start at all, I think it is fair enough to close this issue; I can wait for that fix.

Thank you for your support, Kal.

EDIT: I see I can't close it, so if you are OK with that, you can go ahead.

Comment by Kaloian Manassiev [ 16/May/17 ]

Hi victorgp,

The majority write requirements are described in the sharding balancer documentation, which states:

MongoDB briefly pauses all application writes to the source shard before updating the config servers with the new location for the chunk, and resumes the application writes after the update. The chunk move requires all writes to be acknowledged by majority of the members of the replica set both before and after committing the chunk move to config servers.

The _transferMods commands can be large in number if there are a lot of writes happening to the chunk being migrated. They won't cause performance degradation by themselves; they are just like a normal query (much like what replication does), except for those which happen after the shard enters the critical section (because that is when reads and writes to the collection stop), and those should be much smaller in number. The reason you are seeing another moveChunk following the failed one is that the balancer sees the failure and simply retries it on the next round.
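As a side note, if the balancer were enabled, balancing (and therefore balancer-driven retries) can be suspended for just that namespace with the standard sh.disableBalancing helper; this is only a sketch for completeness, not something required here, where the balancer is already off:

mongos> sh.disableBalancing("agent.pageloadHarEntries")   // stop balancer-driven migrations for this collection
mongos> sh.enableBalancing("agent.pageloadHarEntries")    // re-enable once the shard is healthy again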

Glad you have this issue open: https://jira.mongodb.org/browse/SERVER-22876. But what would happen after that fix? Will the chunk migration not start, or will the write concern move to 1 instead of majority?

With this fix, chunk migration won't start at all if the secondary can't do majority writes.
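In the meantime, a quick way to check whether a shard can currently satisfy majority writes is a throwaway write with w: "majority" and a short wtimeout against its primary (just a sketch; majorityProbe is an arbitrary scratch collection, not anything the migration protocol uses):

rs2:PRIMARY> db.getSiblingDB("test").majorityProbe.insert(
    { probedAt: new Date() },
    { writeConcern: { w: "majority", wtimeout: 5000 } }
)

With the secondary down, this should report a writeConcernError with the same "waiting for replication timed out" message once the 5 second timeout expires.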

You mention that

... the shard that had a member down wasn't accepting reads or writes, so for any sharded collection that goes to both shards, the queries were stuck completely.

Were the operations stuck only on the shard which had a member down, or on the donor shard as well? I ask because the migration commit has a deadline of at most 30 seconds, so that the donor doesn't spend too much time in the critical section. That way, even though the recipient could be stuck trying to commit, the donor will give up much earlier.

Would it be possible to attach the complete logs from the entire event from both the donor's and recipient's primaries so we can have a look?

Best regards,
-Kal.

Comment by VictorGP [ 16/May/17 ]

Hi Kaloian,

OK, I understand. I think you should mention in the chunk migration documentation that if a majority of members is not available in that shard, the chunk migration fails because of this majority write concern on the recipient. Otherwise people will see these errors and won't understand why.

As for the outage we had, I'm not sure exactly why it happened, but the shard that had a member down wasn't accepting reads or writes, so for any sharded collection that goes to both shards, the queries were stuck completely. This lasted for around 5 minutes; I stopped the mongod process (SIGTERM) and, after another 4 minutes of shutting down, it finally stopped. My guess is that, for whatever reason, it stayed in the critical section for more time than expected, because according to the protocol:

Once it sees the recipient is ready, the donor enters the "critical section." This means the donor does not accept any reads or writes.

The only thing I could see in the logs was a very large number of _transferMods commands, so I think it could be related to this, because these _transferMods commands might effectively be DoS-ing us: for a single chunk migration that failed, I counted 5871 of them in ~20 seconds. Don't you think those are too many and can cause issues? They are triggered constantly as long as the secondary member is down: MongoDB retries the moveChunk and, again, produces tons of _transferMods.

Glad you have this issue open: https://jira.mongodb.org/browse/SERVER-22876. But what would happen after that fix? Will the chunk migration not start, or will the write concern move to 1 instead of majority?

Comment by Kaloian Manassiev [ 15/May/17 ]

Hi victorgp,

Regardless of the secondary throttle setting, the recipient shard must always do a majority write at the end of the migration process (during the critical section). Otherwise you may end up in a situation where the migration has been committed on the config server, but documents or updates get lost because of rollback on the recipient.

The other direction is not a problem, because the orphaned data cleanup process on the donor is asynchronous, and if it fails, it does not cause a correctness problem.

... we had a small outage caused by this, probably because it entered the critical section described here

What was the nature of the outage and how did you recover from it? Was it that the critical section took too much time waiting for the majority write concern to be satisfied?

Best regards,
-Kal.

PS: We have a ticket (SERVER-22876), which would reduce the likelihood of such errors by ensuring that all the nodes involved in the migration protocol can do majority writes before it even enters the critical section.
