Core Server / SERVER-29190

moveChunk fails if the secondary member of the donor is down

    • Type: Bug
    • Resolution: Works as Designed
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 3.4.3
    • Component/s: Replication, Sharding
    • Labels: None
    • Operating System: ALL

      The setup consists of 2 shards (rs2 and rs1); each shard is a replica set with a primary, a secondary, and an arbiter.

      Let's move a chunk from rs2 to rs1. In replica set rs2, take the secondary member down.
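
      For reference, a minimal sketch of the rs2 topology described above. The hostnames and ports are hypothetical, not taken from the original report; only the layout (primary, secondary, arbiter) matches the description:

      rs.initiate({
          _id: "rs2",
          members: [
              { _id: 0, host: "rs2-a.example.net:27018" },                       // primary
              { _id: 1, host: "rs2-b.example.net:27018" },                       // secondary (the member taken down)
              { _id: 2, host: "rs2-arb.example.net:27018", arbiterOnly: true }   // arbiter (votes, holds no data)
          ]
      })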

      This is the balancer configuration in the config db:

      mongos> db.settings.find({_id:"balancer"})
      { "_id" : "balancer", "stopped" : true, "_secondaryThrottle" : false, "mode" : "off" }
      

      Run the moveChunk command:

      mongos> db.runCommand({moveChunk:"agent.pageloadHarEntries", bounds:[{ _id: "a23036:108:1483038000:1483038060" },{ _id: "a23036:108:1483801200:1483801260" }], to: "rs1", _secondaryThrottle:false})
      {
          "code" : 96,
          "ok" : 0,
          "errmsg" : "moveChunk command failed on source shard. :: caused by :: WriteConcernFailed: waiting for replication timed out"
      }
      

      The logs of the primary member of rs2 show a flood of _transferMods commands until the migration is aborted with _recvChunkAbort.

      Attached are the logs of each primary.

      More info:

      • I tried changing _secondaryThrottle to true with a write concern of 1 so it only waits for the primary, but the same failure occurs (see the sketch after this list).
      • If I start the secondary member and repeat the moveChunk, it works.
      • With the secondary member down, moving the chunk the other way around, from rs1 to rs2, works.
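
      For clarity, the retried command from the first bullet was roughly the following (collection name and bounds reused from above; the exact invocation is a sketch, not copied from the report):

      mongos> db.runCommand({
          moveChunk: "agent.pageloadHarEntries",
          bounds: [{ _id: "a23036:108:1483038000:1483038060" },
                   { _id: "a23036:108:1483801200:1483801260" }],
          to: "rs1",
          _secondaryThrottle: true,
          writeConcern: { w: 1 }
      })

      According to the report, this fails with the same WriteConcernFailed error as above.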

      Important: we had a small outage caused by this, probably because the migration entered the critical section described here: https://github.com/mongodb/mongo/wiki/Sharding-Internals#migration-protocol-summary


      If a member of a replica set in a sharded cluster is down, chunk migration from the replica set with the node down to another replica set fails with:

      moveChunk command failed on source shard. :: caused by :: WriteConcernFailed: waiting for replication timed out
      

      This happens even if _secondaryThrottle is false and there is either no write concern or a write concern of 1 (both in the config.settings collection and in the moveChunk command when executed manually).
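
      A quick way to see the underlying constraint (a sketch, assuming a direct connection to the rs2 primary and a hypothetical throwaway collection test.probe): with the secondary down, the arbiter cannot acknowledge writes because it holds no data, so any write that waits for a majority times out, which is the same class of failure the migration reports.

      // run against the rs2 primary while the secondary is down
      db.getSiblingDB("test").probe.insert(
          { _id: 1 },
          { writeConcern: { w: "majority", wtimeout: 5000 } }
      )
      // => writeConcernError: "waiting for replication timed out"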

      This happens with the regular balancer, and after stopping it and running moveChunk manually I get the same output.

        1. rs1primary.log
          4 kB
        2. rs2primary.log
          7 kB

            Assignee:
            kaloian.manassiev@mongodb.com Kaloian Manassiev
            Reporter:
            victorgp VictorGP
            Votes:
            0
            Watchers:
            6

              Created:
              Updated:
              Resolved: