Core Server — SERVER-29190

moveChunk fails if the secondary member of the donor is down


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Works as Designed
    • Affects Version/s: 3.4.3
    • Fix Version/s: None
    • Component/s: Replication, Sharding
    • Labels:
      None
    • Operating System:
      ALL
    • Steps To Reproduce:

A setup with 2 shards (rs2 and rs1); each shard is a replica set with a primary, a secondary, and an arbiter.

We want to move a chunk from rs2 to rs1. In replica set rs2, take the secondary member down.

      This is the balancer configuration in the config db:

      mongos> db.settings.find({_id:"balancer"})
      { "_id" : "balancer", "stopped" : true, "_secondaryThrottle" : false, "mode" : "off" }
      

      Run the moveChunk command:

      mongos> db.runCommand({moveChunk:"agent.pageloadHarEntries", bounds:[{ _id: "a23036:108:1483038000:1483038060" },{ _id: "a23036:108:1483801200:1483801260" }], to: "rs1", _secondaryThrottle:false})
      {
          "code" : 96,
          "ok" : 0,
          "errmsg" : "moveChunk command failed on source shard. :: caused by :: WriteConcernFailed: waiting for replication timed out"
      }
      

The logs on the primary member of rs2 show a large number of _transferMods commands until the migration is aborted with _recvChunkAbort.

      Attached you can find the logs of each primary.

      More info:

• I tried changing _secondaryThrottle to true with writeConcern 1, so that it only waits for the primary, but the same thing happens.
• If I start the secondary member and repeat the moveChunk, it works.
• With the secondary member down, moving the chunk the other way around, from rs1 to rs2, works.

Important: this caused a small outage for us, probably because the migration entered the critical section described here: https://github.com/mongodb/mongo/wiki/Sharding-Internals#migration-protocol-summary
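The "Works as Designed" resolution is consistent with the migration's critical section waiting for majority replication regardless of _secondaryThrottle. In a primary-secondary-arbiter (PSA) set, the arbiter votes in elections but cannot acknowledge writes, so with the secondary down a majority write concern can never be satisfied. A minimal sketch of that arithmetic (plain Node.js; member names and structure are illustrative, not MongoDB internals):

```javascript
// Sketch: why w:"majority" cannot be satisfied in a PSA replica set
// when the secondary is down.

// Majority of voting members: floor(voters / 2) + 1
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// Count members that can acknowledge a write: data-bearing and up.
// Arbiters vote in elections but never acknowledge writes.
function ackCapable(members) {
  return members.filter(m => !m.arbiter && m.up).length;
}

const rs2 = [
  { name: "primary",   arbiter: false, up: true  },
  { name: "secondary", arbiter: false, up: false }, // taken down
  { name: "arbiter",   arbiter: true,  up: true  },
];

const needed = majority(rs2.length); // 2 of 3 voters
const available = ackCapable(rs2);   // only the primary: 1

console.log(`need ${needed}, can acknowledge ${available}`);
console.log(available >= needed
  ? "w:'majority' can be satisfied"
  : "w:'majority' times out -> WriteConcernFailed");
```

This also matches the observation that the migration works in the other direction: rs1 has all its data-bearing members up, so its majority can still be acknowledged.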


      Description

If a member of a shard's replica set is down in a sharded cluster, migrating a chunk from that replica set to another one fails with:

      moveChunk command failed on source shard. :: caused by :: WriteConcernFailed: waiting for replication timed out
      

This happens even if _secondaryThrottle is false and there is no write concern, or the write concern is 1 (whether set in the config.settings collection or in the moveChunk command when executed manually).

This happens with the regular balancer, and after stopping it and running moveChunk manually, I get the same output.
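One way to check up front whether the donor set can satisfy a majority write concern is to count healthy data-bearing members in its replSetGetStatus (rs.status()) output. A hedged sketch against a hand-made, rs.status()-shaped mock (the field names `members`, `health`, and `stateStr` follow the real command output; the member data below is invented):

```javascript
// Sketch: given an rs.status()-like document, compare the number of
// healthy data-bearing members to the majority of members.
// The document below is a hand-made mock, not real cluster output.

function canSatisfyMajority(status) {
  const members = status.members;
  // Simplifying assumption: every member votes (the default).
  const needed = Math.floor(members.length / 2) + 1;
  const healthyDataBearing = members.filter(
    m => m.health === 1 && m.stateStr !== "ARBITER"
  ).length;
  return { needed, healthyDataBearing, ok: healthyDataBearing >= needed };
}

const mockStatus = {
  set: "rs2",
  members: [
    { name: "rs2a:27017", health: 1, stateStr: "PRIMARY" },
    { name: "rs2b:27017", health: 0, stateStr: "(not reachable/healthy)" },
    { name: "rs2c:27017", health: 1, stateStr: "ARBITER" },
  ],
};

// With the secondary down, only 1 of the required 2 acknowledgers remains.
console.log(canSatisfyMajority(mockStatus));
```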

        Attachments

        1. rs1primary.log
          4 kB
        2. rs2primary.log
          7 kB


              People

• Votes: 0
• Watchers: 6