Core Server / SERVER-42783

Migrations don't wait for majority replication of cloned documents if there are no transfer mods


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 3.4.15, 3.6.4, 4.0.0, 4.2.0
    • Fix Version/s: 3.6.15, 4.3.1
    • Component/s: Sharding
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.2, v4.0, v3.6, v3.4
    • Steps To Reproduce:

      Run the following test that shows majority committed documents may not be in the recipient shard's majority committed snapshot after a migration:

      (function() {
      "use strict";
       
      // Set up a sharded cluster with two shards, two chunks, and one document in one of the chunks.
      const st = new ShardingTest({shards: 2, rs: {nodes: 2}, config: 1});
      const testDB = st.s.getDB("test");
       
      assert.commandWorked(testDB.foo.insert({_id: 1}, {writeConcern: {w: "majority"}}));
       
      st.ensurePrimaryShard("test", st.shard0.shardName);
      assert.commandWorked(st.s.adminCommand({enableSharding: "test"}));
      assert.commandWorked(st.s.adminCommand({shardCollection: "test.foo", key: {_id: 1}}));
      assert.commandWorked(st.s.adminCommand({split: "test.foo", middle: {_id: 0}}));
       
      // The document is in the majority committed snapshot.
      assert.eq(1, testDB.foo.find().readConcern("majority").itcount());
       
      // Advance a migration to the beginning of the cloning phase.
      assert.commandWorked(st.rs1.getPrimary().adminCommand(
          {configureFailPoint: "migrateThreadHangAtStep2", mode: "alwaysOn"}));
      TestData.toShard = st.shard1.shardName;
      const awaitMigration = startParallelShell(() => {
          jsTestLog("moving to " + TestData.toShard);
          assert.commandWorked(
              db.adminCommand({moveChunk: "test.foo", find: {_id: 1}, to: TestData.toShard}));
      }, st.s.port);
       
      // Sleep to let the migration reach the failpoint and allow any writes to become majority committed
      // before pausing replication.
      sleep(3000);
      st.rs1.awaitLastOpCommitted();
       
      // Disable replication on the recipient shard's secondary node, so the recipient shard's majority
      // commit point cannot advance.
      const destinationSec = st.rs1.getSecondary();
      assert.commandWorked(
          destinationSec.adminCommand({configureFailPoint: "rsSyncApplyStop", mode: "alwaysOn"}),
          "failed to enable fail point on secondary");
       
      // Allow the migration to begin cloning.
      assert.commandWorked(st.rs1.getPrimary().adminCommand(
          {configureFailPoint: "migrateThreadHangAtStep2", mode: "off"}));
       
      // The migration should finish cloning and commit without being able to advance the majority commit
      // point.
      awaitMigration();
       
      // The document is not in the recipient's majority committed snapshot, so the cloned document cannot
      // be found by a majority read and this assertion fails.
      assert.eq(
          1, testDB.foo.find().readConcern("majority").itcount(), "cloned doc not in majority snapshot!");
       
      assert.commandWorked(
          destinationSec.adminCommand({configureFailPoint: "rsSyncApplyStop", mode: "off"}),
          "failed to disable fail point on secondary");
       
      st.stop();
      })();
      

    • Sprint:
      Sharding 2019-08-26
    • Linked BF Score:
      23

      Description

      Before a chunk migration can commit, the shard receiving the chunk is supposed to wait for every document in the chunk to be majority committed, so that no data can be lost if there is a failover. Currently, the recipient shard waits for majority commitment of the later of two opTimes: the last opTime on the client of the thread driving the migration on the recipient at the end of the cloning phase, and the opTime of the latest transfer mod write.

      SERVER-32885 changed migration recipients to insert the documents received in the initial cloning phase on a separate thread, so the driving thread's last opTime at the end of the cloning phase is not the opTime of the insert of the latest cloned document. If there are no transfer mod writes (whose opTimes are correctly tracked and are guaranteed to be greater than the opTimes of the cloned-document inserts), the migration can commit without waiting for all cloned documents to be majority replicated.
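The faulty wait condition described above can be sketched in miniature. This is an illustrative JavaScript model, not the server's actual C++ code: opTimes are plain integers and every name here is hypothetical.

```javascript
// Hypothetical model of the recipient's pre-fix wait condition.

// opTime observed on the migration-driving thread's client at the end of
// cloning. After SERVER-32885, cloned-document inserts happen on a separate
// thread, so this can lag behind the opTime of the last cloned insert.
const drivingThreadLastOpTime = 5;

// opTime of the insert of the latest cloned document, performed by the
// separate inserter thread and thus invisible to the driving thread's client.
const lastClonedInsertOpTime = 8;

// opTimes of transfer mod writes; empty when there are no mods.
const transferModOpTimes = [];

// The recipient waits for this opTime to become majority committed before
// letting the migration commit.
function opTimeToWaitFor() {
    const lastTransferModOpTime =
        transferModOpTimes.length > 0 ? Math.max(...transferModOpTimes) : 0;
    return Math.max(drivingThreadLastOpTime, lastTransferModOpTime);
}

// With no transfer mods, the wait target (5) is behind the last cloned
// insert (8): the migration can commit before the cloned documents are
// majority replicated.
console.log(opTimeToWaitFor() < lastClonedInsertOpTime);  // → true
```

With at least one transfer mod, its opTime would exceed the cloned inserts' opTimes and the wait target would be correct, which is why the bug only manifests when there are no transfer mods.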

      This has at least the following implications for migrations that involve no transfer mods:

      • Majority reads without causal consistency may fail to return documents inserted on the same session with majority write concern, even without a failover (as in the repro in Steps To Reproduce).
      • A recipient shard stepdown after the migration commits may lead to a rollback of previously majority committed documents.
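The first implication can also be modeled in miniature: a majority read reflects only writes at or below the node's majority commit point, so while the recipient's commit point is pinned behind the cloned insert's opTime, the cloned document is invisible. Again an illustrative JavaScript sketch with integer opTimes and hypothetical names, not server code:

```javascript
// Illustrative model of majority-read visibility on the recipient shard.

// The secondary's replication is paused (rsSyncApplyStop in the repro),
// so the majority commit point is pinned before the cloned insert.
const majorityCommitPoint = 5;
const clonedInsertOpTime = 8;  // write applied on the primary only

// A majority read sees exactly the writes at or below the majority
// commit point.
function visibleToMajorityRead(writeOpTime, commitPoint) {
    return writeOpTime <= commitPoint;
}

// The cloned document is not in the majority committed snapshot, matching
// the failing assertion at the end of the repro.
console.log(visibleToMajorityRead(clonedInsertOpTime, majorityCommitPoint));
// → false

// Once the secondary catches up and the commit point passes the insert's
// opTime, the document becomes visible to majority reads again.
console.log(visibleToMajorityRead(clonedInsertOpTime, 9));
// → true
```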

      SERVER-32885 was backported to 3.4, so this may affect releases from 3.4 onward, although I haven't manually verified this.

      Attachments

      Activity

      People

      • Votes:
        0
      • Watchers:
        13

      Dates

      • Created:
      • Updated:
      • Resolved: