Core Server / SERVER-22535

Some index operations (drop index, abort index build, update TTL config) on collection during active migration can cause migration to skip documents

    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Sprint: Query 10 (02/22/16)

      The migration logic on the donor shard that performs the initial index scan for documents to clone does not handle invalidations properly, and will generate a truncated set of documents to clone if the executor is killed during the index scan.

      As a result, if an index operation that invalidates plan executors runs while the initial index scan for a migration is yielding, some documents are not transferred during the migration. These documents are then deleted from the cluster by the next migration cleanup job.
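      The failure mode can be illustrated with a toy model in plain JavaScript. This is only a sketch and not the server's migration code: the State values, makeScan, collectUnchecked, and collectChecked are hypothetical names chosen for illustration. It shows how a loop that stops on any non-ADVANCED state treats a killed scan as if it had reached the end, which yields a truncated result.

```javascript
// Toy model only: these names are hypothetical and do not mirror server classes.
const State = { ADVANCED: "ADVANCED", IS_EOF: "IS_EOF", DEAD: "DEAD" };

// A scan over `ids` that reports DEAD once `killAt` documents have been
// returned, standing in for an executor invalidated while it was yielded.
// Pass null for `killAt` to let the scan run to completion.
function makeScan(ids, killAt) {
  let pos = 0;
  return {
    next() {
      if (killAt !== null && pos >= killAt) return { state: State.DEAD };
      if (pos >= ids.length) return { state: State.IS_EOF };
      return { state: State.ADVANCED, id: ids[pos++] };
    },
  };
}

// Problem pattern: any non-ADVANCED state ends the loop, so DEAD looks like EOF
// and the caller silently receives a partial list.
function collectUnchecked(scan) {
  const out = [];
  let r;
  while ((r = scan.next()).state === State.ADVANCED) out.push(r.id);
  return out;
}

// Safer pattern: distinguish a killed scan from a finished one.
function collectChecked(scan) {
  const out = [];
  let r;
  while ((r = scan.next()).state === State.ADVANCED) out.push(r.id);
  if (r.state !== State.IS_EOF) {
    throw new Error("scan was killed before completion");
  }
  return out;
}

const ids = Array.from({ length: 10 }, (_, i) => i);
const truncated = collectUnchecked(makeScan(ids, 4));
console.log(truncated.length + " of " + ids.length + " ids collected");
```

      In this model the unchecked loop returns only the ids seen before the kill, while the checked loop raises an error that a caller could use to abort or retry the clone.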

      The following index operations invalidate plan executors, and thus are able to trigger this issue:

      • Dropping an index with the dropIndexes command.
      • Aborting an index build with killOp().
      • Updating the TTL configuration for an index with the collMod command.
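      For reference, the command documents behind two of the operations above can be sketched as plain objects. The helper names dropIndexesCmd and collModTtlCmd are hypothetical; the collection name, key pattern, and expiry value used in the example are placeholders. In the mongo shell, each object would be passed to db.runCommand().

```javascript
// Hypothetical helpers that build command documents; they do not contact a server.

// dropIndexes: drop the index with the given key pattern from a collection.
function dropIndexesCmd(collName, keyPattern) {
  return { dropIndexes: collName, index: keyPattern };
}

// collMod: change expireAfterSeconds on an existing TTL index.
function collModTtlCmd(collName, keyPattern, expireAfterSeconds) {
  return {
    collMod: collName,
    index: { keyPattern: keyPattern, expireAfterSeconds: expireAfterSeconds },
  };
}

// Placeholder values for illustration only.
const dropCmd = dropIndexesCmd("foo", { a: 1 });
const ttlCmd = collModTtlCmd("foo", { a: 1 }, 3600);
console.log(JSON.stringify(dropCmd));
console.log(JSON.stringify(ttlCmd));

// Aborting an index build uses the shell helper rather than a command document:
//   db.killOp(opid)   // opid taken from the output of db.currentOp()
```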

      This is a regression introduced in version 1.7.2 by 9923c7b6, and affects all versions released since.

      The following script will reproduce this issue:

      var numDocs = 10000;
      
      // Set up cluster.
      var st = new ShardingTest({shards: 2});
      var s = st.s0;
      var d1 = st.shard1;
      var coll = s.getDB("test").foo;
      assert.commandWorked(s.adminCommand({enableSharding: coll.getDB().getName()}));
      assert.commandWorked(s.adminCommand({shardCollection: coll.getFullName(), key: {_id: "hashed"}}));
      for (var i = 0; i < numDocs; i++) {
          coll.insert({_id: i});
      }
      assert.commandWorked(coll.ensureIndex({a: 1}));
      
      // Check document count.
      assert.eq(numDocs, coll.find().itcount());
      
      // Configure server to increase reproducibility.
      assert.commandWorked(d1.adminCommand({setParameter: 1, internalQueryExecYieldIterations: 2}));
      assert.commandWorked(d1.adminCommand({configureFailPoint: "setYieldAllLocksWait", mode: "alwaysOn",
                                            data: {namespace:"test.foo", waitForMillis: 100}}));
      
      // Initiate migration and index drop in parallel.
      var awaitShell = startParallelShell("sleep(1000); assert.commandWorked(db.foo.dropIndex({a: 1}));", s.port);
      assert.commandWorked(s.adminCommand({moveChunk: coll.getFullName(), find: {_id: 0}, to: "shard0000",
                                           _waitForDelete: true}));
      // Wait for the parallel shell to finish before disabling the fail point.
      awaitShell();
      assert.commandWorked(d1.adminCommand({configureFailPoint: "setYieldAllLocksWait", mode: "off"}));
      
      // Re-check document count.
      assert.eq(numDocs, coll.find().itcount());
      

      When run locally with version 3.2.1, the above script fails on the last line with the following:

      2016-02-09T11:05:11.076-0500 E QUERY    [thread1] Error: [10000] != [7541] are not equal : undefined
      

            Assignee: Tess Avitabile (Inactive)
            Reporter: J Rassi