[SERVER-22535] Some index operations (drop index, abort index build, update TTL config) on collection during active migration can cause migration to skip documents
Created: 09/Feb/16  Updated: 17/Nov/16  Resolved: 09/Feb/16

Status: Closed
Project: Core Server
Component/s: Index Maintenance, Sharding
Affects Version/s: None
Fix Version/s: 2.6.12, 3.0.10, 3.2.3, 3.3.2

Type: Bug
Priority: Critical - P2
Reporter: J Rassi
Assignee: Tess Avitabile (Inactive)
Resolution: Done
Votes: 0
Labels: code-only
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-13123 All callers of PlanExecutor::getNext ... Closed
is related to SERVER-23425 Inserts and updates during chunk migr... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Sprint: Query 10 (02/22/16)
Participants:

 Description   

The migration logic on the donor shard that performs the initial index scan for documents to clone does not check for plan executor errors. If the executor is killed during the index scan, the scan stops early and the donor proceeds with a truncated set of documents to clone.

As a result, an index operation that invalidates plan executors while the initial index scan for a migration is yielding causes some documents not to be transferred during the migration. These documents are then deleted from the cluster by the next migration cleanup job.
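The failure mode can be illustrated with a small simulation. This is only a sketch in plain JavaScript, not the server's C++ code: `makeExecutor`, `collectLocsUnchecked`, `collectLocsChecked`, and the state names are hypothetical stand-ins chosen for illustration. It assumes the scan loop treats any result other than "advanced" as the end of the scan, which is the kind of unchecked loop the fix commits describe.

```javascript
// Hypothetical stand-ins for executor states; not the server's actual API.
var ADVANCED = "ADVANCED", IS_EOF = "IS_EOF", DEAD = "DEAD";

// Toy executor over an array of record ids. killAt simulates an invalidation
// (for example, an index drop) that arrives while the scan is yielded.
function makeExecutor(recordIds, killAt) {
    var pos = 0;
    return {
        getNext: function() {
            if (killAt !== undefined && pos >= killAt) {
                return {state: DEAD};
            }
            if (pos >= recordIds.length) {
                return {state: IS_EOF};
            }
            return {state: ADVANCED, recordId: recordIds[pos++]};
        }
    };
}

// Unchecked pattern: a killed executor looks the same as a finished scan,
// so the caller silently receives a truncated list.
function collectLocsUnchecked(exec) {
    var locs = [];
    var res;
    while ((res = exec.getNext()).state === ADVANCED) {
        locs.push(res.recordId);
    }
    return locs;
}

// Checked pattern: only IS_EOF counts as success; anything else is an error
// that should fail the migration instead of truncating the clone set.
function collectLocsChecked(exec) {
    var locs = [];
    var res;
    while ((res = exec.getNext()).state === ADVANCED) {
        locs.push(res.recordId);
    }
    if (res.state !== IS_EOF) {
        throw new Error("executor error during initial index scan: " + res.state);
    }
    return locs;
}
```

Calling `collectLocsUnchecked(makeExecutor([1, 2, 3, 4, 5], 3))` returns only the ids seen before the simulated kill, while `collectLocsChecked` throws on the same input.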

The following index operations invalidate plan executors, and thus are able to trigger this issue:

  • Dropping an index with the dropIndexes command.
  • Aborting an index build with killOp().
  • Updating the TTL configuration for an index with the collMod command.
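For reference, the command documents behind the first and third operations can be sketched as below. The helper names `dropIndexCmd` and `ttlUpdateCmd` are hypothetical, and the collection name, key patterns, and TTL value in the usage note are illustrative assumptions, not values from this ticket.

```javascript
// Hypothetical helper: builds a dropIndexes command that drops one index by
// key pattern. The shell helper db.<coll>.dropIndex(keyPattern) sends a
// command of this form.
function dropIndexCmd(collName, keyPattern) {
    return {dropIndexes: collName, index: keyPattern};
}

// Hypothetical helper: builds a collMod command that changes the
// expireAfterSeconds value of an existing TTL index.
function ttlUpdateCmd(collName, keyPattern, seconds) {
    return {
        collMod: collName,
        index: {keyPattern: keyPattern, expireAfterSeconds: seconds}
    };
}

// Aborting an index build takes no command document here: look up the
// operation id in the output of db.currentOp() and pass it to db.killOp(opid).
```

In a mongo shell these would be run as, for example, `db.runCommand(dropIndexCmd("foo", {a: 1}))` or `db.runCommand(ttlUpdateCmd("foo", {createdAt: 1}, 3600))`. Running any of them against a collection with an active outgoing migration is what exposes the issue on affected versions.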

This is a regression introduced in version 1.7.2 by 9923c7b6, and affects all versions released since.

The following script will reproduce this issue:

var numDocs = 10000;
 
// Set up cluster.
var st = new ShardingTest({shards: 2});
var s = st.s0;
var d1 = st.shard1;
var coll = s.getDB("test").foo;
assert.commandWorked(s.adminCommand({enableSharding: coll.getDB().getName()}));
assert.commandWorked(s.adminCommand({shardCollection: coll.getFullName(), key: {_id: "hashed"}}));
for (var i = 0; i < numDocs; i++) {
    coll.insert({_id: i});
}
assert.commandWorked(coll.ensureIndex({a: 1}));
 
// Check document count.
assert.eq(numDocs, coll.find().itcount());
 
// Configure server to increase reproducibility.
assert.commandWorked(d1.adminCommand({setParameter: 1, internalQueryExecYieldIterations: 2}));
assert.commandWorked(d1.adminCommand({configureFailPoint: "setYieldAllLocksWait", mode: "alwaysOn",
                                      data: {namespace:"test.foo", waitForMillis: 100}}));
 
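// Initiate migration and index drop in parallel. The parallel shell sleeps first so that the
// drop lands while the migration's initial index scan is yielding.
var awaitShell = startParallelShell("sleep(1000); assert.commandWorked(db.foo.dropIndex({a: 1}));", s.port);
assert.commandWorked(s.adminCommand({moveChunk: coll.getFullName(), find: {_id: 0}, to: "shard0000",
                                     _waitForDelete: true}));
// Wait for the parallel shell to exit.
awaitShell();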
assert.commandWorked(d1.adminCommand({configureFailPoint: "setYieldAllLocksWait", mode: "off"}));
 
// Re-check document count.
assert.eq(numDocs, coll.find().itcount());

When run locally with version 3.2.1, the above script fails on the last line with the following:

2016-02-09T11:05:11.076-0500 E QUERY    [thread1] Error: [10000] != [7541] are not equal : undefined



 Comments   
Comment by Githook User [ 07/Mar/16 ]

Author: Tess Avitabile (tessavitabile) <tess.avitabile@mongodb.com>

Message: SERVER-22535 Fix MigrateFromStatus::storeCurrentLocs() to not dereference invalid memory
Branch: v3.0
https://github.com/mongodb/mongo/commit/1e0512f8453d103987f5fbfb87b71e9a131c2a60

Comment by Githook User [ 04/Mar/16 ]

Author: Tess Avitabile (tessavitabile) <tess.avitabile@mongodb.com>

Message: SERVER-22535 Migration source manager checks for PlanExecutor errors during initial index scan for documents to clone
Branch: v2.6
https://github.com/mongodb/mongo/commit/b3ca937033e8794a670df9d1412dad9716d1eca0

Comment by Githook User [ 22/Feb/16 ]

Author: Tess Avitabile (tessavitabile) <tess.avitabile@mongodb.com>

Message: SERVER-22535 Migration source manager checks for PlanExecutor errors during initial index scan for documents to clone

Custom backport from f5a9081a412ada3fc8a472b267f932f76b345126
Branch: v3.0
https://github.com/mongodb/mongo/commit/92c072f6162ace680b733b8fef2dcd7b1c4c6b50

Comment by Githook User [ 09/Feb/16 ]

Author: Tess Avitabile (tessavitabile) <tess.avitabile@mongodb.com>

Message: SERVER-22535 Migration source manager checks for PlanExecutor errors during initial index scan for documents to clone

(cherry picked from commit f5a9081a412ada3fc8a472b267f932f76b345126)
Branch: v3.2
https://github.com/mongodb/mongo/commit/3e884ea31a0e52dc87c646a8f59d08c36c3b858a

Comment by Githook User [ 09/Feb/16 ]

Author: Tess Avitabile (tessavitabile) <tess.avitabile@mongodb.com>

Message: SERVER-22535 Migration source manager checks for PlanExecutor errors during initial index scan for documents to clone
Branch: master
https://github.com/mongodb/mongo/commit/f5a9081a412ada3fc8a472b267f932f76b345126

Comment by J Rassi [ 09/Feb/16 ]

This issue was discovered during a code audit, as part of the fix for SERVER-13123.
