Core Server / SERVER-99357

Concurrent migration and drop indexes could leave inconsistent indexes in the cluster

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: 8.1.0-rc0, 8.0.5, 7.0.17
    • Component/s: Sharding
    • Assigned Teams: Catalog and Routing
    • Operating System: ALL
    • Steps to Reproduce:

      1. Apply the attached patch.
      2. Run it with buildscripts/resmoke.py run --suites=sharding jstests/sharding/inconsistent_indexes_fail_migrations.js

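      For illustration only, a minimal sketch of the two racing operations (not the attached patch, which uses failpoints to force the exact interleaving described below; plain concurrent commands do not guarantee it):

      // Sketch only: the attached patch controls the interleaving with failpoints.
      import {funWithArgs} from "jstests/libs/parallel_shell_helpers.js";

      const st = new ShardingTest({shards: 2});
      const ns = "test.coll";
      assert.commandWorked(st.s.adminCommand({enableSharding: "test"}));
      st.ensurePrimaryShard("test", st.shard0.shardName);
      assert.commandWorked(st.s.adminCommand({shardCollection: ns, key: {x: 1}}));
      assert.commandWorked(st.s.getCollection(ns).createIndex({y: 1}));
      assert.commandWorked(st.s.getCollection(ns).insert({x: 0, y: 0}));

      // Migration of the only chunk from shard0 to shard1, run in parallel.
      const awaitMoveChunk = startParallelShell(
          funWithArgs(function(ns, toShard) {
              // May abort if the concurrent drop commits on the source shard.
              db.adminCommand({moveChunk: ns, find: {x: 0}, to: toShard});
          }, ns, st.shard1.shardName), st.s.port);

      // Concurrent dropIndexes, fanned out by the router to all data-bearing shards.
      st.s.getCollection(ns).dropIndex({y: 1});

      awaitMoveChunk();
      st.stop();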

      Currently, a dropIndexes command that runs concurrently with a migration aborts the migration when the drop commits. This works in the majority of scenarios. However, because dropIndexes currently uses a shard version retry loop to send the command to all data-bearing shards, the following operation interleaving can happen:

      • A dropIndexes command is received by a router, reaches the primary shard for the database, and is then sent throughout the cluster to the data-bearing shards.
      • A migration starts and reaches the point where both a source and a destination migration manager are instantiated.
      • The drop is received and executed on the recipient shard.
      • The migration destination manager copies the indexes from the source shard, recreating the dropped index on the recipient.
      • The drop is received and executed on the source shard, aborting the migration.

      This generates an index inconsistency in the cluster: the recipient keeps the recreated index while the other shards have dropped it. If every shard had a chunk before this happened, and there were documents on the source but none on the recipient, then any subsequent migration going the other way (from the former destination shard back to the former source shard) will fail, because the former source shard (now the destination in this example) will find that it has documents but inconsistent indexes.
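      One way to observe the resulting inconsistency from a mongos is to compare index names across shards with $indexStats, which on sharded clusters tags each returned document with the reporting shard (this detection snippet is illustrative, not part of the ticket; on 7.0+ clusters db.checkMetadataConsistency({checkIndexes: true}) can also surface it):

      // Run against the router: group index names by the shards reporting them.
      // An index listed by only a subset of the data-bearing shards is inconsistent.
      db.coll.aggregate([
          {$indexStats: {}},
          {$group: {_id: "$name", shards: {$addToSet: "$shard"}}}
      ]).forEach(printjson);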

      In the field, a customer might not be able to drain a shard being removed with removeShard, and by extension might not be able to finish a transition to a dedicated config server, since transitionToDedicatedConfigServer uses the removeShard machinery.
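      A stuck drain would show up as removeShard progress that stops advancing; for example (shard01 is a placeholder shard name, and this snippet is illustrative rather than from the ticket):

      // Against the router: repeated calls report draining progress. With the
      // inconsistency above, migrations off the shard fail and remaining.chunks
      // stops decreasing.
      const status = db.adminCommand({removeShard: "shard01"});
      printjson({state: status.state, remaining: status.remaining});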

            Assignee:
            Unassigned
            Reporter:
            Marcos José Grillo Ramirez (marcos.grillo@mongodb.com)
            Votes:
            0
            Watchers:
            4
