Core Server / SERVER-96906

checkMetadataConsistency can hit tripwire assert while checking shard key index due to local collection dropped mid-check by dedicated config server transition

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: 8.1.0-rc0, 8.0.0
    • Component/s: None
    • Catalog and Routing
    • ALL

      While investigating SERVER-94838, we found that checkMetadataConsistency can hit a tripwire assertion (code 7531700 "Collection unexpectedly disappeared while holding database DDL lock") while it checks that a sharded collection is supported by an index prefixed by the shard key.
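
      To illustrate what "an index prefixed by the shard key" means, here is a minimal Python sketch. It is not the server's implementation: the function name and the field names `x` and `y` are hypothetical, and comparing both field names and values is a simplifying assumption. Other requirements a shard key index may have to meet are ignored.

```python
def index_supports_shard_key(index_key: dict, shard_key: dict) -> bool:
    """Return True if the shard key's fields, in order, form a prefix of the index key.

    Simplified model: key patterns are plain dicts (insertion-ordered in Python 3.7+),
    and a field matches only if both its name and its value (e.g. 1) are equal.
    """
    index_fields = list(index_key.items())
    shard_fields = list(shard_key.items())
    return index_fields[:len(shard_fields)] == shard_fields


# Hypothetical key patterns: an index on {x: 1} supports the shard key {x: 1} ...
old_index = {"x": 1}
print(index_supports_shard_key(old_index, {"x": 1}))

# ... but not a shard key refined to {x: 1, y: 1}; only an index that
# starts with both fields does.
refined_shard_key = {"x": 1, "y": 1}
print(index_supports_shard_key(old_index, refined_shard_key))
print(index_supports_shard_key({"x": 1, "y": 1, "z": 1}, refined_shard_key))
```

      Under these assumptions, a shard that only has the old index no longer has a supporting index once the shard key is refined, which is the state the reproduction steps rely on.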

      Although the operations involved are similar to those in SERVER-94740, the root cause is different. checkMetadataConsistency does not expect a local "garbage" collection to disappear while it holds the DDL lock, but the transition to a dedicated config server can make that happen, because it drops the database after it checks that it is empty.

      The high-level interleaving is as follows:
      1. Create a sharded cluster with a config shard.
      2. Create a sharded collection with chunks on the config shard, then move them out of the config shard. After this, the local collection will still exist on the config shard, and also in the sharding catalog.
      3. The local collection that remains on the config shard must not have an index supporting the shard key; otherwise checkMetadataConsistency can complete the check without taking the collection lock. To get into this state, create a new index and then refine the shard key to the new index's key. Because the config shard does not own any chunks, the new index is not created there, and the index that already exists there no longer supports the shard key.
      4. Start a transition from the config shard to a dedicated config server up to the point right before blocking new DDL coordinators.
      5. Start a checkMetadataConsistency. It sees that the collection exists in both the local catalog and the sharding catalog and starts checking for shard key index inconsistencies, proceeding up to the point right before it takes the collection lock on the local collection.
      6. The transition to a dedicated config server continues executing and drops the database on the config shard.
      7. The checkMetadataConsistency continues running and tries to take the collection lock. It detects that the local collection does not exist anymore and tasserts.
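
      Steps 5 to 7 amount to a check-then-lock race. The following Python sketch is a toy model of that interleaving, not server code: `ConfigShard`, `check_shard_key_index`, the `before_collection_lock` hook, and the namespace `test.coll` are all hypothetical names, and locks are not modeled at all. Only the assertion code and message are taken from this report.

```python
class TripwireAssertion(Exception):
    """Stand-in for the server's tripwire assertion (tassert)."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


class ConfigShard:
    """Toy model of the config shard's local catalog for a single database."""

    def __init__(self, local_collections):
        self.local_collections = set(local_collections)

    def drop_database(self):
        # Step 6: the transition drops the database, leftover collections included.
        self.local_collections.clear()


def check_shard_key_index(shard, sharding_catalog, nss, before_collection_lock=lambda: None):
    # Step 5: the collection is seen in both catalogs before the collection lock is taken.
    if nss not in shard.local_collections or nss not in sharding_catalog:
        return "skipped"
    # Hook that lets the caller interleave other work at this point.
    before_collection_lock()
    # Step 7: with the collection lock (not modeled), the local catalog is read again.
    if nss not in shard.local_collections:
        raise TripwireAssertion(
            7531700, "Collection unexpectedly disappeared while holding database DDL lock")
    return "index_check_ran"


sharding_catalog = {"test.coll"}

# Without the interleaving, the check reaches the index comparison.
print(check_shard_key_index(ConfigShard({"test.coll"}), sharding_catalog, "test.coll"))

# With the database drop injected between the two catalog reads, the tripwire fires.
shard = ConfigShard({"test.coll"})
try:
    check_shard_key_index(shard, sharding_catalog, "test.coll",
                          before_collection_lock=shard.drop_database)
except TripwireAssertion as exc:
    print("tripwire:", exc)
```

      The model only shows why the second catalog read can disagree with the first one; it says nothing about how the server should handle that case.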

      A reproducer is attached.

        1. repro.patch (7 kB, Joan Bruguera Micó)

            Assignee: Unassigned
            Reporter: Joan Bruguera Micó (joan.bruguera-mico@mongodb.com)
            Votes: 0
            Watchers: 3
