  1. Core Server
  2. SERVER-86622

Resharding coordinator uses possibly stale database info

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 8.0.0-rc0, 5.0.26, 6.0.15, 7.0.8, 7.3.2
    • Affects Version/s: 5.0.0, 6.0.0, 7.0.0, 8.0.0-rc0, 7.3.0-rc3
    • Component/s: None
    • None
    • Catalog and Routing
    • Fully Compatible
    • ALL
    • v7.3, v7.2, v7.0, v6.0, v5.0
    • Run the following test in the sharding suite: resharding_stale_database_info.js
    • CAR Team 2024-02-19
    • 2

      Issue Status as of December 19, 2024

      SUMMARY

      A reshardCollection operation on a cluster with at least two shards can omit a collection's catalog entry from a shard's local catalog if a movePrimary operation was previously issued on the same cluster. This impacts operations involving data migration that look up the database cache, namely movePrimary, reshardCollection, moveRange, and moveChunk.

      The issue affects MongoDB versions since resharding was released in v5.0. The following versions contain fixes:
      5.0.26, 6.0.15, 7.0.8, 7.3.2, 8.0.0

      ISSUE DESCRIPTION AND IMPACT

      When a movePrimary operation is followed by a reshardCollection operation, the new primary shard may not have a catalog entry for the resharded collection if it does not own any chunks under the new distribution.

      This impacts operations involving data migration that rely on the database cache, such as movePrimary, reshardCollection, moveRange, and moveChunk. Consequently, the listCollections command, which looks up information from the primary shard, will not show the resharded collection name.

      Tools involving data cloning or backup, such as mongosync, mongodump, and mongoexport, are also impacted and will miss the resharded collection.

      Note that CRUD operations on the collection will continue to work correctly as the router will target reads and writes to the correct shards.
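The symptom described above (a sharded collection that listCollections no longer reports) can be checked with a small comparison. The following is only a sketch and is not the official diagnostic script referenced in this ticket. The helper name `findUnlistedShardedCollections` is hypothetical. It assumes the caller supplies the full namespaces recorded in `config.collections` and the collection names that listCollections returns for the database.

```javascript
// Hypothetical helper (not part of MongoDB or of the official script).
// dbName:            database to check, e.g. "test"
// shardedNamespaces: full "<db>.<coll>" strings, e.g. the _id values of
//                    documents in config.collections (assumption)
// listedNames:       collection names returned by listCollections for dbName
function findUnlistedShardedCollections(dbName, shardedNamespaces, listedNames) {
  const prefix = dbName + ".";
  const listed = new Set(listedNames);
  return shardedNamespaces
    .filter((ns) => ns.startsWith(prefix))    // keep namespaces of this database
    .map((ns) => ns.slice(prefix.length))     // strip the "<db>." prefix
    .filter((name) => !listed.has(name));     // keep those listCollections missed
}

// Possible usage from mongosh connected to a mongos (illustrative only):
// const sharded = db.getSiblingDB("config").collections
//   .find({}, { _id: 1 }).toArray().map((d) => d._id);
// const listed = db.getSiblingDB("test").getCollectionNames();
// findUnlistedShardedCollections("test", sharded, listed);
```

A non-empty result is only a hint that the primary shard's local catalog may be missing an entry. The README of the official script remains the authoritative procedure.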

      WORKAROUND

      For users on affected versions, run the following command on the config server before executing reshardCollection:

      db.adminCommand({ flushRouterConfig: "<db>" }), where <db> is the name of the database movePrimary ran on.
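As a minimal sketch of the workaround, the command can be wrapped so that a failed flush is noticed before reshardCollection is attempted. The helper name `flushDatabaseRoutingInfo` is hypothetical. It assumes `adminDb` is a handle to the admin database on a connection to the config server (for example `db.getSiblingDB("admin")` in mongosh) and that `dbName` is the database movePrimary ran on.

```javascript
// Hypothetical wrapper around the workaround command from this ticket.
// adminDb: handle to the "admin" database (assumption: connected to the
//          config server, as the workaround requires)
// dbName:  name of the database that movePrimary ran on
function flushDatabaseRoutingInfo(adminDb, dbName) {
  const res = adminDb.runCommand({ flushRouterConfig: dbName });
  // Stop here if the flush did not succeed, so resharding is not started
  // against a cache that may still be stale.
  if (!res || res.ok !== 1) {
    throw new Error("flushRouterConfig failed: " + JSON.stringify(res));
  }
  return res;
}

// Possible usage in mongosh (illustrative only):
// flushDatabaseRoutingInfo(db.getSiblingDB("admin"), "<db>");
// ...then issue reshardCollection as usual.
```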
      

      DIAGNOSIS & REMEDIATION

      To diagnose and remediate the issue, you should:

      1. Upgrade to one of the fixed versions mentioned above.
      2. Run the script to confirm whether you are impacted and to address the underlying issue. Please review the README carefully before running the script. If you have any questions, please open a support case or start a chat with the Atlas Support team.

      Original Description

      The resharding coordinator forces a refresh of the collection routing info cache and then extracts the database primary shard from it.

      While this ensures that the collection metadata retrieved is causally consistent with the latest DDL operation executed on the collection itself, it does not guarantee that the database metadata is causally consistent with the latest DDL operations executed on the database.

      In fact, forcing a refresh of the collection routing info does not also force a refresh of the database info cache. This means that the database primary shard exposed through the collection routing info cache could be stale.

      If the resharding coordinator uses stale database primary shard information, it may fail to include the current database primary shard in the set of recipient shards for the resharding operation. The resharding operation then misses updating the state of the target collection on the database primary shard, leaving the local catalog on that shard in an inconsistent state. In particular, if the database primary shard doesn't own any chunks for the resharded collection, it may not have the collection in its local catalog after the resharding operation has finished.
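The failure mode in the paragraph above can be illustrated with a toy model. This is not server code: the function `pickRecipients` and the shard names are invented for illustration. It assumes, as the description implies, that the recipient set should contain the shards that own chunks under the new distribution plus the database primary shard.

```javascript
// Toy model only; names are illustrative and do not mirror the server source.
function pickRecipients(newChunkOwners, dbPrimaryShard) {
  // Recipients = shards owning chunks under the new distribution,
  // plus whatever the coordinator believes is the database primary shard.
  return new Set([...newChunkOwners, dbPrimaryShard]);
}

const cachedDbPrimary = "shard0";   // stale value, cached before movePrimary
const actualDbPrimary = "shard1";   // real primary after movePrimary
const newChunkOwners = ["shard0"];  // shard1 owns no chunks after resharding

// With the stale value, the real primary (shard1) is left out of the
// recipient set, so its local catalog is never updated.
const staleRecipients = pickRecipients(newChunkOwners, cachedDbPrimary);

// With up-to-date database info, shard1 is included as a recipient.
const freshRecipients = pickRecipients(newChunkOwners, actualDbPrimary);

console.log(staleRecipients.has(actualDbPrimary)); // false
console.log(freshRecipients.has(actualDbPrimary)); // true
```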

      This is particularly problematic because DDL operations rely on the assumption that the database primary shard always has correct and up-to-date information about collections in the database the node is primary for.

       

            Assignee:
            tommaso.tocci@mongodb.com Tommaso Tocci
            Reporter:
            tommaso.tocci@mongodb.com Tommaso Tocci
            Votes:
            0
            Watchers:
            7
