SUMMARY
A reshardCollection operation on a cluster with at least two shards can omit a collection's catalog entry from a shard's local catalog if a movePrimary operation was previously issued on the same cluster. This impacts operations involving data migration that look up the database cache, namely movePrimary, reshardCollection, moveRange, and moveChunk.
The issue affects MongoDB versions since resharding was released in v5.0. The following versions contain fixes:
5.0.26, 6.0.15, 7.0.8, 7.3.2, 8.0.0
ISSUE DESCRIPTION AND IMPACT
When a movePrimary operation is followed by a reshardCollection operation, the new primary shard may not have a catalog entry for the resharded collection if it does not own any chunks under the new distribution.
This impacts operations involving data migration that rely on the database cache, such as movePrimary, reshardCollection, moveRange, and moveChunk. Consequently, the listCollections command, which looks up information from the primary shard, will not show the resharded collection name.
Tools involving data cloning/backup, such as mongosync, mongodump, and mongoexport, are also impacted and will miss the resharded collection.
Note that CRUD operations on the collection will continue to work correctly as the router will target reads and writes to the correct shards.
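The symptom described above (a collection that is tracked in the sharding catalog but absent from listCollections) can be sketched as a simple comparison. This is a minimal illustration only, not the diagnostic script referenced in this ticket. The helper name findMissingCollections and its inputs are hypothetical; it assumes the caller has already read the namespaces from config.collections (where each document's _id is "<db>.<coll>") and the collection names reported by listCollections for the database.

```javascript
// Hypothetical helper for illustration; not part of any MongoDB API.
// configDocs:  documents read from config.collections, each with _id "<db>.<coll>".
// listedNames: collection names reported by listCollections for dbName.
function findMissingCollections(dbName, configDocs, listedNames) {
  const prefix = dbName + ".";
  const listed = new Set(listedNames);
  return configDocs
    .map((doc) => doc._id)
    // Keep only namespaces that belong to the database being checked.
    .filter((ns) => ns.startsWith(prefix))
    // Strip the "<db>." prefix to get the bare collection name.
    .map((ns) => ns.slice(prefix.length))
    // A tracked collection that listCollections does not report is suspect.
    .filter((coll) => !listed.has(coll));
}

// Example with made-up data: "orders" is tracked but not listed.
const exampleMissing = findMissingCollections(
  "test",
  [{ _id: "test.orders" }, { _id: "test.users" }, { _id: "other.logs" }],
  ["users"]
);
console.log(exampleMissing);
```

In mongosh, the two inputs could plausibly be gathered with db.getSiblingDB("config").collections.find().toArray() and db.getSiblingDB("<db>").getCollectionNames(), run through a mongos. A non-empty result is only a hint that the database may be affected; the script and README mentioned under DIAGNOSIS & REMEDIATION remain the reference procedure.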
WORKAROUND
For users on affected versions, run the following command on the config server before executing reshardCollection:
db.adminCommand({ flushRouterConfig: "<db>" }), where <db> is the name of the database on which movePrimary was run.
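The ordering in the workaround (flush on the config server first, then reshard) can be sketched as a small wrapper. This is a sketch under assumptions: flushThenReshard is a hypothetical helper, and configDb and mongosDb are assumed to be database handles exposing adminCommand (for example, mongosh connections to the config server and to a mongos respectively). Only the flushRouterConfig and reshardCollection commands themselves come from MongoDB.

```javascript
// Hypothetical wrapper for illustration; the two handles are assumptions:
//   configDb - a handle connected to the config server
//   mongosDb - a handle connected to a mongos
function flushThenReshard(configDb, mongosDb, dbName, collName, newKey) {
  // Step 1: flush the cached routing info for the database on the config server,
  // as described in the workaround.
  const flushResult = configDb.adminCommand({ flushRouterConfig: dbName });
  if (!flushResult || Number(flushResult.ok) !== 1) {
    throw new Error("flushRouterConfig failed; not starting reshardCollection");
  }
  // Step 2: only after a successful flush, start the resharding operation.
  return mongosDb.adminCommand({
    reshardCollection: dbName + "." + collName,
    key: newKey,
  });
}
```

A call might look like flushThenReshard(configDb, mongosDb, "test", "orders", { customerId: 1 }), where the database, collection, and shard key are placeholders. Refusing to reshard when the flush does not succeed is a design choice of this sketch, not a documented requirement.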
DIAGNOSIS & REMEDIATION
To diagnose and remediate the issue, you should:
1. Upgrade to one of the fixed versions mentioned above.
2. Run the script to confirm whether you are impacted and to address the underlying issue. Please review the README carefully before running the script. If you have any questions, please open a support case or start a chat with the Atlas Support team.
Original Description
The resharding coordinator forces a refresh of the collection routing info cache and then extracts the database primary shard from it.
While this ensures that the collection metadata retrieved is causally consistent with the latest DDL operation executed on the collection itself, it does not guarantee that the database metadata is causally consistent with the latest DDL operations executed on the database.
In fact, forcing a refresh of the collection routing info does not also force a refresh of the database info cache. This means that the database primary shard exposed through the collection routing info cache could be stale.
If the resharding coordinator uses stale database primary shard information, it may not include the current database primary shard in the set of recipient shards of the resharding operation. As a result, the resharding operation will not update the state of the target collection on the database primary shard, leaving the local catalog on that shard in an inconsistent state. In particular, if the database primary shard doesn't own any chunk of the resharded collection, it may not have the collection in its local catalog after the resharding operation has finished.
This is particularly problematic because DDL operations rely on the assumption that the database primary shard always has correct and up-to-date information about collections in the database the node is primary for.
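The failure mode described above can be illustrated with a toy model. This is not MongoDB server code: the function names, the cache object, and the shard names are all hypothetical, and the model only captures the idea that the recipient set is built from the new chunk owners plus whatever database primary the cache currently reports.

```javascript
// Toy model for illustration only; every name here is hypothetical.
// The recipient set is modeled as: new chunk owners plus the cached database primary.
function computeRecipients(newChunkOwners, cachedDbPrimary) {
  return new Set([...newChunkOwners, cachedDbPrimary]);
}

function simulate({ refreshDatabaseCache }) {
  const authoritative = { dbPrimary: "shardA" };
  const cache = { dbPrimary: authoritative.dbPrimary };

  // movePrimary changes the authoritative primary; the cache still says shardA.
  authoritative.dbPrimary = "shardB";

  // A collection-only refresh leaves the cached database info untouched.
  // The database info is updated only when it is refreshed explicitly.
  if (refreshDatabaseCache) {
    cache.dbPrimary = authoritative.dbPrimary;
  }

  // Under the new distribution only shardC owns chunks, so the current
  // primary (shardB) is a recipient only if the cache reports it as primary.
  const recipients = computeRecipients(["shardC"], cache.dbPrimary);
  return {
    currentPrimary: authoritative.dbPrimary,
    primaryIsRecipient: recipients.has(authoritative.dbPrimary),
  };
}

console.log(simulate({ refreshDatabaseCache: false }));
console.log(simulate({ refreshDatabaseCache: true }));
```

With the stale cache, the current primary is left out of the recipient set, which corresponds to the missing catalog entry described in this ticket. With the database info refreshed, it is included.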
- is related to: SERVER-86671 CollectionRoutingInfo could contain stale database information even after refresh (Closed)
- related to: SERVER-88417 processReshardingFieldsForRecipientCollection can use stale db info and incorrectly creates a recipient (Closed)