[SERVER-68511] movePrimary might introduce sharding metadata inconsistency in MongoDB 5.0+ Created: 02/Aug/22 Updated: 29/Oct/23 Resolved: 04/Aug/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 6.0.1, 5.0.11, 6.1.0-rc0 |
| Type: | Bug | Priority: | Blocker - P1 |
| Reporter: | Pierlauro Sciarelli | Assignee: | Pierlauro Sciarelli |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Backport Requested: | v6.0, v5.0 |
||||||||||||
| Sprint: | Sharding EMEA 2022-08-08 | ||||||||||||
| Participants: | |||||||||||||
| Case: | (copied to CRM) | ||||||||||||
| Description |
|
Issue Status as of Aug 10, 2022
ISSUE DESCRIPTION AND IMPACT
In MongoDB 5.0.0-5.0.10 and 6.0.0, when running featureCompatibilityVersion 5.0+, the movePrimary command can cause inconsistent sharding metadata when the target database for the command was created while under featureCompatibilityVersion 4.4 or earlier. This issue is fixed in MongoDB 5.0.11 and 6.0.1. As a result, after a movePrimary operation:
The movePrimary command performs an update to the config.databases collection to complete the operation of changing a database's primary shard. This issue occurs because the update filter for this query tests for equality against a whole subdocument, instead of matching each field in the subdocument individually using dotted notation. Whole-subdocument equality depends on the order of the fields, so the filter fails to match (and therefore update) the necessary metadata due to the following differences in how metadata is stored and managed:
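As a hedged illustration of the matching behavior described above, the sketch below mimics the two comparison styles in plain JavaScript. It does not call a MongoDB server and is not the server's code; the helper names (`matchesWholeSubdocument`, `matchesFieldByField`) and the field names and values of the sample `version` subdocuments are assumptions chosen only to show the effect of field order.

```javascript
// Illustrative only: plain JavaScript that mimics two ways of comparing a
// stored subdocument with a filter. Nothing here talks to a MongoDB server.

// Whole-subdocument equality: same fields, same values AND same field order.
function matchesWholeSubdocument(stored, expected) {
  const storedKeys = Object.keys(stored);
  const expectedKeys = Object.keys(expected);
  if (storedKeys.length !== expectedKeys.length) return false;
  return expectedKeys.every(
    (key, i) => storedKeys[i] === key && stored[key] === expected[key]
  );
}

// Dotted-notation style: each field is checked on its own, so the order in
// which the fields are stored does not matter.
function matchesFieldByField(stored, expected) {
  return Object.keys(expected).every((key) => stored[key] === expected[key]);
}

// Hypothetical subdocuments (assumed names and values): identical fields,
// different field order.
const storedVersion = { uuid: "abc", lastMod: 1, timestamp: 100 };
const filterVersion = { uuid: "abc", timestamp: 100, lastMod: 1 };

const wholeMatch = matchesWholeSubdocument(storedVersion, filterVersion);
const fieldMatch = matchesFieldByField(storedVersion, filterVersion);

console.log("whole-subdocument match:", wholeMatch);
console.log("field-by-field match:", fieldMatch);
```

With these sample values the whole-subdocument comparison reports no match while the field-by-field comparison matches, which is the same asymmetry that lets an update filter silently become a no-op.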
The movePrimary command's improper use of a document that matches the FCV 5.0 version format causes updates to miss metadata documents that were converted during setFeatureCompatibilityVersion from earlier versions.
DIAGNOSIS
If a database in a sharded cluster was created on MongoDB 4.4 or earlier, and the cluster is currently running MongoDB 5.0.0-5.0.10 or 6.0.0, the cluster is vulnerable and likely to be impacted during a movePrimary command. Signs a cluster has been impacted include one or more of the following:
WORKAROUND
For MongoDB Atlas customers: Please open a support case or start a chat with the Atlas Support team to coordinate this workaround if you have an immediate need to reduce shard count. Otherwise, do not reduce shard count or run movePrimary until you have upgraded to MongoDB 5.0.11 or 6.0.1.
For all other users (including Ops Manager and Cloud Manager customers): The following command modifies config server metadata to match the format expected by the incorrect codepath, and allows subsequent movePrimary commands to complete correctly. If you are on MongoDB Ops Manager or Cloud Manager, these steps also make it safe to reduce shard count. Prior to performing a movePrimary operation (or reducing shard count in Cloud Manager or Ops Manager) on a vulnerable cluster, run the following command from a mongos router:
REMEDIATION
If you have been impacted:
1. Stop writes to affected and related collections.
4. For each unsharded collection in the affected database, merge the data from the source primary shard to the destination primary shard. Manual conflict resolution may be required for upserts, and you may also need to identify documents which should be deleted and address documents which have not been correctly updated. Assuming no conflicts, one way to perform this process is:
Important: If you have run multiple movePrimary commands with differing arguments, then data must be merged from the source primary shard and all shards that have been the destination primary shard of a movePrimary operation.
5. For each unsharded collection in the affected database, drop the collection directly on the source primary shard. Do not drop on a mongos router.
Note: "destination primary shard" in this context is the intended primary shard for the movePrimary operation.
Original description
Short summary of the problem
Calling movePrimary on any database that was created under any FCV pre-v5.0 results in a no-op update on the config.databases entry. The result is that unsharded collections get moved to the destination primary shard but are inaccessible via mongos because the metadata still points to the source primary shard.
Root cause
The update of the primary field in config.databases entries performed as part of movePrimary has a filter containing a nested BSON document for the version field. This is wrong because it makes the query rely on the order of the fields: it will not match documents that have exactly the same fields but in a different order. Using dotted notation in the update filter would solve the issue. The filter was originally introduced under
Steps to reproduce:
Apply the following patch to upgrade_downgrade_sharded_cluster.js on the v5.0 branch (tried on revision 80418c74):
|
| Comments |
| Comment by Pierlauro Sciarelli [ 05/Aug/22 ] |
|
Author: Pierlauro Sciarelli <pierlauro.sciarelli@mongodb.com> (pierlauro) Message: |
| Comment by Githook User [ 05/Aug/22 ] |
|
Author: Pierlauro Sciarelli <pierlauro.sciarelli@mongodb.com> (pierlauro) Message: |
| Comment by Pierlauro Sciarelli [ 04/Aug/22 ] |
|
Author: Pierlauro Sciarelli <pierlauro.sciarelli@mongodb.com> (pierlauro) Message: |