[SERVER-68541] Concurrent removeShard and movePrimary may delete unsharded collections Created: 03/Aug/22 Updated: 29/Oct/23 Resolved: 31/Aug/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 6.1.1, 6.0.3, 6.2.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Silvia Surroca | Assignee: | Antonio Fuschetto |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | data-loss |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v6.1, v6.0, v5.0, v4.4, v4.2 |
| Steps To Reproduce: | repro-undesired-unsharded-collections-remove.patch |
| Sprint: | Sharding EMEA 2022-08-22, Sharding EMEA 2022-09-05 |
| Description |
|
Concurrent removeShard and movePrimary commands may result in the unintended deletion of unsharded collections.
Bug description
At some point, someone decides to call these commands concurrently:
Then, if the internal steps execute in the sequence written below, the cluster ends up with an undesired deletion of all the unsharded collections of 'myDB'.
1. The removeShard command is sent to the config server.
Small note to better understand the 2nd bullet: the removeShard command returns a non-completed status if the shard still has unsharded databases, and notifies the user that those should be moved explicitly using movePrimary. A better explanation can be found here. |
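The note about removeShard reporting a non-completed status can be illustrated with a minimal Python sketch of the sequential (non-concurrent) drain workflow. The helper name `drain_unsharded_databases` and the injected `run_admin_command` callable are hypothetical; the `state` and `dbsToMove` response fields are assumed to match what the removeShard command reports, and chunk draining is ignored entirely.

```python
def drain_unsharded_databases(run_admin_command, shard, target_shard):
    """Run removeShard, move any databases it reports, then run removeShard again.

    run_admin_command is any callable that sends a command document to the
    admin database and returns the reply as a dict (hypothetical interface).
    """
    status = run_admin_command({"removeShard": shard})
    moved = []
    if status.get("state") != "completed":
        # Assumption: the reply lists databases whose primary is the draining shard.
        for db in status.get("dbsToMove", []):
            run_admin_command({"movePrimary": db, "to": target_shard})
            moved.append(db)
        # Ask again only after every movePrimary has returned.
        status = run_admin_command({"removeShard": shard})
    return status.get("state"), moved
```

With PyMongo, `client.admin.command` connected to a mongos could plausibly be passed as `run_admin_command`. The point of the sketch is that each movePrimary finishes before removeShard is called again; the bug in this ticket arises when the two commands overlap instead.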
| Comments |
| Comment by Githook User [ 08/Nov/22 ] |
|
Author: {'name': 'Antonio Fuschetto', 'email': 'antonio.fuschetto@mongodb.com', 'username': 'afuschetto'}Message: |
| Comment by Githook User [ 08/Nov/22 ] |
|
Author: {'name': 'Antonio Fuschetto', 'email': 'antonio.fuschetto@mongodb.com', 'username': 'afuschetto'}Message: |
| Comment by Githook User [ 29/Aug/22 ] |
|
Author: {'name': 'Antonio Fuschetto', 'email': 'antonio.fuschetto@mongodb.com', 'username': 'afuschetto'}Message: |
| Comment by Githook User [ 29/Aug/22 ] |
|
Author: {'name': 'Antonio Fuschetto', 'email': 'antonio.fuschetto@mongodb.com', 'username': 'afuschetto'}Message: |
| Comment by Antonio Fuschetto [ 09/Aug/22 ] |
Proposed solution
Following the logic currently implemented to commit a chunk migration, it seems natural to adopt the same approach, which consists of 1) exposing a new config server command (i.e., _configsvrCommitMovePrimary) to atomically commit the configuration changes required by the movePrimary command, and 2) synchronizing these configuration changes with the removeShard command (reusing an existing mutex). This solution serializes the configuration changes of concurrent invocations of the removeShard and movePrimary commands and thereby resolves the bug in question.
Backward compatibility
Depending on the versions to which the fix needs to be backported (potentially all), the donor shard could fall back to the current logic (which consists of finding the current primary shard for the given database and then committing the changes) if the new config server command is not exposed (e.g., in a multiversion deployment). Also, up to version 5.0, the config server already exposes the _configsvrCommitMovePrimary command, and it would be interesting to understand why the logic was changed. Was the goal to decentralize the config server logic? However, the idea (to be validated) is to use the same command in order to have a solution compatible with version 5.0 and lower. |
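The serialization idea can be sketched as a toy in-memory model in Python. This is not the server implementation: the `ConfigCatalog` class, its method names, and its reply shapes are hypothetical, and a single `threading.Lock` stands in for the existing config server mutex mentioned above.

```python
import threading


class ConfigCatalog:
    """Toy stand-in for the config server's shard and database metadata."""

    def __init__(self, shards, db_primaries):
        self._lock = threading.Lock()  # stand-in for the shared mutex
        self.shards = set(shards)
        self.db_primaries = dict(db_primaries)

    def commit_move_primary(self, db, expected_primary, to_shard):
        # The validation and the metadata update happen under the same lock,
        # so a concurrent shard removal cannot slip in between them.
        with self._lock:
            if to_shard not in self.shards:
                raise RuntimeError(f"shard {to_shard} no longer exists")
            if self.db_primaries.get(db) != expected_primary:
                raise RuntimeError(f"primary of {db} changed concurrently")
            self.db_primaries[db] = to_shard

    def remove_shard(self, shard):
        # The shard is only dropped if, under the lock, no database still
        # has it as its primary.
        with self._lock:
            dbs_to_move = sorted(
                db for db, primary in self.db_primaries.items() if primary == shard
            )
            if dbs_to_move:
                return {"state": "ongoing", "dbsToMove": dbs_to_move}
            self.shards.discard(shard)
            return {"state": "completed", "dbsToMove": []}
```

Because both operations take the same lock, a movePrimary commit either lands before the removal check (and the removal reports the database as still to be moved) or fails because the target shard is already gone. Either way, no database ends up pointing at a removed shard in this model.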
| Comment by Kaloian Manassiev [ 04/Aug/22 ] |
|
Hmm, will this actually get fixed just by the movePrimary coordinator implementation by itself? Even at commit time of the shard removal, nothing prevents the commit of the new placement from happening after the shard has been removed. I don't think we need to wait until the Add/Remove Shard project, but the move primary commit needs to become a command on the CSRS which serialises with the shard removal lock. CC antonio.fuschetto@mongodb.com to keep in mind. |
| Comment by Cris Insignares Cuello [ 04/Aug/22 ] |
|
kaloian.manassiev@mongodb.com antonio.fuschetto@mongodb.com Considering that, as part of Sharding First, we are going to rewrite the MovePrimary coordinator, we should also fix this one. |