[SERVER-53432] Ensure that comparing resharding requested zones with the authoritative tags is protected by concurrency control Created: 18/Dec/20 Updated: 27/Oct/23 Resolved: 27/Oct/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 5.0.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Blake Oler | Assignee: | Kaloian Manassiev |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | PM-234, PM-234-M3, PM-234-T-lifecycle, sharding-wfbf-day | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||
| Backport Requested: |
v5.0
|
||||||||||||||||||||
| Sprint: | Sharding EMEA 2021-05-31, Sharding EMEA 2021-11-01 | ||||||||||||||||||||
| Participants: | |||||||||||||||||||||
| Story Points: | 2 | ||||||||||||||||||||
| Description |
|
As of now, it's possible that the contents of config.tags can be changed in-between the time at which config.tags is queried, and when the final set of requested zones is written to the temporary collection. There are a couple of concurrency constraints that can be used, one which seems necessary, and the other which may be necessary but would require additional thought.
|
| Comments |
| Comment by Kaloian Manassiev [ 27/Oct/21 ] | |||||||||||||||||
|
Per discussion with Max and based on his comment above, we decided that the risk of this happening is of someone is changing collection's zones concurrently with Resharding. Given that the fix is not trivial (we need to start taking and recovering the Zones lock as part of the resharding coordinator) and the probability of it happening is really low with a very benign side effect, we will not fix it. | |||||||||||||||||
| Comment by Max Hirschhorn [ 27/Oct/21 ] | |||||||||||||||||
It looks like the balancer will simply log a warning in this situation so the only real risk is log spam until the user takes action on it. I think it would be fine to close
| |||||||||||||||||
| Comment by Max Hirschhorn [ 15/Oct/21 ] | |||||||||||||||||
|
Resharding is inserting config.tags documents associated with the temporary resharding collection without holding the _kZoneOpLock. I believe this means if there's a concurrent removeShardFromZone command, then config.tags documents can be created that refer to a nonexistent zone name. Is that a concern to the Sharding EMEA team? | |||||||||||||||||
| Comment by Connie Chen [ 14/Oct/21 ] | |||||||||||||||||
|
max.hirschhorn - Could you let the Sharding EMEA team know if this still needs to be done and the priority. | |||||||||||||||||
| Comment by Janna Golden [ 17/May/21 ] | |||||||||||||||||
|
We removed the constraint that any zones passed by the user must use an existing zone name if zones for the collection already exist in config.tags (so the validateZones() function was removed). Now, we only assert that if zones already exist in config.tags, the user passes something for the zones parameter. The assertion that a zone is associated with a shard happens during the initial splitting, similar to shardCollection. We do not take the _kZoneOpLock when reading from config.shards in either case. | |||||||||||||||||
| Comment by Kaloian Manassiev [ 13/May/21 ] | |||||||||||||||||
|
The idea of using allowMigrations actually doesn't sound bad to me at all. The meaning of allowMigrations is "I want this collection to stay frozen with respect to its placement". Zones count as part of the placement of the collection, so we can just make it so that you can't add/remove zones while it is set. | |||||||||||||||||
| Comment by Max Hirschhorn [ 20/Feb/21 ] | |||||||||||||||||
|
kaloian.manassiev, what happens if a config.tags document is inserted which references a zone name that doesn't actually exist? I see that ShardingCatalogManager::assignKeyRangeToZone() uses the _kZoneOpLock to prevent the zone name from being removed while a chunk referring to it is being added. I also see we prevent removing a zone name when there is still a chunk which references it. This makes me feel like the user input validation should occur under the _kZoneOpLock at the time of writing to the config.chunks collection. I think it is fine not use the _kZoneOpLock when doing user input validation before then. Duplicating the user input validation is likely beneficial because calculating the config.chunks documents for the temporary resharding collection is a moderately expensive operation that the user shouldn't need to wait for it if they've made a typo. On a semi-related note, it seems like validateZones() is using an incomplete set of zone names. The {reshardCollection: ..., zones: [...]} syntax is required if the collection being resharded has any zones - ShardingCatalogClient::getTagsForCollection() is fine for that purpose. However, there is no requirement that the zone names for the collection after it has been resharded are the same as the zone names for the collection before it had been resharded. We should instead be passing all zone names that exist across all shards to validateZones(). Alternatively, we could try and reuse ShardingCatalogManager::assignKeyRangeToZone() to do the validation. |