[SERVER-62310] collMod command not sent to all shards for a sharded collection if no chunks have been received Created: 29/Dec/21 Updated: 13/Sep/23 Resolved: 09/Jun/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Alex Bevilacqua | Assignee: | Enrico Golfieri |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | shardingemea-qw | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: | |
| Assigned Teams: | Sharding EMEA |
| Operating System: | ALL |
| Sprint: | Sharding 2022-03-07, Sharding NYC 2022-03-21, Sharding NYC 2022-04-04, Sharding NYC 2022-05-30, Sharding 2022-06-27, Sharding EMEA 2023-06-12 |
| Participants: | |
| Story Points: | 3 |
| Description |
|
This ticket is an extension of a previous issue. For example, given a two-shard sharded cluster:
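A sketch of the kind of setup the description implies; the database name, collection name, and shard key are assumptions, not taken from the ticket:

```javascript
// Hypothetical repro setup (names assumed).
sh.enableSharding("test");
sh.shardCollection("test.coll", { x: 1 });
// Per the description, shardCollection creates the collection on shards
// beyond the primary, even when those shards have received no chunks.
```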
If we issue a collMod while connected to the cluster via a mongos, the primary shard will be updated, but any other shards that had this collection created via the shardCollection command will not. For example, if we send the following command, we would expect validationLevel to be updated to "off" from the default of "strict":
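A sketch of such a command (the collection name is an assumption):

```javascript
// Run against a mongos; "coll" is an assumed collection name.
db.runCommand({ collMod: "coll", validationLevel: "off" });
```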
If we now connect to each shard individually and run db.getCollectionInfos(), the result is that shard01 in our cluster has the updated collection metadata whereas shard02 does not:
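One way to perform this check, assuming direct connections to each shard (host strings are assumptions):

```javascript
// Connect to each shard directly and compare the collection options.
const shard01 = new Mongo("localhost:27018");
const shard02 = new Mongo("localhost:27019");
printjson(shard01.getDB("test").getCollectionInfos({ name: "coll" }));
// -> options.validationLevel: "off"    (collMod was applied here)
printjson(shard02.getDB("test").getCollectionInfos({ name: "coll" }));
// -> options.validationLevel: "strict" (collMod never reached this shard)
```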
|
| Comments |
| Comment by Jack Mulrow [ 22/Jul/22 ] |
|
I didn't finish this before my sabbatical, so I'm moving it to the backlog. That said, I have a couple things to add:
That's partially true, but the main motivation was that, when we didn't use the shard version protocol, index operations had the same problem as multi-writes: with an unlucky interleaving of chunk migrations they can apply 0, 1, or more times on a particular shard despite returning ok: 1 to the user. Using the shard version protocol was meant to avoid that.

For point 1), I agree an option is to revert the collMod-related changes.

For 2), we solved a similar issue for create/dropIndexes by having a shard receiving its first chunk drop any indexes it has locally that are not on the donor shard (a rough cross-shard check is sketched after this comment).

As for 3), someone should verify this, but to handle create/dropIndex commands concurrently running with chunk migrations leading to divergent indexes, we made create/dropIndexes abort any active chunk migrations when they complete, so a migration can only finish if the collection's indexes haven't changed since they were copied at the beginning of the migration. We made the same change for collMod, so I don't believe this is an issue (beyond the problem from point 2), which I believe is a problem).

And finally, the switch to using DDL coordinators for these operations likely fixed a lot of these issues, so it's possible these problems only exist on earlier branches. |
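As an illustration of the divergence described in point 2), one could compare index sets directly across shards; the host strings and namespace below are assumptions:

```javascript
// Compare index sets across shards; a mismatch indicates the divergence
// discussed above. Host strings and "test.coll" are assumptions.
["localhost:27018", "localhost:27019"].forEach(host => {
  const conn = new Mongo(host);
  print(host);
  printjson(conn.getDB("test").coll.getIndexes());
});
```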
| Comment by Kaloian Manassiev [ 16/Jan/22 ] |
|
CC marcos.grillo, who is also working on something related to collMod, which is now written as a DDL Coordinator. |
| Comment by Max Hirschhorn [ 29/Dec/21 ] |
The collMod command not targeting a shard if it doesn't own chunks for the collection and isn't the primary shard for the database is by design.

Failed chunk migrations don't clean up the created collection. Moreover, nothing in the system remembers that the chunk migration had been attempted and didn't succeed. (The routing table only records shards which actively own chunks for the sharded collection.) The recipient shard may therefore have a stale notion of the sharded collection.

While I don't think we would want to go back to having collMod target all shards in the cluster, I think there are a few related problems involving collMod here:
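A hedged illustration of the routing-table point; the namespace is an assumption, and note that older versions key config.chunks by ns while newer ones key it by the collection's uuid:

```javascript
// List the shards the routing table knows about for the collection.
// "test.coll" is an assumed namespace.
db.getSiblingDB("config").chunks.distinct("shard", { ns: "test.coll" });
// A shard holding a stale copy of the collection after a failed migration
// will not appear in this list, so collMod never targets it.
```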
|