[SERVER-61501] Create sharding suite where collections are clustered by default Created: 15/Nov/21  Updated: 29/Oct/23  Resolved: 27/Jan/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.3.0

Type: Task Priority: Major - P3
Reporter: Haley Connelly Assignee: Haley Connelly
Resolution: Fixed Votes: 0
Labels: PM-2311-M2
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-62874 Investigate unknown sharding_clustere... Closed
Related
related to SERVER-62688 Verify clustered collections with col... Closed
Backwards Compatibility: Fully Compatible
Sprint: Execution Team 2021-12-27, Execution Team 2022-01-10, Execution Team 2022-01-24, Execution Team 2022-02-07
Participants:

 Description   

Similar to the passthrough suite, it would be nice to have a similar sharding suite.

One issue in this could potentially be running into duplicate key errors if the collection is clustered by _id, but sharded on a different shardKey



 Comments   
Comment by Githook User [ 27/Jan/22 ]

Author:

{'name': 'Haley Connelly', 'email': 'haley.connelly@mongodb.com', 'username': 'haleyConnelly'}

Message: SERVER-61501 Create sharding suite where collections are clustered by default
Branch: master
https://github.com/mongodb/mongo/commit/e164041d38bc3c1ac0e902ddd432cfb996fb9e58

Comment by Haley Connelly [ 05/Jan/22 ]

max.hirschhorn The goal of this was to take tests in jstests/sharding directory and modify them to be sharding on a clustered collection.

These tests do exercise chunk migration, which is what we are aiming to test.

For the duplicate key case, I was (not so clearly) referring to the issue that the cluster key must be unique - like a global unique index. We currently don't support unique indexes on a sharded collection when the collection is sharded on a different shard key.

Edit: however - we don't currently enforce global _id uniqueness upon insert.
suppose we have a regular sharded collection: * chunks [a: minKey, a: 0), [a: 0, a: maxKey)

  • insert({a: -100, _id: 1})
  • insert ({a: 20, _id: 1})

 we don't error until trying to do chunk migration. Thus, we should be okay with just altering tests to create clustered collection by default.

Comment by Max Hirschhorn [ 20/Nov/21 ]

One issue in this could potentially be running into duplicate key errors if the collection is clustered by _id, but sharded on a different shardKey

The choice of {_id: "hashed"} is for a couple reasons:

  1. The chunk ranges will be pre-split for a hashed shard key. It is possible to achieve the same effect for range-based sharding using zones, for example, see CreateShardedCollectionUtil.shardCollectionWithChunks().
  2. Every document written across all tests will have an _id field. This means it is very likely for {_id: "hashed"} to end up with documents on both shards. If instead the field was missing most of the time, then all of those documents would end up in the same chunk on the same shard. An alternative which would be suitable for range-based sharding could have been to inject a field with a randomly-generated value via a runCommand() override.

Notably, the goal of pre-splitting and distributing the collection data across multiple shards is to exercise the targeting logic in mongos where some operations must broadcast to all shards and isn't to exercise chunk migration. In general, I skeptical of the amount of chunk migration which ever happens via the balancer in our sharding passthrough testing, including for the concurrency suites.

I couldn't tell from the description what the interesting interaction(s) are between clustered collections and sharded collections for why an {_id: "hashed"} shard key pattern might be undesirable.

Generated at Thu Feb 08 05:52:35 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.