[DRIVERS-1853] Clustered Indexes for all Collections
Created: 15/Jul/21  Updated: 28/Oct/23  Resolved: 13/Feb/23

Status: Closed
Project: Drivers
Component/s: Collection Management
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: Backlog - Core Eng Program Management Team
Assignee: Abraham Egnor
Resolution: Fixed
Votes: 0
Labels: size-small, spec-change
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by MONGOSH-1172 Support creating collections with clu... Closed
is depended on by VSCODE-330 Include the clusteredIndex option in ... Closed
Issue split
split to JAVA-4576 Clustered Indexes for all Collections Closed
split to PHPLIB-843 Clustered Indexes for all Collections Closed
split to RUST-1271 Clustered Indexes for all Collections Closed
split to CDRIVER-4359 Clustered Indexes for all Collections Closed
split to CSHARP-4141 Clustered Indexes for all Collections Closed
split to CXX-2491 Clustered Indexes for all Collections Closed
split to GODRIVER-2383 Clustered Indexes for all Collections Closed
split to MOTOR-935 Clustered Indexes for all Collections Closed
split to NODE-4189 Clustered Indexes for all Collections Closed
split to PYTHON-3227 Clustered Indexes for all Collections Closed
split to RUBY-2959 Clustered Indexes for all Collections Closed
Related
related to DRIVERS-2325 Add commandStartedEvent assertions to... Implementing
Driver Changes: Needed
Server Compat: 5.3
Quarter: FY23Q2
Upstream Changes Summary:

We're planning to add new fields to command responses and change output of listIndexes. Scope will have more details.

Downstream Changes Summary:

Drivers will need to:

  • add the clusteredIndex option for createCollection
  • add the clustered field in the output of listIndexes
  • sync collection-management tests to b042e47
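As a rough illustration of the first two items, the sketch below builds the command document a driver might send for createCollection with a clusteredIndex option, and checks a listIndexes-style entry for the clustered field. The helper names (build_create_command, is_clustered_index), the collection name, and the sample index entry are hypothetical and not part of any driver API; the clusteredIndex shape (key {_id: 1}, unique: true, optional name) is assumed from the server's create command.

```python
from typing import Any, Dict, Optional


def build_create_command(coll_name: str, index_name: Optional[str] = None) -> Dict[str, Any]:
    """Build a `create` command document carrying a clusteredIndex option.

    Hypothetical helper for illustration only; real drivers expose this as
    an option on their own createCollection method.
    """
    # Assumed shape: the clustered index key is {_id: 1} and unique is true.
    clustered_index: Dict[str, Any] = {"key": {"_id": 1}, "unique": True}
    if index_name is not None:
        clustered_index["name"] = index_name
    return {"create": coll_name, "clusteredIndex": clustered_index}


def is_clustered_index(index_doc: Dict[str, Any]) -> bool:
    """Return True if a listIndexes entry has the `clustered` field set to true."""
    return bool(index_doc.get("clustered", False))


cmd = build_create_command("orders", index_name="orders_clustered")

# Assumed example of a listIndexes entry for a clustered collection.
sample_entry = {
    "v": 2,
    "key": {"_id": 1},
    "name": "orders_clustered",
    "unique": True,
    "clustered": True,
}
```

A driver that passes collection options and index documents through unchanged may need little more than this; typed drivers would presumably add a field for each.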

Update: "serverless: forbid" was added in commit d1458823bd810014df9da16d3a5354d2269ab865 (#1216).

Driver Compliance:
Key            Status/Resolution  FixVersion
CDRIVER-4359   Fixed              1.22.0, 1.22.0-beta0
CXX-2491       Done               3.8.0
CSHARP-4141    Fixed              2.16.0
GODRIVER-2383  Done               1.10.0, 1.10.0-beta1
JAVA-4576      Fixed              4.7.0
NODE-4189      Fixed              4.6.0
MOTOR-935      Won't Do
PYTHON-3227    Done
PHPLIB-843     Fixed              1.13.0-beta1, 1.13.0
RUBY-2959      Fixed              2.18.0
RUST-1271      Fixed              2.3.0
SWIFT-1546     Won't Fix

 Description   
Description of Linked Ticket

Epic Summary

Summary

Without clustering, a collection is stored in a B-Tree keyed by a RecordId that is not exposed to end users, and a separate primary key index maps each primary key to its RecordId, as (<primary key>, <RecordId>) entries. With clustering, a collection is stored in a B-Tree keyed by the collection’s primary key, and there is no primary key index. This project is a generalization of clustering for time series (PM-288), and will need to support upgrading existing collections to use clustering.
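The difference between the two layouts can be sketched with a toy model. This is only an illustration under simplifying assumptions: plain Python dicts stand in for B-Trees, and the class names are hypothetical rather than anything in the server.

```python
from typing import Any, Dict, Optional

Doc = Dict[str, Any]


class NonClusteredCollection:
    """Toy model: documents keyed by an internal RecordId, plus a primary key index."""

    def __init__(self) -> None:
        self._next_record_id = 1
        self.records: Dict[int, Doc] = {}    # RecordId -> document
        self.pk_index: Dict[Any, int] = {}   # primary key -> RecordId

    def insert(self, doc: Doc) -> None:
        rid = self._next_record_id
        self._next_record_id += 1
        self.records[rid] = doc
        self.pk_index[doc["_id"]] = rid

    def find_by_pk(self, pk: Any) -> Optional[Doc]:
        rid = self.pk_index.get(pk)          # first lookup: primary key index
        if rid is None:
            return None
        return self.records[rid]             # second lookup: record store


class ClusteredCollection:
    """Toy model: documents keyed directly by primary key, with no primary key index."""

    def __init__(self) -> None:
        self.records: Dict[Any, Doc] = {}    # primary key -> document

    def insert(self, doc: Doc) -> None:
        self.records[doc["_id"]] = doc

    def find_by_pk(self, pk: Any) -> Optional[Doc]:
        return self.records.get(pk)          # single lookup


plain = NonClusteredCollection()
clustered = ClusteredCollection()
for d in ({"_id": 7, "v": "a"}, {"_id": 9, "v": "b"}):
    plain.insert(d)
    clustered.insert(d)
```

Both models return the same documents; the clustered one simply skips the intermediate RecordId hop.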

Motivation

Clustering by primary key is important for fast scale in/out in Serverless. This is largely because split and merge, which do a physical copy (such as a file copy), will replace tenant migration and chunk migration, which do a logical copy.

  • If a tenant does not have local secondary indexes (e.g., only has global indexes), orphan cleanup can be done using truncate rather than individual document deletes. Orphan filtering is expensive, so fast orphan cleanup is particularly important when doing a physical copy. This is because with a logical copy, the recipient can only end up with orphans in the range being transferred, but with a physical copy, the recipient can end up with orphans outside the range being transferred (i.e., more orphans). Orphans also block the merge of two slices that were split from each other, since merge has to be on disjoint ranges.
  • WT data tables for disjoint primary key ranges can be presented as a single table in constant time, for example by adding a root node above the two tables. This can significantly speed up merge, especially if combined with providing a union-view over any local secondary index tables. The tables can actually be merged into one file in the background.
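The idea of presenting two tables for disjoint key ranges as one table by adding a root node can be sketched as follows. This is a simplified illustration and not WiredTiger's implementation: sorted in-memory lists stand in for data tables, both tables are assumed non-empty, and the names SortedTable and RootNode are hypothetical.

```python
from bisect import bisect_left
from typing import Any, Iterator, List, Optional, Tuple

Row = Tuple[int, Any]


class SortedTable:
    """Stand-in for a data table: rows kept sorted by primary key."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = sorted(rows, key=lambda r: r[0])
        self.keys = [k for k, _ in self.rows]

    def min_key(self) -> int:
        return self.keys[0]

    def max_key(self) -> int:
        return self.keys[-1]

    def get(self, key: int) -> Optional[Any]:
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.rows[i][1]
        return None


class RootNode:
    """Presents two tables with disjoint key ranges as one table.

    Construction stores two references and a boundary key, so it does not
    copy or touch any rows.
    """

    def __init__(self, left: SortedTable, right: SortedTable) -> None:
        # The ranges must be disjoint; overlapping keys (for example,
        # leftover orphans) would make routing by boundary incorrect.
        if left.max_key() >= right.min_key():
            raise ValueError("key ranges overlap")
        self.left = left
        self.right = right
        self.boundary = right.min_key()

    def get(self, key: int) -> Optional[Any]:
        child = self.right if key >= self.boundary else self.left
        return child.get(key)

    def scan(self) -> Iterator[Row]:
        yield from self.left.rows
        yield from self.right.rows


left = SortedTable([(1, "a"), (3, "c")])
right = SortedTable([(10, "x"), (12, "z")])
merged = RootNode(left, right)
```

The overlap check mirrors the requirement that merge operates on disjoint ranges, which is why orphans must be cleaned up first.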

General benefits of clustering include:

  • Faster lookup and range scans by primary key because you don't need to go through the primary key index.
  • Faster orphan filtering for covered local index queries because local index entries contain the primary key.

One downside is that clustering may consume more space if there are local secondary indexes. With clustering, each local index entry contains the primary key, whereas without clustering the primary key index reduces the number of copies of each primary key value.

Cast of Characters

Documentation

Product Description
Scope Document
Technical Design Document



 Comments   
Comment by Githook User [ 16/May/22 ]

Author: Jeff Yemin <jeff.yemin@mongodb.com> (jyemin)

Message: Add serverless: forbid for collection management tests (#1216)

DRIVERS-1853
DRIVERS-2294
Branch: master
https://github.com/mongodb/specifications/commit/d1458823bd810014df9da16d3a5354d2269ab865

Comment by Githook User [ 27/Apr/22 ]

Author: Abraham Egnor <abraham.egnor@mongodb.com> (abr-egn)

Message: DRIVERS-1853 Add tests and update spec to support API changes for clustered collections (#1182)
Branch: master
https://github.com/mongodb/specifications/commit/b042e47e1f978950030dd678134cdeed8693c748

Comment by PM Bot [ 18/Aug/21 ]

Moved to Needs Triage because a linked PM issue (PM-2311) was moved to Ready for Work.
