[SERVER-3294] Ability to keep data on disk in ~ index order Created: 19/Jun/11  Updated: 06/Dec/22  Resolved: 13/Sep/21

Status: Closed
Project: Core Server
Component/s: Index Maintenance, MMAPv1, Sharding, Storage
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Benjamin Darfler Assignee: Backlog - Storage Execution Team
Resolution: Won't Do Votes: 16
Labels: gm-ack
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
is duplicated by SERVER-15354 Implementing cluster index for RocksDB Closed
is duplicated by SERVER-31357 Using Indexes To Sort Huge Data Closed
Related
is related to SERVER-9114 Add balancer strategy that balances b... Blocked
Assigned Teams:
Storage Execution
Participants:

 Description   

If all data points in a chunk were located sequentially on disk, then migrations could be done using only sequential IO rather than random IO. Additionally, the data files would not fragment as they currently do when chunks from different key ranges are intermingled.



 Comments   
Comment by Geert Bosch [ 08/Mar/21 ]

I think we should close this as Won't Do. With WiredTiger, and presumably any other advanced modern storage engine, the storage engine manages its own I/O. The nature of this I/O, together with OS and filesystem abstractions and changing hardware characteristics such as device-level block remapping for defects or wear leveling of flash memory, makes "disk order" not well defined.

Clustered indexes will ensure there is some locality when accessing documents in the same neighborhood according to the _id index order, whether the clusters are in disk order or not.
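
For concreteness, a minimal sketch of creating such a clustered collection, assuming PyMongo against a MongoDB version that supports the clusteredIndex option; the connection string, database, and collection names are placeholders:

from pymongo import MongoClient

# Placeholder connection string and database/collection names.
client = MongoClient("mongodb://localhost:27017")
db = client["test"]

# Extra keyword arguments to create_collection are passed through to the
# server's createCollection command; clusteredIndex stores documents in the
# collection's table keyed directly by _id instead of by a hidden RecordId.
db.create_collection(
    "events",
    clusteredIndex={"key": {"_id": 1}, "unique": True},
)

# Range scans on _id then read documents in _id order, whatever the physical
# disk layout underneath turns out to be.
for doc in db["events"].find({"_id": {"$gte": 0, "$lt": 1000}}).sort("_id", 1):
    print(doc)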

Comment by Louis Williams [ 08/Mar/21 ]

While this is related to clustered indexes, this request asks for collections not just to be logically in _id order, but to be physically stored on disk in _id order. This is really a storage format request, so I am putting it back in the backlog.

Comment by Mark Callaghan [ 18/Oct/15 ]

I don't see SSD replacing disk. Some people don't want to pay the cost difference; others need devices that can sustain higher write rates. For the same reason, many deployments won't run on servers with 1 TB of RAM to cache their entire database. Just because they can doesn't mean they will pay to do so.

However, there is efficiency to be gained with proper support for a clustered index. I hope that MongoDB can remove the hidden index used by RocksDB & WiredTiger. When running Linkbench on MongoDB, the insert rate is much slower than on MySQL, and part of the reason is the extra indexes.
http://smalldatum.blogspot.com/2015/07/linkbench-for-mysql-mongodb-with-cached.html
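
As a rough illustration of the cost that extra secondary indexes add to inserts (this does not model Linkbench or the hidden RecordId index itself), assuming PyMongo and a local mongod; the database, collection, and field names are made up:

import time
from pymongo import MongoClient

# Placeholder connection string, database, and field names.
client = MongoClient("mongodb://localhost:27017")
db = client["bench"]
db.drop_collection("plain")
db.drop_collection("indexed")

indexed = db["indexed"]
indexed.create_index("a")
indexed.create_index("b")

docs = [{"_id": i, "a": i % 97, "b": i % 101} for i in range(100_000)]

# Same batch, with and without secondary indexes; the indexed collection
# pays for two extra index updates per document.
start = time.perf_counter()
db["plain"].insert_many(docs)
plain_secs = time.perf_counter() - start

start = time.perf_counter()
indexed.insert_many(docs)
indexed_secs = time.perf_counter() - start

print(f"no secondary indexes: {plain_secs:.2f}s, two secondary indexes: {indexed_secs:.2f}s")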

Comment by Markus Mahlberg [ 18/Oct/15 ]

As far as I can see, this would only show real benefit for mmapv1 storage on spinning disks – WiredTiger is COW anyway. And SSDs are highly encouraged for production (for a reason).

So, forgive my French, this would only benefit users who are trying to save a few bucks by using spinning disks. And of those, "only" the ones using the mmapv1 storage engine, since on WiredTiger the documents would no longer be in index order after the first change to a document.

Changing the way chunk splits and migrations work would impact all users, most likely with negative side effects on performance. Admittedly, this would only affect collections with a bad shard key that causes frequent chunk migrations, but it would presumably make the situation worse.

Another point is that sooner or later spinning disks will be replaced by SSDs. So this feature would basically exist to let people stick with a "dying" technology to save money. I am not sure whether that makes sense. In my experience, using spinning disks makes disk IO the bottleneck of a deployment, requiring more shards to be added to handle the total load. In at least two instances I was able to halve the number of shards for my customers by moving a few shard nodes to SSDs and increasing the RAM accordingly, saving a great deal of money each month.

That being said: if in doubt, and for my part I very much am, adhering to KISS is the better choice here.

Comment by Kevin J. Rice [ 06/Aug/13 ]

The average Linux user will have the default 'cfq' IO scheduler. It favors sequential IO over random IO, since it delays requests to see whether any of them target the same area of the disk. An experienced/advanced MongoDB user will configure their mounts to use 'noop' to speed things up (for data-file writing, not for journal writing).
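
A small sketch of checking and switching the scheduler via sysfs, assuming Linux and a block device named "sda" (both are placeholders); changing it requires root, and on newer kernels the scheduler names differ (e.g. "none" instead of "noop"):

from pathlib import Path

def current_scheduler(device: str = "sda") -> str:
    # The active scheduler is shown in brackets, e.g. "noop deadline [cfq]".
    text = Path(f"/sys/block/{device}/queue/scheduler").read_text()
    return text.split("[")[1].split("]")[0]

def set_scheduler(device: str = "sda", scheduler: str = "noop") -> None:
    # Requires root; applies to the whole block device, not a single mount.
    Path(f"/sys/block/{device}/queue/scheduler").write_text(scheduler)

if __name__ == "__main__":
    print(current_scheduler())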

With this in mind, when migrating (and thus cloning) a chunk, the receiving shard could write the cloned data in index order. Any chunk migration would then automatically rewrite the data in index order, solving this case. This would speed up IO, since it would be a sequential write of more than one document at a time. Ideally, the core server code would batch the write as a single sequential run of bytes rather than one document at a time, if this is not already the case.

So, to solve this, just:

(a) ensure that migrated chunk data is written in index order, and
(b) stimulate the balancer to move all the chunks at least once (a rough sketch of (b) follows below).
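
A rough sketch of step (b), assuming PyMongo, a mongos at a placeholder address, a placeholder namespace, at least two shards, and an older config.chunks schema keyed by namespace ("ns"); it simply relocates each chunk once so that, per (a), the receiving shard rewrites it in index order:

from pymongo import MongoClient

# Placeholder mongos address and namespace.
client = MongoClient("mongodb://mongos.example.net:27017")
ns = "mydb.mycoll"

# Shard names live in config.shards.
shards = [s["_id"] for s in client.config.shards.find()]

for chunk in client.config.chunks.find({"ns": ns}):
    # Move the chunk to any shard other than its current owner, so the
    # receiving shard clones it afresh.
    target = next(s for s in shards if s != chunk["shard"])
    client.admin.command(
        "moveChunk",
        ns,
        bounds=[chunk["min"], chunk["max"]],
        to=target,
    )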

Comment by Nick Gerner [ 18/Oct/12 ]

Any updates on this? I would love to have clustered indexes. I'm currently running large batch inserts and trying to insert documents in roughly the order I want on disk, but I have an index that reflects the actual order in which I need to do large range scans; if that index could be clustered, it would be even better.

However, that index is not on the shard key in our case (we want to spread large reads over all the shards to get better parallelism and reduce latency). Does that mean you're not planning on supporting our scenario?

Comment by Eliot Horowitz (Inactive) [ 20/Jun/11 ]

We're planning on doing something similar to this, but not quite the same.
We're not going to do physical files per chunk for a few reasons:

  • splits become expensive
  • for indexes, we would either need one index per file, which makes searching very expensive, or a single shared index, which then becomes the bottleneck
Comment by Benjamin Darfler [ 19/Jun/11 ]

However, those reads and writes could be sequential.

Comment by Benjamin Darfler [ 19/Jun/11 ]

This might be accomplished with separate files per chunk. A chunk split would then be more expensive, as the chunk would have to be read in and written out according to the new key ranges.
