[SERVER-33556] range scan query optimizing Created: 28/Feb/18  Updated: 23/Apr/18  Resolved: 26/Mar/18

Status: Closed
Project: Core Server
Component/s: Querying, Sharding
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Matthew Kruse Assignee: Kyle Suarez
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-13065 Consider a collection scan even if in... Backlog
Participants:

 Description   

I'm going to try and describe an IXSCAN performance optimization.

Imagine a sharded collection scenario. This collection is using 2TB of storage on disk. All queries to this collection will do index range scans on the shard key. These range scans will very often span multiple chunks on disk (10, 20, or more). These queries will also have some regex or other filters on parts of the document that are not in the shard key. Basically, once a document is found to meet the shard key bounds, we always have to inspect the contents of the document to know if it should be returned or not.

In scenarios like this one, mongo's query optimizer will have each replica set execute an IXSCAN operation to find and filter the documents. For performance reasons, I believe that in scenarios like this one Mongo should always do a full collection scan, as the chunk shard key bounds effectively make IXSCAN operations unnecessary. We already know that every document, or a large portion of the documents, in the chunk is going to have to be scanned. In cases like this a COLLSCAN operation is far more efficient.
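
To make the trade-off above concrete, here is a toy cost comparison. It is only a sketch: the three cost constants and the function names are assumptions chosen for illustration, and this is not how MongoDB's query planner actually chooses plans. It shows that once the shard key bounds cover a large enough fraction of the documents, paying an index lookup plus a document fetch per match can cost more than reading everything sequentially.

```python
# Toy cost model. All constants are assumed values for illustration only;
# they are not taken from MongoDB and the real planner does not work this way.
SEQ_READ_COST = 1.0      # assumed cost to read one document in storage order
RANDOM_FETCH_COST = 4.0  # assumed cost to fetch one document via an index entry
INDEX_KEY_COST = 0.5     # assumed cost to examine one index key


def collscan_cost(total_docs: int) -> float:
    """Every document is read once, sequentially."""
    return total_docs * SEQ_READ_COST


def ixscan_cost(total_docs: int, fraction_in_bounds: float) -> float:
    """Each key inside the bounds is examined, then its document is fetched."""
    keys_in_bounds = total_docs * fraction_in_bounds
    return keys_in_bounds * (INDEX_KEY_COST + RANDOM_FETCH_COST)


def cheaper_plan(total_docs: int, fraction_in_bounds: float) -> str:
    """Return the plan name with the lower estimated cost."""
    if ixscan_cost(total_docs, fraction_in_bounds) < collscan_cost(total_docs):
        return "IXSCAN"
    return "COLLSCAN"


print(cheaper_plan(1_000_000, 0.05))  # IXSCAN
print(cheaper_plan(1_000_000, 1.0))   # COLLSCAN
```

With these assumed constants the break-even point is a little over 22% of the documents; different hardware or storage layouts would move that point.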

I've seen this behavior happen on shard key range scans that cover small percentages of the documents in a collection. I've also seen the optimizer pick this behavior when the query bounds would target every document/chunk in a sharded collection. In both of these cases full collection scanning is the best option.

Ideally what I think should happen is:
1. Mongos figures out which data chunks have data for the range bounds of the query on the shard key, like it currently does
2. Mongos sends the query down to mongod
3. Mongod's optimizer recognizes that a collection scan is more efficient and does that instead of an index scan
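
Step 1 above can be sketched as a range-overlap check. This is a simplified illustration, not mongos code: the `Chunk` type, the integer shard key, and the shard names are all hypothetical. It relies only on the fact that a chunk covers a shard key range with an inclusive lower bound and an exclusive upper bound.

```python
from typing import List, NamedTuple


class Chunk(NamedTuple):
    min_key: int  # inclusive lower bound of the chunk's shard key range
    max_key: int  # exclusive upper bound of the chunk's shard key range
    shard: str    # hypothetical name of the shard that owns the chunk


def chunks_for_range(chunks: List[Chunk], lo: int, hi: int) -> List[Chunk]:
    """Return the chunks whose [min_key, max_key) range overlaps the query range [lo, hi]."""
    return [c for c in chunks if c.min_key <= hi and lo < c.max_key]


def shards_for_range(chunks: List[Chunk], lo: int, hi: int) -> List[str]:
    """Return the sorted, de-duplicated shards that must receive the query."""
    return sorted({c.shard for c in chunks_for_range(chunks, lo, hi)})


# Hypothetical chunk layout for illustration.
layout = [
    Chunk(0, 100, "shardA"),
    Chunk(100, 200, "shardB"),
    Chunk(200, 300, "shardA"),
]

print(shards_for_range(layout, 10, 20))    # ['shardA']
print(shards_for_range(layout, 150, 250))  # ['shardA', 'shardB']
```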

If step 3 can't happen, maybe there could be a special query hint: not a full collection scan hint, but one that says to do a full scan of whatever data chunks are left after filtering out all the unnecessary chunks using the bounds provided on the shard key.
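
For comparison, the closest existing mechanism is the `$natural` hint, which forces a collection scan. The sketch below only builds a `find` command document and does not contact a server; the helper name, the collection name `events`, the shard key `ts`, and the regex filter are hypothetical. Note the difference from the hint proposed above: as far as I understand, `$natural` scans the whole collection on each shard that receives the query, not only the targeted chunks.

```python
def find_with_collscan_hint(collection: str, shard_key: str, lo, hi,
                            extra_filter: dict) -> dict:
    """Build a `find` command that bounds the shard key and hints a collection scan.

    Hypothetical helper for illustration; it only assembles a dict.
    """
    query = {shard_key: {"$gte": lo, "$lte": hi}}  # range bounds on the shard key
    query.update(extra_filter)                     # filters on non-shard-key fields
    return {"find": collection, "filter": query, "hint": {"$natural": 1}}


cmd = find_with_collscan_hint(
    "events",                        # hypothetical collection name
    "ts",                            # hypothetical shard key field
    1000,
    2000,
    {"title": {"$regex": "^foo"}},   # hypothetical non-shard-key filter
)
print(cmd)
```

A document like this could be passed to a driver's generic command runner (for example `Database.command` in PyMongo), assuming a connected client.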

If you need more info or don't understand what I'm trying to describe, I'm happy to go into even more detail.



 Comments   
Comment by Kyle Suarez [ 26/Mar/18 ]

Hey mkruse@adobe.com,

In a sharded cluster, mongos will perform shard targeting to target only those shards that contain chunks relevant for the query. After that, it forwards the command to those servers and it's up to them to decide what plan is best. I'd say that SERVER-13065 would be the general-case solution for both sharded and unsharded setups, so I'm going to close this as a duplicate. You can watch that ticket for updates.

Thanks for taking the time to file this improvement and make MongoDB better.

Regards,
Kyle

Comment by Matthew Kruse [ 23/Mar/18 ]

Kyle, the issue you linked to this one is effectively the same problem. The optimizer needs to do a better job of recognizing when to abandon an index range scan and switch to a collection scan in certain cases. I could see this happening in sharded and unsharded mongo setups.

Assuming mongod can do chunk pruning based off a shard key or an index range, the other bug you referenced is effectively the same problem. If mongod behaves differently in this respect in a sharded or unsharded setup, then this issue is distinct.

I think treating an unsharded setup's _id index as the 'shard key' is the same thing as a sharded configuration as data chunks are built in ranges off of these in both cases.

Comment by Kyle Suarez [ 23/Mar/18 ]

Hi mkruse@adobe.com,

Even in non-sharded environments, it would make sense to consider collection scans when an index scan would be unselective, which is described in SERVER-13065. Would that satisfy your feature request? If so, I'd like to close this ticket as a duplicate to track the improvement in one place.

Best,
Kyle

Comment by Kelsey Schubert [ 02/Mar/18 ]

Thanks for the improvement request, mkruse@adobe.com. I've sent it to the Sharding Team for consideration.
