[SERVER-77899] remove repeat contain index from the relevant candidate indexes, this can avoid some useless calculations Created: 08/Jun/23  Updated: 24/Jun/23  Resolved: 22/Jun/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: y yz Assignee: Backlog - Query Optimization
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Query Optimization
Participants:

 Description   

remove repeat contain index from the relevant candidate indexes, this can avoid some useless calculations.
for example:
   {a:1, b:1} contain {a:1}, so we can remove index {a:1} from the candidates
   {a:"hashed", b:1} contain {a:"hashed"}, so we can remove index {a:"hahsed"} from the candidates

 {a:1, b:1} is better than {a:1},because both cases are satisfied for db.collection.find({a:xx}) 
    and db.collection.find({a:xx,b:xx})



 Comments   
Comment by y yz [ 24/Jun/23 ]

(y)got it, thanks.

Comment by Ben Shteinfeld [ 23/Jun/23 ]

> why it's not safe to choose

{a:1, b:1}

over

{a:1}

It is not a query correctness/safety issue as much as it is a performance one. Consider a situation where the field 'b' is multikey (the 'b' field contains arrays). This means that any given document, the

{a:1, b:1}

index has duplicate entries for 'a' for each element in the 'b' array. If you have a query which only references 'a', then using the

{a:1}

index is preferable to the

{a:1, b:1}

index because we avoid examining duplicate key entries.

Comment by y yz [ 23/Jun/23 ]

hi, Steve Tarzia. Thank you for your reply, but I have some different ideas.
1. We have no control over user behavior.
    I'm from tencent cloud mongodb team. According to the statistics of tencent cloud mongodb, in replication cluster: About 5 to 10 percent of cluster have this proble(have over duplicate indexes).

    in sharding cluster: many times, although the user not create duplicate indexes explicitly, but sh.shardCollection() may make a duplicate implicitly. For example, the following scenario is very common in production environments:

mongos> 
mongos> use tradeDB
switched to db tradeDB
mongos> 
mongos> 
mongos> sh.enableSharding("tradeDB")
{
        "ok" : 1,
        "$clusterTime" : {
                "clusterTime" : Timestamp(1687534114, 2),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        },
        "operationTime" : Timestamp(1687534114, 1)
}
mongos> 
mongos> sh.shardCollection("tradeDB.userTradeInfo",{"userid":1})
{
        "collectionsharded" : "tradeDB.userTradeInfo",
        "ok" : 1,
        "$clusterTime" : {
                "clusterTime" : Timestamp(1687534139, 13),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        },
        "operationTime" : Timestamp(1687534139, 9)
}
mongos> show tables
userTradeInfo
mongos> db.userTradeInfo.getIndexes()
[
        {
                "v" : 2,
                "key" : {
                        "_id" : 1
                },
                "name" : "_id_"
        },
        {
                "v" : 2,
                "key" : {
                        "userid" : 1
                },
                "name" : "userid_1"
        }
]
mongos> 
mongos> 
mongos>  db.userTradeInfo.createIndex({userid:1, time:1})
{
        "raw" : {
                "mongodb_5.0_shard2/9.134.242.181:19100,9.134.242.181:19101" : {
                        "numIndexesBefore" : 2,
                        "numIndexesAfter" : 3,
                        "createdCollectionAutomatically" : false,
                        "commitQuorum" : "votingMembers",
                        "ok" : 1
                }
        },
        "ok" : 1,
        "$clusterTime" : {
                "clusterTime" : Timestamp(1687534164, 1),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        },
        "operationTime" : Timestamp(1687534164, 1)
}
mongos> 
mongos> 
mongos> db.userTradeInfo.getIndexes()
[
        {
                "v" : 2,
                "key" : {
                        "_id" : 1
                },
                "name" : "_id_"
        },
        {
                "v" : 2,
                "key" : {
                        "userid" : 1
                },
                "name" : "userid_1"
        },
        {
                "v" : 2,
                "key" : {
                        "userid" : 1,
                        "time" : 1
                },
                "name" : "userid_1_time_1"
        }
]
mongos> 

       the tradeDB.userTradeInfo storage user's history trade information, the shard key is "userid" that can ensure the data of the same user resides in the same shard; the main query sql is "db.userTradeInfo.find({userid:xxxx, time:{$gte:2022-xx-xx, $lte:2022-xx-xx}})", so the dba create index {userid:1, time:1}.  for the query sql "db.userTradeInfo.find({userid:xxxx, time:{$gte:2022-xx-xx, $lte:2022-xx-xx}})", if not delete index {userid:1}, the two index {userid:1} and {userid:1, time:1}  are the candidate index. befor the plan is cached,two indexes need to be scored; after the plan is cached, the all query shold use CachedPlanStage::pickBestPlan to determine whether  replan is necessary.

 

   if we delete the contain duplicate index(here is {userid:1}), Only one index plan({userid:1, time:1}) is available, this saves a lot of computing resources and improves performance at the same time
   

2. if query sql's candidate indexes contain {a:1} and {a:1, b:1}, the {a:1, b:1}'s rank score must >= {a:1}'s rank score.

     Associative indexing principle(the leftmost rule),I think the {a:1, b:1}'s rank score must >= {a:1}'s rank score certainly.

     if the rank score of the two index is same,review the index query code, the first candidate index is selected as the best index by default.  if we create index {a:1, b:1},  and then create index {a:1},  if the rank score of the two index is same,  the {a:1, b:1} will be the best winning index, the test result is Consistent, as following:

mongos> db.userTradeInfo.getIndexes()
[
        {
                "v" : 2,
                "key" : {
                        "_id" : 1
                },
                "name" : "_id_"
        },
        {
                "v" : 2,
                "key" : {
                        "userid" : 1
                },
                "name" : "userid_1"
        },
        {
                "v" : 2,
                "key" : {
                        "userid" : 1,
                        "time" : 1
                },
                "name" : "userid_1_time_1"
        }
]
mongos> 
mongos> db.test.find({a:1,b:1})
mongos> 
 
 
the logs:
{"t":{"$date":"2023-06-24T00:00:49.606+08:00"},"s":"D2", "c":"QUERY",    "id":20971,   "ctx":"conn50","msg":"Relevant index","attr":{"indexNumber":0,"index":"kp: { a: 1.0, b: 1.0 } name: '(a_1_b_1, )' io: { v: 2, key: { a: 1.0, b: 1.0 }, name: \"a_1_b_1\" }"}}
{"t":{"$date":"2023-06-24T00:00:49.606+08:00"},"s":"D2", "c":"QUERY",    "id":20971,   "ctx":"conn50","msg":"Relevant index","attr":{"indexNumber":1,"index":"kp: { a: 1.0 } name: '(a_1, )' io: { v: 2, key: { a: 1.0 }, name: \"a_1\" }"}}
{"t":{"$date":"2023-06-24T00:00:49.608+08:00"},"s":"D2", "c":"QUERY",    "id":20957,   "ctx":"conn50","msg":"Scoring query plan","attr":{"planSummary":"IXSCAN { a: 1, b: 1 }","planHitEOF":true}}
{"t":{"$date":"2023-06-24T00:00:49.608+08:00"},"s":"D2", "c":"QUERY",    "id":20961,   "ctx":"conn50","msg":"Score formula","attr":{"formula":"score(1.7502) = baseScore(1) + productivity((3 advanced)/(4 works) = 0.75) + tieBreakers(0 noFetchBonus + 0.0001 noSortBonus + 0.0001 noIxisectBonus = 0.00020000000000000001)"}}
{"t":{"$date":"2023-06-24T00:00:49.608+08:00"},"s":"D2", "c":"QUERY",    "id":20957,   "ctx":"conn50","msg":"Scoring query plan","attr":{"planSummary":"IXSCAN { a: 1 }","planHitEOF":true}}
{"t":{"$date":"2023-06-24T00:00:49.608+08:00"},"s":"D2", "c":"QUERY",    "id":20961,   "ctx":"conn50","msg":"Score formula","attr":{"formula":"score(1.7502) = baseScore(1) + productivity((3 advanced)/(4 works) = 0.75) + tieBreakers(0 noFetchBonus + 0.0001 noSortBonus + 0.0001 noIxisectBonus = 0.00020000000000000001)"}}
{"t":{"$date":"2023-06-24T00:00:49.608+08:00"},"s":"D2", "c":"QUERY",    "id":20591,   "ctx":"conn50","msg":"Winning plan","attr":{"planSummary":"IXSCAN { a: 1, b: 1 }"}}
{"t":{"$date":"2023-06-24T00:00:49.608+08:00"},"s":"D1", "c":"QUERY",    "id":20937,   "ctx":"conn50","msg":"Creating inactive cache entry for query","attr":{"query":"ns: tradeDB.test query: { a: 1.0, b: 1.0 } sort: {} projection: {}","queryHash":"43CAB4C5","planCacheKey":"48C5AC81","newWorks":4}}
 

 

I don't know why it's not safe to choose {a:1, b:1} over {a:1}, Can you give me an example,thanks。

 

 

Comment by Steve Tarzia [ 22/Jun/23 ]

Hi 1147952115@qq.com , thank you for your contribution.  We are not going to accept this change at this time.  We don't expect customers to define both {a:1} and {a:1, b:1} indexes so we don't want to add code to our system to address this case.  Also, it's not clear that {a:1} and {a:1, b:1} execution performance will be the same, so it's not safe to choose {a:1, b:1} over {a:1}.

Comment by y yz [ 16/Jun/23 ]

In addition,This feature helps to avoid index skew。

for example:

  Let's say we have two indexes({a:1}, {a:1, b:1}), the sql is db.collection.find({a:xxx,b:{$tge:xxx, $lte:xxx}})。if The first few times, we shoud get total count that {a=xxx},In this case, the score is the same for both indexes,it may choice {a:1} as the winning index. 

   As time goes on,the {a:1} index plan‘s(getPlanCache.list) works may be larger, if the sql db.collection.find({a:xxx,b:{$tge:xxx, $lte:xxx}}), b filed arrow down, CachedPlanStage::pickBestPlan function can not trigger replan。

Comment by y yz [ 09/Jun/23 ]

Hi eric.sedor@mongodb.com, As in SERVER-71627, I have signed the [Contributor's Agreement].

thanks

Comment by Eric Sedor [ 08/Jun/23 ]

Hi 1147952115@qq.com; As in SERVER-77901 we'd like you to sign our [Contributor's Agreement](https://www.mongodb.com/legal/contributor-agreement) for us to consider this PR.

Comment by y yz [ 08/Jun/23 ]

the code address: https://github.com/mongodb/mongo/pull/1552

 

this is an improved feature, but Ican not choose "Improvement" button.

 

Performance will be affected when there are many repeat contain indexes.  In addition,{a:1, b:1} is better than {a:1},because both cases are satisfied for db.collection.find({a:xx}) and db.collection.find({a:xx,b:xx}),it can also reduce the number of replan.

Generated at Thu Feb 08 06:36:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.