[SERVER-22528] Allow limit to be pushed down to text stage in query plans Created: 09/Feb/16  Updated: 27/Dec/23

Status: Backlog
Project: Core Server
Component/s: Text Search
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Saleem Assignee: Backlog - Query Integration
Resolution: Unresolved Votes: 1
Labels: qi-text-search
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File dataset.txt     File explain.json     Text File explain_term_terribly (post_warmup).txt     Text File explain_term_terribly (pre_warmup).txt     Text File explain_term_unfortunately (post_warmup).txt     Text File explain_term_unfortunately (pre_warmup).txt     Text File getIndexes.txt     Text File messages-stats.txt     PNG File mongodb-resources_usage.png     PNG File mongodb-system-utilization.png     PNG File mongodb-threads.png     Text File questionnaire.txt     Zip Archive sample_data_and_system_log.zip     Text File sample_doc.txt     Text File serverBuildInfo.txt     Text File systemInfo.txt    
Issue Links:
Duplicate
Related
related to SERVER-17537 FTS should short circuit an OR search... Backlog
Assigned Teams:
Query Integration
Participants:

 Description   

I have roughly 3.2 million documents and have created a text index on one field. It took me about an hour to build the index.

I notice that search is very slow. I'm executing the following statement:

db.messages.explain("executionStats").find({$text:{$search:"Unfortunately"}},{"body":1}).limit(2)

and it's taking roughly 2 minutes to return results, which is not acceptable for any serious use. I have attached the full explain output for your review.

I'm wondering if there is any way to improve its performance?



 Comments   
Comment by Saleem [ 12/Feb/16 ]

@Thomas, thanks for the update. Query performance improved once the entire dataset was loaded into memory, but the major problem remains: what if the data is not in memory? In that case it's essentially equivalent to a document scan that doesn't use any index.

Ironically, the text index is not being loaded into memory, so everything happens on disk. Please see the pre- and post-warmup explain results.

Comment by Kelsey Schubert [ 12/Feb/16 ]

Hi sixthsense,

Thanks for the additional information. From your new explain results, I can see that we are looking at a similar dataset, since we are examining the same number of documents. The number of documents examined is the number of documents containing the search term.

The new explain output you provided showed faster query times, especially after the working set was loaded into memory.

In the current implementation of the query system, the limit cannot be pushed down to bound the number of index keys that the TEXT stage examines. Therefore, I am marking this ticket as an improvement request and sending it to the query team to be scheduled during the next round of planning. Please continue to watch this ticket for updates.
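The gap between nReturned and totalKeysExamined in the attached explain output makes the cost of the missing pushdown concrete. As a minimal sketch (the scanOverhead helper below is illustrative, not part of any MongoDB API), the per-result overhead can be computed from an explain("executionStats") document:

```javascript
// Illustrative helper (not a MongoDB API): how many index keys the TEXT
// stage walked per document actually returned. Because LIMIT applies
// only after the TEXT stage has consumed every matching index key, this
// ratio can be large even for tiny limits.
function scanOverhead(executionStats) {
  return executionStats.totalKeysExamined /
         Math.max(executionStats.nReturned, 1);
}
```

With the numbers from this ticket (6226 keys examined, 2 documents returned), the overhead is 3113 keys scanned per returned document.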

Kind regards,
Thomas

Comment by Saleem [ 10/Feb/16 ]

Asya, thanks for the clarification. Yes, you are correct: MongoDB is blazing fast when data is inserted in parallel. I used the following code to pump data in:

https://gist.github.com/saleem-mirza/6e5018283a77ecbb1511

Comment by Asya Kamsky [ 10/Feb/16 ]

One thing I noticed is that while inserting documents in a loop, my system usage is next to nothing... Is mongod a single-threaded app?

No, but since you are inserting documents in a loop, you are sending them to mongod one at a time; when the load is single-threaded, there is nothing for multiple threads in mongod to do. You might try making your inserts multithreaded.
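A related way to cut per-document round trips is to batch the inserts. As a hedged sketch (the chunk helper below is illustrative; insertMany is a real mongo shell method, available since 3.2), documents can be grouped client-side and sent a batch at a time:

```javascript
// Illustrative helper: split an array of documents into batches of
// `size` so each server round trip carries many documents instead of one.
function chunk(docs, size) {
  var batches = [];
  for (var i = 0; i < docs.length; i += size) {
    batches.push(docs.slice(i, i + size));
  }
  return batches;
}

// Usage in the mongo shell (assumes `docs` holds the generated documents;
// in practice you would generate and flush batches in a stream rather
// than holding millions of documents in memory):
// chunk(docs, 1000).forEach(function (batch) {
//   db.messages.insertMany(batch);
// });
```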

Comment by Saleem [ 10/Feb/16 ]

in explain()

 "executionStats" : {
                "executionSuccess" : true,
                "nReturned" : 2,
                "executionTimeMillis" : 66,
                "totalKeysExamined" : 6226,
                "totalDocsExamined" : 6226,
                ...

If the search term is found in more documents than we are asking for, shouldn't totalKeysExamined and totalDocsExamined be no more than nReturned?

Comment by Saleem [ 10/Feb/16 ]

@Thomas, I have attached the required information plus some observations. If you still can't reproduce this issue, please email me and we can set up a live remote session.

Which storage engine are you using, WiredTiger or MMAPv1?

– WiredTiger

Can you please post the output of db.messages.stats()?

– Please see messages-stats.txt

Can you please post the output of db.collection.getIndexes()?

– Please see getIndexes.txt

Can you post the document that was inserted 4 million times?

– Please see sample_doc.txt

Did you set any options during the mongorestore?

– No

Can you post the output of the following?

db.messages.explain("executionStats").find({$text:{$search:"terribly"}},{"body":1}).limit(2)

– There are four outputs: pre- and post-warmup results for each of the terms terribly and Unfortunately.

Comment by Saleem [ 10/Feb/16 ]

Observations:

  • explain_term_terribly (pre_warmup).txt
  • explain_term_unfortunately (pre_warmup).txt

I don't see any memory pressure when executing the suggested queries.
This shows that nothing is loaded in memory and everything is being read from disk. In my case, the system is consuming around 3.3 GB of memory.

The following warmup script was run before the text search; it essentially reads the last record (the skip forces a full collection scan, which pulls the collection into the cache):

 
var numMsgs = db.messages.count();
db.messages.find().skip(numMsgs-1).limit(1);

As you can see, query performance is strikingly different. I also see a spike in memory consumption: my system now shows 11 GB of memory in use, an increase of 11 - 3.3 = 7.7 GB. This shows that MongoDB has loaded the working set into memory, and queries are blazing fast.

  • explain_term_terribly (post_warmup).txt
  • explain_term_unfortunately (post_warmup).txt

NOTE: The MongoDB instance was stopped and restarted before performing any query operations, so there was no cached data!

Comment by Kelsey Schubert [ 09/Feb/16 ]

Hi sixthsense,

After following the steps to create the dataset, I am not seeing this issue. My collection has 4501513 documents. I executed

db.messages.explain("executionStats").find({$text:{$search:"Unfortunately"}},{"body":1}).limit(2)

and saw the following output:

        "executionStats" : {
                "executionSuccess" : true,
                "nReturned" : 2,
                "executionTimeMillis" : 66,
                "totalKeysExamined" : 6226,
                "totalDocsExamined" : 6226,
                "executionStages" : {
                        "stage" : "LIMIT",
                        "nReturned" : 2,
                        "executionTimeMillisEstimate" : 70,
                        "works" : 6231,
                        "advanced" : 2,
                        "needTime" : 6228,
                        "needYield" : 0,
                        "saveState" : 48,
                        "restoreState" : 48,
                        "isEOF" : 1,
                        "invalidates" : 0,
                        "limitAmount" : 2,
                        "inputStage" : {
                                "stage" : "PROJECTION",
                                "nReturned" : 2,
                                "executionTimeMillisEstimate" : 70,
                                "works" : 6230,
                                "advanced" : 2,
                                "needTime" : 6228,
                                "needYield" : 0,
                                "saveState" : 48,
                                "restoreState" : 48,
                                "isEOF" : 0,
                                "invalidates" : 0,
                                "transformBy" : {
                                        "body" : 1
                                },
                                "inputStage" : {
                                        "stage" : "TEXT",
                                        "nReturned" : 2,
                                        "executionTimeMillisEstimate" : 70,
                                        "works" : 6230,
                                        "advanced" : 2,
                                        "needTime" : 6228,
                                        "needYield" : 0,
                                        "saveState" : 48,
                                        "restoreState" : 48,
                                        "isEOF" : 0,
                                        "invalidates" : 0,
                                        "indexPrefix" : {
 
                                        },
                                        "indexName" : "body_text",
                                        "parsedTextQuery" : {
                                                "terms" : [
                                                        "unfortun"
                                                ],
                                                "negatedTerms" : [ ],
                                                "phrases" : [ ],
                                                "negatedPhrases" : [ ]
                                        },
                                        "inputStage" : {
                                                "stage" : "TEXT_MATCH",
                                                "nReturned" : 2,
                                                "executionTimeMillisEstimate" : 70,
                                                "works" : 6230,
                                                "advanced" : 2,
                                                "needTime" : 6228,
                                                "needYield" : 0,
                                                "saveState" : 48,
                                                "restoreState" : 48,
                                                "isEOF" : 0,
                                                "invalidates" : 0,
                                                "docsRejected" : 0,
                                                "inputStage" : {
                                                        "stage" : "TEXT_OR",
                                                        "nReturned" : 2,
                                                        "executionTimeMillisEstimate" : 70,
                                                        "works" : 6230,
                                                        "advanced" : 2,
                                                        "needTime" : 6228,
                                                        "needYield" : 0,
                                                        "saveState" : 48,
                                                        "restoreState" : 48,
                                                        "isEOF" : 0,
                                                        "invalidates" : 0,
                                                        "docsExamined" : 6226,
                                                        "inputStage" : {
                                                                "stage" : "IXSCAN",
                                                                "nReturned" : 6226,
                                                                "executionTimeMillisEstimate" : 0,
                                                                "works" : 6227,
                                                                "advanced" : 6226,
                                                                "needTime" : 0,
                                                                "needYield" : 0,
                                                                "saveState" : 48,
                                                                "restoreState" : 48,
                                                                "isEOF" : 1,
                                                                "invalidates" : 0,
                                                                "keyPattern" : {
                                                                        "_fts" : "text",
                                                                        "_ftsx" : 1
                                                                },
                                                                "indexName" : "body_text",
                                                                "isMultiKey" : true,
                                                                "isUnique" : false,
                                                                "isSparse" : false,
                                                                "isPartial" : false,
                                                                "indexVersion" : 1,
                                                                "direction" : "backward",
                                                                "indexBounds" : {
 
                                                                },
                                                                "keysExamined" : 6226,
                                                                "dupsTested" : 6226,
                                                                "dupsDropped" : 0,
                                                                "seenInvalidated" : 0
                                                        }
                                                }
                                        }
                                }
                        }
                }
        },

I don't think we are looking at the same dataset; can you double-check whether there are any other steps required to reproduce this issue?

In addition, please answer the following questions.

  1. Which storage engine are you using, WiredTiger or MMAPv1?
  2. Can you please post the output of db.messages.stats()?
  3. Can you please post the output of db.collection.getIndexes()?
  4. Can you post the document that was inserted 4 million times?
  5. Did you set any options during the mongorestore?
  6. Can you post the output of the following?

    db.messages.explain("executionStats").find({$text:{$search:"terribly"}},{"body":1}).limit(2)
    

Quickly, regarding threading, please review our documentation on concurrency.

Thank you,
Thomas

Comment by Saleem [ 09/Feb/16 ]

One thing I noticed is that while inserting documents in a loop, my system usage is next to nothing. If you look at mongodb-threads.png, mongod is only using one core and the rest are pretty much sitting idle.

Surprisingly, there isn't much pressure on I/O either. I believe that when inserting documents, MongoDB must be building the indexes as well.

No judgement, but is mongod a single-threaded app?

Comment by Saleem [ 09/Feb/16 ]

Thank you, Ramón, for the quick reply. I'm sorry, I mentioned the wrong version number; please read it as 3.2.0.

Please see this thread:

http://stackoverflow.com/questions/35236072/indexed-full-text-search-count-in-mongodb-is-exremely-slow-and-what-is-text/35236718

Comment by Saleem [ 09/Feb/16 ]

The attached zip file includes:

1) dataset.txt
instructions to download and build the sample dataset.

2) serverBuildInfo.txt
output of db.serverBuildInfo() on the mongod host.

3) systemInfo.txt
mongod host information (CPU, disk, RAM, etc.)

Comment by Ramon Fernandez Marina [ 09/Feb/16 ]

Hi sixthsense, unfortunately there's not enough information in your report to determine if this behavior is expected or if you're hitting a bug. Please provide more details such as:

  • Server information: RAM, CPU, storage type...
  • The output of db.serverBuildInfo(); you mention 3.2.1, but the explain() output says 3.2.0
  • Full logs for this node during a slow query, preferably after increasing the log level to 1

If you're able to share your dataset with us so we can investigate please let me know, and I'll provide a private upload portal. Otherwise it will be helpful to see a set of sample documents so we can attempt to reproduce this behavior on our end.

Thanks,
Ramón.

Comment by Saleem [ 09/Feb/16 ]

I'd request that this ticket's priority be bumped up and that it be marked as a defect. It's a real production issue; please search Stack Overflow.

One behavior worth mentioning here:

When performing the above query, system memory usage quickly climbs to twice the actual data size. Once the query is done, the system reclaims the memory. This is a recurring pattern, so I believe mongod is not loading and keeping the index in memory for the sake of efficiency.

Furthermore, judging from the slowness, it appears the engine is not using the index at all, though explain() tells a different story. Just my two cents.

Comment by Saleem [ 09/Feb/16 ]

.explain("executionStats") result.

Generated at Thu Feb 08 04:00:39 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.