[SERVER-39628] Strange behaviour of spatial queries Created: 15/Feb/19  Updated: 24/Jun/19  Resolved: 24/Jun/19

Status: Closed
Project: Core Server
Component/s: Index Maintenance
Affects Version/s: 4.0.6
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Nikolaos Koutroumanis
Assignee: Eric Sedor
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: diagnostic-data.zip (Zip Archive), explain-real.txt (Text File), explain-syntetic.txt (Text File), mongod.log (File)
Operating System: ALL
Participants:

 Description   

I have two MongoDB databases (named real and synthetic) where each one has one collection (named geoPoints).

The two collections have the same number of documents, each with 4 fields (one of the fields is a GeoJSON object named location, of type Point with longitude and latitude). They do not contain the same documents, and both have 2dsphere indexes.

A spatial query on the collection of the real database completes in 9 seconds (returning approximately 2M documents). A query with a larger spatial extent on the collection of the synthetic database returns fewer documents (about 400k) but takes 5144s. The difference in execution time is huge. Is it reasonable for a spatial query that returns fewer documents to take longer than one that returns more?

The execution times mentioned above were measured after running each query 3 times.

I have attached the queryPlanner information for the two queries.



 Comments   
Comment by Eric Sedor [ 24/Jun/19 ]

Hi nickkoutr, our apologies. Unfortunately we have not been able to identify an issue with the information available. Please reach out if you are able to assist with a clear set of reproduction steps.

Comment by Nikolaos Koutroumanis [ 21/Apr/19 ]

I uploaded the synthetic dataset. I'm still working on generating the trajectory data that resembles the real data. I'll let you know when I finish it. Feel free to ask anything about the uploaded data.

 

Comment by Eric Sedor [ 17/Apr/19 ]

Thanks nickkoutr; we appreciate it. Again, you can use this secure upload portal when you are ready.

Comment by Nikolaos Koutroumanis [ 17/Apr/19 ]

Hello, I'm sorry for the delayed response. These days I am working on generating a dataset with trajectories on the road network (so that it resembles the real dataset). By the weekend, I'll send you both the synthetic dataset and the version of the real dataset.

Comment by Eric Sedor [ 16/Apr/19 ]

Hello, unfortunately we would still need to see at least the synthetic data set to keep investigating this. Is that possible?

Comment by Eric Sedor [ 13/Mar/19 ]

We understand nickkoutr. If you are able to upload just the synthetic data, we may be able to investigate just with that.

Comment by Nikolaos Koutroumanis [ 12/Mar/19 ]

I'm sorry, but I cannot provide the real dataset you requested because the data is private.

Comment by Eric Sedor [ 05/Mar/19 ]

nickkoutr,

It's possible the real database is staying in cache more than the synthetic database, but multiple query executions should have ruled that out.

To investigate further, we'd like to compare these data sets ourselves.

I've created a secure upload portal for you. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Can you please mongodump both real.geoPoints and synthetic.geoPoints, archive (tar or gzip) them, and upload them to this portal?
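The dump-and-archive step requested above could look roughly like the following. This is only a sketch: it assumes mongod is reachable on the default local port (the reporter describes a local, non-clustered install), and the function name `dump_and_archive` plus the `dump` and `filename.tgz` arguments are placeholders rather than names from the ticket.

```shell
# Hypothetical helper: dump both collections into one directory and pack
# that directory into a single gzip-compressed tarball.
dump_and_archive() {
    out_dir="$1"   # placeholder, e.g. dump
    tarball="$2"   # placeholder, e.g. filename.tgz

    # One mongodump run per database; both write under the same output directory.
    mongodump --db=real --collection=geoPoints --out="$out_dir" || return 1
    mongodump --db=synthetic --collection=geoPoints --out="$out_dir" || return 1

    # Archive the whole dump directory for upload.
    tar -czf "$tarball" "$out_dir"
}
```

Calling `dump_and_archive dump filename.tgz` would leave a single `filename.tgz` ready for upload.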

If the data to upload exceeds 5GB it will be necessary to break it down. You can use the split command as follows as a workaround:

split -d -b 5300000000 filename.tgz part.

This command produces a series of part.XX files, where XX is a number, for upload; we'll stitch these back together on our side.
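Before uploading, it may be worth checking that the parts reassemble into the original archive. The sketch below assumes GNU split (for the -d numeric-suffix option) and a working directory with no other part.* files; `split_and_verify` is a hypothetical helper name.

```shell
# Hypothetical helper: split an archive into numbered parts, then confirm
# that concatenating the parts reproduces the original byte-for-byte.
split_and_verify() {
    archive="$1"       # e.g. filename.tgz
    chunk_bytes="$2"   # e.g. 5300000000

    split -d -b "$chunk_bytes" "$archive" part. || return 1

    # Numeric suffixes sort lexically, so the glob expands in part order.
    # cmp reads the concatenated stream from stdin ("-") and compares it
    # with the original; exit status 0 means the two are identical.
    cat part.* | cmp -s - "$archive"
}
```

The same `cat part.* > filename.tgz` concatenation is presumably how the parts would be stitched back together on the receiving side.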

Comment by Eric Sedor [ 04/Mar/19 ]

Thanks Nikolaos,

So far we do see that the query to the synthetic data set is yielding more often than the query to the real data set. This is shown by the following explain difference for synthetic:

"saveState" : 36562,
"restoreState" : 36562,

vs real:

"saveState" : 6259,
"restoreState" : 6259,

I am discussing this with the team.
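One way to pull these counters out of saved explain output for comparison is sketched below. It assumes the explain text contains entries in the `"saveState" : N` form quoted above; `max_save_state` is a hypothetical helper name, and taking the maximum across stages is an illustrative choice, not necessarily how the figures above were read.

```shell
# Hypothetical helper: print the largest "saveState" count found in an
# explain output file (the counter can appear once per plan stage).
max_save_state() {
    grep -o '"saveState" : [0-9]*' "$1" |
        awk '{ if ($3 > max) max = $3 } END { print max + 0 }'
}
```

Running it against each attached explain file gives two numbers that can be compared directly; a higher count means the query yielded more often.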

Comment by Nikolaos Koutroumanis [ 01/Mar/19 ]

Thank you Eric, I attached it.

Comment by Eric Sedor [ 26/Feb/19 ]

nickkoutr, sorry for the added step, but it looks like the attachments have been altered and the explain() against synthetic1 is no longer available to us. With apologies, can you please add it back?

Comment by Eric Sedor [ 26/Feb/19 ]

Thanks for the additional information and your patience so far nickkoutr. We are continuing to investigate.

Comment by Nikolaos Koutroumanis [ 20/Feb/19 ]

Yes, of course, I attached it.

Comment by Eric Sedor [ 20/Feb/19 ]

Sorry for the added step nickkoutr, I should have been clearer above: Can you also include the server log files to help us match up timestamps there?

Comment by Nikolaos Koutroumanis [ 20/Feb/19 ]

Thank you Eric Sedor,

The data has changed a little bit (in terms of the number of documents), but the behaviour remains.

MongoDB is installed locally on my computer; I'm not using a cluster.

The two databases, which are on the same machine, now each contain ~35 million documents.

I attach two (new) explain files (real and synthetic). The query on the synthetic dataset returns fewer documents than the query on the real dataset, yet requires more time to execute.

I also attach the directory you wanted. As you suggested, the last performed query is for the synthetic dataset (the slower one).

 

Comment by Eric Sedor [ 20/Feb/19 ]

Thanks nickkoutr, this does seem odd, especially given that the faster query on "real" scans 3229554 documents and index keys whereas the slower query on "synthetic1" scans only 580377.

Can you let us know if these databases are on the same cluster?

And can you please archive (tar or zip) the $dbpath/diagnostic.data directory for the Primary node of the "synthetic1" database and attach it to this ticket? Please be sure to execute the slower of the queries before doing so to ensure metrics are collected for a recent execution.

Thanks,
Eric
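The diagnostic.data archive requested above can be produced with tar. A minimal sketch, assuming the caller passes the actual dbpath of the mongod instance; `archive_diagnostic_data` is a hypothetical helper name and the output file name is a placeholder.

```shell
# Hypothetical helper: pack <dbpath>/diagnostic.data into a gzip tarball.
archive_diagnostic_data() {
    dbpath="$1"   # the mongod dbpath
    out="$2"      # placeholder, e.g. diagnostic-data.tgz

    if [ ! -d "$dbpath/diagnostic.data" ]; then
        echo "no diagnostic.data directory under $dbpath" >&2
        return 1
    fi

    # -C keeps paths inside the archive relative to the dbpath.
    tar -czf "$out" -C "$dbpath" diagnostic.data
}
```

As the request notes, the slower query should be run shortly before archiving so that the metrics cover a recent execution.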

Generated at Thu Feb 08 04:52:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.