[SERVER-13272] Text search should automatically sort by textScore; currently results are scrambled Created: 19/Mar/14 Updated: 10/Dec/14 Resolved: 19/Mar/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 2.6.0-rc1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | William Cross | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Operating System: | ALL |
| Steps To Reproduce: |
|
| Participants: |
| Description |
|
Currently, a text search does not sort by textScore, or by $natural, or by anything that I can discern. This is a problem because people who put in multiple words will nearly always want to have the best match come first. I suggest we sort by textScore by default. Note from the example in steps to reproduce that a .find( { $text : { $search: "string" } } ) returns documents that are neither in natural order nor in score order. |
| Comments |
| Comment by William Cross [ 19/Mar/14 ] |
|
OK, I'm going to make one last argument, and then I'm going to defer to you guys who have (clearly) been thinking about these things a lot longer than I have. I think (postgres aside) that text searching is different enough from our core product that we need to think of it as a different product, and one inherently more akin to a search engine than a database in terms of functionality. For a standard query, a user might be looking at several fields, and they might want to order by any of them, or they might be indifferent. It makes sense to give them the first result we find if they decide to .limit(1), for example, and we shouldn't sort it. As for your example from Conor Cunningham, I totally agree with him. If you want ordered results, order them. Otherwise, if you assume they will arrive ordered, you get what you deserve. For a text query, though, there's an inherent presumption that you're looking for relevance. Someone looking for a blog post on "cats mice", or a resume that matches "python mongodb", or a product relating to "wet dog smell" would want it to behave like a search engine. And we're already building the system based on different principles than apply during our standard queries. We don't require all fields to match, for example, and we bother with these relevancy rankings that aren't a part of a standard query. If we had to use $or to get back documents that match the text field in different ways, I'd expect all documents that match to get returned in no particular order, but we don't. We're offering a new interface, and one that already breaks the database rules. With $text (and the $search flag!), we're not offering a database, we're offering a search engine, and users are going to expect it to behave like a search engine. But if I'm thinking about this wrong, well, like I said, I've undoubtedly been thinking about this a lot less than you have, and I'm going to defer to you. |
| Comment by J Rassi [ 19/Mar/14 ] |
The text command offers the results already scored and sorted (and it's possible that new search functionality will go into the text command if it makes for a clumsy fit into the find() syntax). Low latency results are extremely important to almost all users.
That's sensible for Solr, a search engine. Contrast that to PostgreSQL, a database with a full-text search feature, for which an explicit ORDER BY clause is required to sort by rank. I would absolutely agree with your position if we were designing a different product. See also this Conor Cunningham blog post on a related topic: <http://blogs.msdn.com/b/conor_cunningham_msft/archive/2008/08/27/no-seatbelt-expecting-order-without-order-by.aspx>. |
| Comment by William Cross [ 19/Mar/14 ] |
|
Oh, also, C/o shannon.bradshaw@10gen.com, Solr search: https://wiki.apache.org/solr/SolrRelevancyFAQ "If no other sort order is specified, the default is by relevancy score." |
| Comment by William Cross [ 19/Mar/14 ] |
|
OK, so clearly this was written to spec. That said, the spec was wrong to have been written that way. Obviously, that's an opinion, so here's my argument. People who are using text search are probably not going to have a use case where they want one document with a text field that matches at least one of their words; they will want to work with the best match (or the best few). Imagine if (to date myself) AltaVista returned results that weren't, in some sense, ordered for me. It would never have become king of search engines for Google to trounce. The bottom line is, people use text search for different use cases than they use a database for, and our defaults should accommodate this. They do a text search because they want one (or a handful of) good match(es). The default behavior should be to order the results by ranking. If we suspect there's a use case out there where someone will prefer speed to a better match, we might want a $noSort flag, ugly as that may be. |
| Comment by J Rassi [ 19/Mar/14 ] |
For a find() without a sort(), the order of results returned by the cursor is undefined and may change between runs (with the exception of query predicates containing $near, which is being considered for deprecation for this reason). This allows the query engine to choose the most efficient plan to resolve the query, and the indexes that are assigned to the winning plan will change as the distribution of data in the collection changes.
The semantics of the find() query predicate are only to specify the subset of documents from the collection included in the results, not to specify the order of the results. The semantics of sort() are to specify an ordering of the results. The docs for the $text query operator make clear to users that an explicit sort is required to guarantee score order. |
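
For anyone landing on this ticket, the explicit sort described above can be sketched roughly as follows. This is a minimal illustration, not taken from the ticket: the collection name `articles`, the field name `score`, and the search string are assumptions, and the collection is assumed to already have a text index. The `$meta: "textScore"` expression is used both to project the score and to sort on it.

```javascript
// Hypothetical example: `articles`, `score`, and the search string are
// illustrative names, not taken from this ticket.
var query = { $text: { $search: "cats mice" } };

// The same { $meta: "textScore" } document serves as the projection
// (to expose the score) and as the sort specification (to order by it).
var scoreMeta = { score: { $meta: "textScore" } };

// `db` only exists inside the mongo shell; elsewhere this block just
// builds the query documents and skips the call.
if (typeof db !== "undefined") {
  db.articles.find(query, scoreMeta)
    .sort(scoreMeta)   // explicit sort: highest text score first
    .limit(5)
    .forEach(printjson);
}
```

Dropping the `.sort(scoreMeta)` call leaves the result order undefined, which is the behavior reported in the description.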