[SERVER-19936] Performance pass on unicode-aware text processing logic (text index v3) Created: 13/Aug/15  Updated: 17/Oct/17  Resolved: 15/Mar/16

Status: Closed
Project: Core Server
Component/s: Performance, Querying, Text Search
Affects Version/s: None
Fix Version/s: 3.2.5, 3.3.3

Type: Improvement Priority: Major - P3
Reporter: David Daly Assignee: Mathias Stearn
Resolution: Done Votes: 1
Labels: code-and-test
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File phrase_query_text_index_version_3.png    
Issue Links:
Duplicate
duplicates SERVER-19944 Apply basic perf optimizations to tex... Closed
Related
related to SERVER-20613 Performance Regression on Mongo-perf ... Closed
related to SERVER-21690 Text Search - Performance Regression ... Closed
is related to SERVER-30870 Add FTS Fast Byte Vector Optimization... Closed
Backwards Compatibility: Fully Compatible
Backport Completed:
Sprint: Integration F (02/01/16), Integration 10 (02/22/16), Integration 11 (03/14/16), Integration 12 (04/04/16)
Participants:

 Description   

There was a performance regression from the introduction of text index version 3, visible in the Mongo-perf Queries.Text tests. We should make a pass through the code to try to improve performance.

Initial Results showing regression.



 Comments   
Comment by Githook User [ 15/Mar/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'redbeard0531@gmail.com'}

Message: SERVER-19936 Enterprise module fix for changes to unicode::String API

(cherry picked from commit d89cf868a3987caa0ceeac576173f3fdd90f00ca)
Branch: v3.2
https://github.com/10gen/mongo-enterprise-modules/commit/9b9746e90dea7a7d9c77d90a829466ab1b1d2d7f

Comment by Githook User [ 15/Mar/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-23088 fix boost's libstdcpp detection under clang

Fixes compilation errors introduced by SERVER-19936 when compiling with clang
on a system without boost headers installed.

(cherry picked from commit 4b6952e97e74d8c7bd16ebfc5fe6e412ccf0f48c)
Branch: v3.2
https://github.com/mongodb/mongo/commit/9f68e62265bcc15307edd32aca8bd278ddc570f3

Comment by Githook User [ 15/Mar/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-19936 use StringMap in FTSSpec::_scoreStringV2

(cherry picked from commit 657288e29880c0c8518452880715d57effdbeb89)
Branch: v3.2
https://github.com/mongodb/mongo/commit/5dcafe240222e32e89d43479cca23866e24d3c64

Comment by Githook User [ 15/Mar/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-19936 Optimize UnicodeFTSTokenizer

(cherry picked from commit 4b10e50494175df2b1ed8fc4f8e7f8c6ca6f06d5)
Branch: v3.2
https://github.com/mongodb/mongo/commit/d86c3cfc8633e602df15c90bdf3f2c3aa7f4819d

Comment by Githook User [ 15/Mar/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-19936 Inline libstemmer utilities

(cherry picked from commit 72aab77138463d96494389bc538c13395c34a2d3)
Branch: v3.2
https://github.com/mongodb/mongo/commit/ccdce56aa2f9b40ab2ffaf53c5dfef0786164a1a

Comment by Githook User [ 15/Mar/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-19936 Rename unicode::string::prepForSubstrMatch and make easier to use

(cherry picked from commit 35f4f2f5a58e5dc90b583e8bc6089eaa2d83e065)
Branch: v3.2
https://github.com/mongodb/mongo/commit/50698e205ad1e7889279140fd0cb5e51ae9fefea

Comment by Githook User [ 15/Mar/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-19936 Vector-optimize FTS phrase matches

Now handles up to 16 bytes of ASCII at a time if SSE2 is enabled.

(cherry picked from commit 67eee08bb606537df7417670d423c0527dd6221f)
Branch: v3.2
https://github.com/mongodb/mongo/commit/a00caa4dbe0152e821bdc628c6c6dad9fa824461
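
For readers unfamiliar with the technique named in this commit message, the following is a minimal sketch of processing 16 ASCII bytes at a time with SSE2. It is not the code from the commit: the function name lowerAsciiPrefix and its interface are hypothetical, and it only lowercases a leading ASCII run, falling back to a scalar loop when SSE2 is unavailable or fewer than 16 bytes remain.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper (not MongoDB's API): lowercases the leading run of
// ASCII bytes in `in`, appends the result to `out`, and returns how many
// bytes were consumed. It stops at the first non-ASCII byte.
inline std::size_t lowerAsciiPrefix(const std::string& in, std::string& out) {
    std::size_t i = 0;
#ifdef SKETCH_HAVE_SSE2
    const __m128i beforeA = _mm_set1_epi8('A' - 1);
    const __m128i afterZ = _mm_set1_epi8('Z' + 1);
    const __m128i caseBit = _mm_set1_epi8(0x20);
    for (; i + 16 <= in.size(); i += 16) {
        const __m128i chunk =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(in.data() + i));
        // movemask gathers the high bit of every byte; any set bit means the
        // chunk contains a non-ASCII byte, so leave it to the scalar loop.
        if (_mm_movemask_epi8(chunk) != 0)
            break;
        // All bytes are in 0..127 here, so signed comparisons are safe.
        const __m128i isUpper = _mm_and_si128(_mm_cmpgt_epi8(chunk, beforeA),
                                              _mm_cmplt_epi8(chunk, afterZ));
        const __m128i lowered =
            _mm_or_si128(chunk, _mm_and_si128(isUpper, caseBit));
        char buf[16];
        _mm_storeu_si128(reinterpret_cast<__m128i*>(buf), lowered);
        out.append(buf, 16);
    }
#endif
    // Scalar path: handles the tail and the bytes before a non-ASCII byte.
    for (; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c >= 0x80)
            break;
        out.push_back(static_cast<char>(c >= 'A' && c <= 'Z' ? (c | 0x20) : c));
    }
    assert(i <= in.size());
    return i;
}
```

A caller would hand the remaining bytes (starting at the returned offset) to a slower, fully Unicode-aware path. The benefit depends on how much of the indexed text is ASCII.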

Comment by Githook User [ 15/Mar/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-19936 Optimize FTS v3 phrase matching

Major changes:

  • Use Boyer-Moore algorithm for searching rather than std::search
  • All strings are kept in UTF8 rather than going to UTF32.
  • Case folding and diacritic removal are done in a single pass.
  • Optimize case folding and diacritic removal for ASCII codepoints.
  • Combine functionality of codepointIsDiacritic() into
    codepointRemoveDiacritics()

(cherry picked from commit 6c3157f126bb44ab275325e85de7abee5ce9ad6d)
Branch: v3.2
https://github.com/mongodb/mongo/commit/aea82b1e74549014bf14632db6d45eb171349ee5
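
As an illustration of the first two bullets above, here is a minimal sketch of a Boyer-Moore substring search run directly on UTF-8 bytes. It assumes C++17 and uses std::boyer_moore_searcher as a stand-in for the Boost searcher mentioned elsewhere in this ticket. The helpers foldAsciiOnly and phraseOccurs are hypothetical names, the folding covers ASCII only, and word-boundary checks are omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>

// Hypothetical helper: lowercases ASCII letters and copies every other byte
// unchanged, so the string stays valid UTF-8 and is never widened to UTF-32.
inline std::string foldAsciiOnly(const std::string& utf8) {
    std::string out;
    out.reserve(utf8.size());
    for (unsigned char c : utf8) {
        out.push_back(static_cast<char>(c >= 'A' && c <= 'Z' ? (c | 0x20) : c));
    }
    return out;
}

// Hypothetical helper: reports whether `phrase` occurs in `haystack`,
// ignoring ASCII case. The search operates on bytes, which is sound for
// UTF-8 because one encoded code point never appears inside another.
inline bool phraseOccurs(const std::string& haystack, const std::string& phrase) {
    if (phrase.empty())
        return true;
    const std::string foldedPhrase = foldAsciiOnly(phrase);
    const std::string foldedHaystack = foldAsciiOnly(haystack);
    // The searcher precomputes its skip tables once from the pattern.
    const std::boyer_moore_searcher<std::string::const_iterator> searcher(
        foldedPhrase.begin(), foldedPhrase.end());
    return std::search(foldedHaystack.begin(), foldedHaystack.end(), searcher) !=
        foldedHaystack.end();
}
```

For example, phraseOccurs("The Gigantic Hound ran", "gigantic hound") is true, while phraseOccurs("gigantic dog hound", "gigantic hound") is false. Boyer-Moore tends to pay off for longer patterns, where its skip tables let it avoid examining most haystack bytes.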

Comment by Githook User [ 15/Mar/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-19936 Add boost/tr1/detail/config_all.hpp to our copy of boost

Needed for boost::boyer_moore_searcher.

(cherry picked from commit 4a35c7184e188354793f16d27e2330b3b5ce7f8f)
Branch: v3.2
https://github.com/mongodb/mongo/commit/76a178438e4e1fe45f1f255f301e4ea7cf245161

Comment by Githook User [ 14/Mar/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-23088 fix boost's libstdcpp detection under clang

Fixes compilation errors introduced by SERVER-19936 when compiling with clang
on a system without boost headers installed.
Branch: master
https://github.com/mongodb/mongo/commit/4b6952e97e74d8c7bd16ebfc5fe6e412ccf0f48c

Comment by Githook User [ 14/Mar/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-23088 fix boost's libstdcpp detection under clang

Fixes compilation errors introduced by SERVER-19936 when compiling with clang
on a system without boost headers installed.
Branch: master
https://github.com/mongodb/mongo/commit/3071389ed3476eeb1e6730bbc1f841addf54b383

Comment by Githook User [ 11/Mar/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'redbeard0531@gmail.com'}

Message: SERVER-19936 Enterprise module fix for changes to unicode::String API
Branch: master
https://github.com/10gen/mongo-enterprise-modules/commit/d89cf868a3987caa0ceeac576173f3fdd90f00ca

Comment by Githook User [ 11/Mar/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-19936 use StringMap in FTSSpec::_scoreStringV2
Branch: master
https://github.com/mongodb/mongo/commit/657288e29880c0c8518452880715d57effdbeb89

Comment by Githook User [ 11/Mar/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-19936 Optimize UnicodeFTSTokenizer
Branch: master
https://github.com/mongodb/mongo/commit/4b10e50494175df2b1ed8fc4f8e7f8c6ca6f06d5

Comment by Githook User [ 11/Mar/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-19936 Inline libstemmer utilities
Branch: master
https://github.com/mongodb/mongo/commit/72aab77138463d96494389bc538c13395c34a2d3

Comment by Githook User [ 11/Mar/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-19936 Rename unicode::string::prepForSubstrMatch and make easier to use
Branch: master
https://github.com/mongodb/mongo/commit/35f4f2f5a58e5dc90b583e8bc6089eaa2d83e065

Comment by Githook User [ 11/Mar/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-19936 Vector-optimize FTS phrase matches

Now handles up to 16 bytes of ASCII at a time if SSE2 is enabled.
Branch: master
https://github.com/mongodb/mongo/commit/67eee08bb606537df7417670d423c0527dd6221f

Comment by Githook User [ 11/Mar/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-19936 Optimize FTS v3 phrase matching

Major changes:

  • Use Boyer-Moore algorithm for searching rather than std::search
  • All strings are kept in UTF8 rather than going to UTF32.
  • Case folding and diacritic removal are done in a single pass.
  • Optimize case folding and diacritic removal for ASCII codepoints.
  • Combine functionality of codepointIsDiacritic() into
    codepointRemoveDiacritics()
Branch: master
https://github.com/mongodb/mongo/commit/6c3157f126bb44ab275325e85de7abee5ce9ad6d

Comment by Githook User [ 11/Mar/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-19936 Add boost/tr1/detail/config_all.hpp to our copy of boost

Needed for boost::boyer_moore_searcher.
Branch: master
https://github.com/mongodb/mongo/commit/4a35c7184e188354793f16d27e2330b3b5ce7f8f

Comment by J Rassi [ 03/Dec/15 ]

Further investigation shows that the magnitude of the performance difference of text index version 2 versus 3 is as large as ~20x for certain workloads.

With 3.2.0-rc6 configured with the WiredTiger storage engine, the phrase search operation count({$text: {$search: "\"gigantic hound\""}}) against the data set attached to SERVER-21690 has an average latency of 31.6 seconds with text index version 3 on my machine (in single-threaded tests), versus 1.6 seconds with text index version 2.

I've attached to this ticket a dot graph generated with Linux perf of the phrase query workload with text index version 3. Interesting observations:

  • 97% of collected samples are in calls made from UnicodeFTSPhraseMatcher::phraseMatches().
  • Of the UnicodeFTSPhraseMatcher::phraseMatches() samples, 42% of them were spent in unicode::String::removeDiacritics() or its children.
  • Of the UnicodeFTSPhraseMatcher::phraseMatches() samples, 28% of them were spent in unicode::codepointToLower() or its children.
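
To make the profile above concrete: diacritic removal and lowercasing are listed as separate hot spots. A rough sketch of doing both in one pass over UTF-8, with a fast path for ASCII bytes, might look like the following. The names baseLetterFor and foldAndStrip are hypothetical, and the mapping is a deliberately tiny assumption covering only accented A and E in the Latin-1 range. It is not the server's Unicode table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical, deliberately tiny mapping: returns the lowercase base letter
// for a few accented Latin-1 code points, or 0 if the code point is not
// covered by this sketch.
inline char baseLetterFor(std::uint32_t cp) {
    if ((cp >= 0xC0 && cp <= 0xC5) || (cp >= 0xE0 && cp <= 0xE5))
        return 'a';
    if ((cp >= 0xC8 && cp <= 0xCB) || (cp >= 0xE8 && cp <= 0xEB))
        return 'e';
    return 0;
}

// Hypothetical helper: case-folds and strips diacritics in a single pass,
// reading and writing UTF-8 without an intermediate UTF-32 buffer.
inline std::string foldAndStrip(const std::string& utf8) {
    std::string out;
    out.reserve(utf8.size());
    for (std::size_t i = 0; i < utf8.size();) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        if (b0 < 0x80) {
            // ASCII fast path: no decoding and no table lookup.
            out.push_back(static_cast<char>(b0 >= 'A' && b0 <= 'Z' ? (b0 | 0x20) : b0));
            ++i;
            continue;
        }
        if ((b0 & 0xE0) == 0xC0 && i + 1 < utf8.size()) {
            const unsigned char b1 = static_cast<unsigned char>(utf8[i + 1]);
            if ((b1 & 0xC0) == 0x80) {
                // Decode a two-byte sequence into its code point.
                const std::uint32_t cp = ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
                if (const char base = baseLetterFor(cp)) {
                    out.push_back(base);  // folded and stripped in one step
                } else {
                    out.append(utf8, i, 2);  // not covered: copy unchanged
                }
                i += 2;
                continue;
            }
        }
        // Longer or malformed sequences: copy the byte unchanged.
        out.push_back(utf8[i]);
        ++i;
    }
    return out;
}
```

With this sketch, foldAndStrip("CAF\xC3\x89") yields "cafe". Code points outside the small table, such as U+00F1, pass through untouched, so a real implementation would need a complete mapping.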

Moving ticket back into Needs Triage.

Comment by David Daly [ 01/Oct/15 ]

Note: performance targets will need to be updated after this goes in. See SERVER-20613.

Comment by David Daly [ 17/Aug/15 ]

Re-opening as SERVER-19944 did not buy back a significant portion of the lost performance originally reported on this ticket.

Here's the perf data for the commit after SERVER-19944 (see the yellow dot).

Comment by Daniel Pasette (Inactive) [ 14/Aug/15 ]

Duplicate of SERVER-19944.

Comment by J Rassi [ 13/Aug/15 ]

Thanks. Triaged to Planning Bucket A and unassigned.

Comment by David Daly [ 13/Aug/15 ]

rassi@10gen.com Moved and assigned to you for now. I put it in Needs Triage for now.

Comment by Adam Chelminski (Inactive) [ 13/Aug/15 ]

I'm currently adding some simple optimizations that should improve the performance slightly, but as Rassi said, this regression was expected.

Comment by J Rassi [ 13/Aug/15 ]

I briefly discussed with Dan.

Yes, this slowdown is expected and of a reasonable magnitude. We should do a performance pass on the new code to make up for some of the regression, but we do not think this should be scheduled for 3.1.x and are happy to ship with this feature performing as-is. David, could you file a SERVER ticket (or convert this to a SERVER ticket) to track this work, and we'll consider it for 3.3.x or beyond?

Comment by David Daly [ 13/Aug/15 ]

mpobrien Seems likely. I have that commit and its neighbors scheduled to run.

adam.chelminski rassi@10gen.com mark.benvenuto
I'm guessing that, since this makes version 3 text indexes the default, this is an expected slowdown. Could you take a look and see whether the existence of a slowdown and its magnitude seem reasonable?

Comment by Michael O'Brien [ 13/Aug/15 ]

I'm guessing it's probably this commit: https://evergreen.mongodb.com/task/performance_linux_wt_standalone_query_92eac3b57d8beaf063fced8839cd870f97826bb7_15_08_11_20_58_14

Generated at Thu Feb 08 03:52:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.