[SERVER-54485] FTS indexes pass c-strings to tokenizer so they ignore data in strings past the first nul byte Created: 12/Feb/21  Updated: 03/Jun/21  Resolved: 03/Jun/21

Status: Closed
Project: Core Server
Component/s: Index Maintenance, Querying
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Mathias Stearn Assignee: Mickey Winters
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Sprint: Query Execution 2021-03-08, Query Execution 2021-03-22, Query Execution 2021-04-05, Query Execution 2021-04-19, Query Execution 2021-05-03, Query Execution 2021-05-17, Query Execution 2021-06-14
Participants:

 Description   

"Luckily" the same bug exists in both indexing and querying. Unfortunately, because it is in indexing, it will require a version bump to fix and we would need to keep the old code around.

 

This also adds a slight overhead, since it computes strlen rather than just using the size already stored in the BSONElement. That cost is trivial relative to the actual tokenization, though, so I don't think it is sufficient motivation on its own to fix this.
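
As a standalone sketch of that pattern (not the actual server code; tokenizeCString and tokenizeWithLength are hypothetical names used only for illustration), handing the tokenizer a bare c-string both pays for a strlen and silently loses everything past the first NUL, while a (pointer, length) handoff like the size a BSONElement already stores does not:

    // Minimal sketch, not the actual server code: tokenizeCString and
    // tokenizeWithLength are hypothetical names used only for illustration.
    #include <cstring>
    #include <iostream>
    #include <string>

    // C-string entry point: the length must be recomputed with strlen(),
    // which stops at the first NUL byte, so " bar" is never seen.
    std::string tokenizeCString(const char* field) {
        return std::string(field, std::strlen(field));  // "foo " only
    }

    // Size-aware entry point: a BSONElement already stores the string
    // length, so an embedded NUL is just another byte of data.
    std::string tokenizeWithLength(const char* field, std::size_t len) {
        return std::string(field, len);  // all 9 bytes, including " bar"
    }

    int main() {
        const char raw[] = "foo \0 bar";          // embedded NUL
        const std::size_t len = sizeof(raw) - 1;  // 9: the size BSON would carry

        std::cout << tokenizeCString(raw).size() << "\n";          // 4
        std::cout << tokenizeWithLength(raw, len).size() << "\n";  // 9
    }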



 Comments   
Comment by Mickey Winters [ 03/Jun/21 ]

Thanks for flagging this! After investigation, this does not appear to cause data corruption or server crashes. There are two main correctness issues:

  1. If you insert a document like {msg: "foo \0 bar"} into a collection with a text index on msg, then tokens after the null byte are not indexed, so an FTS search for "bar" will not return this document.
  2. If you insert a document like {msg: "foo"} and perform a search like {$text: {$search: "\0 foo"}}, the matcher will not add tokens after the null byte to the search, so the inserted document will not be returned by this FTS query.

If you perform a normal find query that would otherwise match the first document, you will get back {msg: "foo \0 bar"}, because the problem is with indexing/querying, not with storage.

This problem occurs because:

  1. The Unicode strings used by UnicodeFTSTokenizer rely on functions that assume these strings are null terminated and will add a null terminator if they don't find one.
  2. Some FTS code passes along C strings without also passing along their sizes (see the sketch after this list).
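
Roughly, the combination looks like the following standalone sketch (NullTerminatedBuffer and tokenize are hypothetical stand-ins, not the real UnicodeFTSTokenizer code): once only a const char* crosses an interface, the true length can no longer be recovered, and a helper that re-terminates at the first NUL truncates the document before tokenization.

    // Minimal sketch of the two points above; NullTerminatedBuffer and
    // tokenize are hypothetical stand-ins, not the real FTS classes.
    #include <iostream>
    #include <string>
    #include <vector>

    // Stand-in for a string helper that assumes NUL-terminated input:
    // constructing from a const char* copies only up to the first NUL,
    // no matter how long the data really is.
    struct NullTerminatedBuffer {
        explicit NullTerminatedBuffer(const char* s) : data(s) {}
        std::string data;
    };

    // Hypothetical intermediate FTS call: only a const char* crosses this
    // boundary, so the document's true length can no longer be recovered.
    std::vector<std::string> tokenize(const char* text) {
        NullTerminatedBuffer buf(text);   // already truncated to "foo "
        std::vector<std::string> tokens;
        std::string word;
        for (char c : buf.data) {
            if (c == ' ') {
                if (!word.empty()) tokens.push_back(word);
                word.clear();
            } else {
                word += c;
            }
        }
        if (!word.empty()) tokens.push_back(word);
        return tokens;
    }

    int main() {
        std::string msg("foo \0 bar", 9);     // what the document really stores
        auto tokens = tokenize(msg.c_str());  // size dropped at the boundary
        for (const auto& t : tokens)
            std::cout << t << "\n";           // prints only "foo"
    }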

For a user to encounter this bug, they would have to be building FTS indexes on unclean data that may contain null bytes. Although this bug bothers me, at least for now there doesn't seem to be enough justification to warrant the time required to fix it.
