[SERVER-25173] $substrBytes does not handle negative parameters correctly. Created: 20/Jul/16  Updated: 10/May/18  Resolved: 22/Mar/18

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 3.1.4
Fix Version/s: 3.7.4

Type: Bug Priority: Major - P3
Reporter: Benjamin Murphy Assignee: Ian Boros
Resolution: Done Votes: 2
Labels: asya
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to DOCS-8550 Document $substrBytes/$substrCP Closed
Backwards Compatibility: Minor Change
Operating System: ALL
Sprint: Query 2018-03-26
Participants:

 Description   

$substrBytes requires two arguments: starting index, and length. Parsing and validation occurs here.

Both are required to be integral, however, we static cast the values to a size_t without checking to ensure they're nonnegative. At the very least, we should give an informative error message if they are negative. It is also conceivable that we want a negative value to mean "return the rest of the string", as in some other operators, in which case we'll need to add special handling.



 Comments   
Comment by Githook User [ 22/Mar/18 ]

Author:

{'email': 'ian.boros@10gen.com', 'name': 'Ian Boros'}

Message: SERVER-25173 $substrBytes now checks for negative values
Branch: master
https://github.com/mongodb/mongo/commit/8ea933cf533740a33477cd48155f3590ceb3e3fe

Comment by Asya Kamsky [ 01/Feb/18 ]

Note for docs: currently the $substr (alias) is documented as:

https://docs.mongodb.com/manual/reference/operator/aggregation/substr/#behavior

If <start> is a negative number, $substr returns an empty string "".

If <length> is a negative number, $substr returns a substring that starts at the specified index and includes the rest of the string.

Once this bug is fixed, appropriate documentation updates should be made.

To treat all negative numbers as errors would be backwards breaking for Bytes but it's already an error for CP so that would be consistent.

To treat negative position as positions from the end would be consistent with other substring functions, but for length it does not seem consistent (it may be considered as length from the end?).

Comment by Max Hirschhorn [ 31/Aug/16 ]

The "Invalid range, ending index is in the middle of a UTF-8 character" message has been able to be erroneously returned for certain negative lengths and non-ASCII strings since it was introduced in bc45142 as part of the changes from SERVER-6801. The issue is that the expression lower + length is triggering unsigned integer overflow.

From the C99 standard:

A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type.

Since -1 is equivalent to (2^64 - 1) mod 2^64, we have the following behavior when checking the value of str[lower + length].

$ python
>>> "cafétéria"
'caf\xc3\xa9t\xc3\xa9ria'
>>> "cafétéria"[6 - 1]
't'
>>> "cafétéria"[5 - 1]
'\xa9'

Comment by Benjamin Murphy [ 20/Jul/16 ]

kay.kim mentioned this to me--her example:

MongoDB Enterprise > db.foo.aggregate( { $project: { x: { $substrBytes: [  "cafétéria", 6, -1 ] } } } )
{ "_id" : ObjectId("578efbf41c2742205a3e1710"), "x" : "éria" }
 
MongoDB Enterprise > db.foo.aggregate( { $project: { x: { $substrBytes: [  "cafétéria", 5, -1 ] } } } )
assert: command failed: {
	"ok" : 0,
	"errmsg" : "$substrBytes:  Invalid range, ending index is in the middle of a UTF-8 character.",
	"code" : 28657
} : aggregate failed

Generated at Thu Feb 08 04:08:28 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.