[SERVER-50386] [SBE] Investigate handling of deprecated BSON types Created: 19/Aug/20  Updated: 29/Oct/23  Resolved: 27/Jan/21

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Task Priority: Major - P3
Reporter: Eric Cox (Inactive) Assignee: Justin Seyster
Resolution: Fixed Votes: 0
Labels: qexec-team
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Sprint: Query 2020-12-14, Query 2020-12-28, Query 2021-01-11, Query 2021-01-25, Query 2021-02-08
Participants:

 Description   

When implementing SERVER-50369 we came across some legacy code that handles a deprecated BSON subtype ByteArrayDeprecated. This ticket represents work that should be done to make sure that we are correctly handling this subtype of BinData in sbe. We should re-examine value::getBSONBinData() for this subtype and make sure that the pointer returned correctly points to the underlying payload and not the extra int32 at the head of the buffer.  Another possible location to consider is deserializeTagVal() and serializeTagVal().



 Comments   
Comment by Githook User [ 27/Jan/21 ]

Author:

{'name': 'Justin Seyster', 'email': 'justin.seyster@mongodb.com', 'username': 'jseyster'}

Message: SERVER-50386 Testing and documentation for ByteArrayDeprecated subtype
Branch: master
https://github.com/mongodb/mongo/commit/657fd55617da405757b94bc9973df40394a18e5b

Comment by Justin Seyster [ 04/Dec/20 ]

To summarize the issue, BinData with the "ByteArrayDeprecated" has four redundant bytes (to store the length of the data) where the binary data would normally start, and the sbe::value::getBSONBinData() function returns a pointer to those bytes, which may cause problems for clients that are expecting to see the binary data itself. I say these four bytes are "redundant" because the actual length of the BinData is stored in the BSON itself, and there is no BSON parsing code that actually uses the length value stored in a "BytesArrayDeprecated" BinData.

I audited the current uses of getBSONBinData() and getBSONBinDataSize(), and I believe they are all correct as written. There may be some callers who will want to ignore the four redundant length bytes, but in most cases, we want to keep them.

1. When used in convertToBsonObj, the redundant bytes are included in the data that gets sent to the BSONObj, which is the desired behavior to ensure that they also get saved in the new BSONObj.
https://github.com/mongodb/mongo/blob/23aa52a605cb55220d5366b1005391f075a2e7f0/src/mongo/db/exec/sbe/values/bson.cpp#L293-L299
https://github.com/mongodb/mongo/blob/23aa52a605cb55220d5366b1005391f075a2e7f0/src/mongo/db/exec/sbe/values/bson.cpp#L380-L386

2. In serializeTagVal() (discussed in the description, above), the redundant bits get saved to disk, which is the desired behavior, because we want to preserve their value even if it gets ignored in most cases.
https://github.com/mongodb/mongo/blob/23aa52a605cb55220d5366b1005391f075a2e7f0/src/mongo/db/exec/sbe/values/slot.cpp#L265-L271

3. writeValueToStream(), which can write a human-readable debug value of a BinData, includes its own special case code to ignore the redundant bytes and only print out the data itself. That's what we want, but perhaps we could move that special case handling into a value.h helper.
https://github.com/mongodb/mongo/blob/23aa52a605cb55220d5366b1005391f075a2e7f0/src/mongo/db/exec/sbe/values/value.cpp#L387-L394

4. sbe::value::hashValue() includes the redundant length bytes in its hash. That should be fine, because two deprecated BinData objects that store different values for their redundant length fields should compare as not equal to each other. This behavior may not be optimal, because those bytes eat up half of the maximum eight bytes that get used to compute the hash, but I don't think it's important enough to do anything about it.
https://github.com/mongodb/mongo/blob/23aa52a605cb55220d5366b1005391f075a2e7f0/src/mongo/db/exec/sbe/values/value.cpp#L604-L618

5. sbe::value::compareValue() includes the redundant bytes in its comparison, which is weird behavior, but it matches the behavior in woCompare(). We definitely want to match woCompare()'s behavior so that we can stay compatible with the legacy execution engine.
https://github.com/mongodb/mongo/blob/23aa52a605cb55220d5366b1005391f075a2e7f0/src/mongo/db/exec/sbe/values/value.cpp#L756-L768
https://github.com/mongodb/mongo/blob/23aa52a605cb55220d5366b1005391f075a2e7f0/src/mongo/bson/bsonelement.cpp#L411-L417

6. builtinBitTestPosition() will treat the four redundant bytes as part of the bitmask, but surprisingly, that's the same behavior as in the classic engine. That's arguably a bug, but I think our best bet is to just keep it as is in both engines:
https://github.com/mongodb/mongo/blob/23aa52a605cb55220d5366b1005391f075a2e7f0/src/mongo/db/exec/sbe/vm/vm.cpp#L1391-L1392
https://github.com/mongodb/mongo/blob/23aa52a605cb55220d5366b1005391f075a2e7f0/src/mongo/db/matcher/expression_leaf.cpp#L714

In the BSONElement class, there are actually two accessors for BinData, one that is oblivious to the redundant bytes (i.e., it treats them as if they are part of the actual data) and a "clean" accessor that returns the bytes after the redundant length field (i.e., the actual data) when the subtype is "BytesArrayDeprecated." We could mirror this approach in value.h by adding "clean" versions of sbe::value::getBSONBinData() and sbe::value::getBSONBinDataSize(). Except for writeValueToStream (#4 in the above list), all of the current users of getBSONBinData() will continue to use the existing version. The writeValueToStream function can be adapted to use the new versions so that it has less special-case logic. I'll whip something up and upload a code review for it.

Sorry for the disseration! I wanted to get all my thoughts on this down so that I don't forget them.

Generated at Thu Feb 08 05:22:31 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.