DRIVERS-3031

Bit order is not accurately described by BSON binary vector spec

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor - P4
    • None
    • Component/s: BSON
    • Not Needed

      Summary

      The "BSON Binary Subtype 9 - Vector" specification describes an overall array packing format, as well as the format of three specific item type instances of this array format.

      The overall purpose of the spec is to describe a way to encode arrays of uniformly typed values as a sequence of bytes, but the lack of specificity around byte and bit terminology leaves the overall packing order unclear until we reach the examples. The example then directly contradicts the preceding prose.

      Regarding padding, the "number of bits to ignore" is also ambiguous in the text and is only cleared up by the example. It would be more accurate to say that the padding is the number of bits at a particular side of the byte, or a number of potential items to ignore. The difference is minimal at this point, but it will affect how future data types can be defined. (Alternatively, the padding could be specified as having a type-specific interpretation.)
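
      To make the ambiguity concrete, here is a minimal Python sketch of two readings of "number of bits to ignore" for a single packed byte. The function names are hypothetical and are not taken from the spec or any driver. The sketch assumes items are unpacked most significant bit first, as this ticket describes for packed_bit.

```python
def unpack_msb_first(data):
    """Expand bytes into 0/1 items, most significant bit first."""
    return [(byte >> (7 - i)) & 1 for byte in data for i in range(8)]


def ignore_trailing_items(data, padding):
    """Reading A: drop the last `padding` items (low bits of the final byte)."""
    bits = unpack_msb_first(data)
    return bits[: len(bits) - padding]


def ignore_leading_bits_of_last_byte(data, padding):
    """Reading B: drop the high `padding` bits of the final byte instead."""
    bits = unpack_msb_first(data)
    return bits[:-8] + bits[-8:][padding:]


data = bytes([0b11100000])
reading_a = ignore_trailing_items(data, 3)             # [1, 1, 1, 0, 0]
reading_b = ignore_leading_bits_of_last_byte(data, 3)  # [0, 0, 0, 0, 0]
```

      Both readings satisfy the phrase "number of bits to ignore" yet produce different vectors, so the text should name the side of the byte or define the padding in terms of items.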

      The spec claims, before giving examples, that "all values use the little-endian format". In fact, this is only true for the byte order of individual array items that are more than a byte in size. Items smaller than a byte (packed_bit) are packed in big-endian order (most significant bit first).
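
      The contrast can be sketched in a few lines of Python. This is an illustration under the assumptions stated in this ticket, not driver code, and `pack_bits_msb_first` is a hypothetical helper: a multi-byte item is serialized least significant byte first, while one-bit items are placed most significant bit first.

```python
import struct


def pack_bits_msb_first(bits):
    """Pack 0/1 items into bytes; the first item lands in the top bit."""
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


# A multi-byte item: float32 1.0 (0x3F800000) in little-endian byte order.
float_le = struct.pack("<f", 1.0)

# Sub-byte items: a single leading 1 ends up in the most significant bit.
packed = pack_bits_msb_first([1, 0, 0, 0, 0, 0, 0, 0])

print(float_le.hex())  # 0000803f
print(packed.hex())    # 80
```

      The first item of the bit vector occupies the most significant position of its byte, which is the opposite convention to the one applied to the bytes of the float.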

      Some of the confusion can be attributed to a disagreement in the definition of big/little endian. A general definition applicable to both bits and bytes would be that "big" order pairs most significant with lowest address (often first / leftmost) and "little" pairs least significant with lowest address. This general usage is already contained in the spec's reference to numpy.unpackbits, which uses the general interpretation of endianness.
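
      The general definition can be shown with one sketch that applies the same two labels to bytes within an integer and to bits within a byte. The `unpack_bits` helper is hypothetical; its `bitorder` argument is modeled on the "big" and "little" values accepted by numpy.unpackbits, but the code uses only the Python standard library.

```python
def unpack_bits(byte, bitorder):
    """Return the 8 bits of `byte` as a list, index 0 being the lowest address."""
    if bitorder == "big":
        # Most significant bit paired with the lowest index.
        return [(byte >> (7 - i)) & 1 for i in range(8)]
    if bitorder == "little":
        # Least significant bit paired with the lowest index.
        return [(byte >> i) & 1 for i in range(8)]
    raise ValueError("bitorder must be 'big' or 'little'")


# Bytes within an integer: the same pairing rule, one level up.
bytes_big = (0x0102).to_bytes(2, "big")        # b"\x01\x02"
bytes_little = (0x0102).to_bytes(2, "little")  # b"\x02\x01"

# Bits within a byte.
bits_big = unpack_bits(0x01, "big")        # [0, 0, 0, 0, 0, 0, 0, 1]
bits_little = unpack_bits(0x01, "little")  # [1, 0, 0, 0, 0, 0, 0, 0]
```

      Under this definition, "big" and "little" describe which end of a value sits at the lowest address, regardless of whether the unit being ordered is a bit or a byte.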

      The second question in the FAQ relates to this issue as well. It claims that we would "choose to use integers in [0, 256)" because "this technique is widely used...", but in fact the reason is that BSON encoding is the process of producing a byte stream, and bytes are the smallest practical unit on most computers. So, in this context, the choice of bytes isn't relevant. But there's an unanswered question nearby: for packed_bit we did in fact have a choice of word width, but only because of the mixed bit order scheme! If packed_bit were instead defined to be LSB-first, in a world where wider values are also little-endian, there would be no need to specify the word size for a packed_bit, because all word sizes would have equivalent behavior.
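
      The word-size claim can be checked with a small experiment. The sketch below is an illustration, not the spec's algorithm, and `pack_words` is a hypothetical helper: it packs a bit vector into words of a chosen width and serializes each word in little-endian byte order, once with LSB-first bit placement and once with MSB-first placement.

```python
def pack_words(bits, word_bits, lsb_first):
    """Pack bits into `word_bits`-wide words, then emit each word little-endian."""
    assert len(bits) % word_bits == 0
    out = bytearray()
    for start in range(0, len(bits), word_bits):
        word = 0
        for i, bit in enumerate(bits[start:start + word_bits]):
            shift = i if lsb_first else word_bits - 1 - i
            word |= bit << shift
        out += word.to_bytes(word_bits // 8, "little")
    return bytes(out)


bits = [1] + [0] * 31

# LSB-first bits with little-endian words: identical bytes for every width.
lsb = [pack_words(bits, w, lsb_first=True) for w in (8, 16, 32)]

# MSB-first bits with little-endian words: the width changes the bytes.
msb = [pack_words(bits, w, lsb_first=False) for w in (8, 16, 32)]

print([b.hex() for b in lsb])  # ['01000000', '01000000', '01000000']
print([b.hex() for b in msb])  # ['80000000', '00800000', '00000080']
```

      With a consistent little-endian convention, item n always lands in byte n // 8 at bit n % 8, so the word width drops out. With the mixed convention the width has to be pinned down, which is why the byte-wide choice matters for packed_bit.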

       

      Motivation

      Who is the affected end user?

      Developers who read our specs.

      How does this affect the end user?

      Developers are confused, or they lose confidence in the specification.

      How likely is it that this problem or use case will occur?

      It's a new feature, so there's pressure to get it right, but odds are that relatively few people are watching.

      If the problem does occur, what are the consequences and how severe are they?

      Unlikely to result in an incorrect implementation, due to test coverage. Just a potential loss of developer time and confidence.
      Lack of specificity about how future dtypes operate may hamper interoperability.

      Is this issue urgent?

      No

      Is this ticket required by a downstream team?

      Drivers

      Is this ticket only for tests?

      No

      Acceptance Criteria

      Peer review of spec changes, no known inconsistencies, careful description of data formatting concepts both for specified types and for planned future types.

            Assignee:
            micah.scott@mongodb.com Micah Scott
            Reporter:
            micah.scott@mongodb.com Micah Scott