  1. Python Driver
  2. PYTHON-1504

isLegalUTF8 check in bson is not 100% correct

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor - P4
    • Fix Version/s: 3.7
    • Affects Version/s: None
    • Component/s: BSON
    • Labels: None

      The check for valid UTF-8 bytes in BSON does not catch all cases.
      Take, for example, the following code (run with Python 2.7 and pymongo 3.4):

      import bson
      
      # data from the cpe
      bad_data = "\xf4\\\x89\x93';"
      
      # encode it with pymongo's bson
      m = bson.BSON.encode({'x': bad_data})
      
      # decode it again (this should work, right?)
      bson.BSON.decode(m)
      
      # It does not:
      # InvalidBSON: 'utf8' codec can't decode byte 0xf4 in position 0: invalid continuation byte
      

      And yes, asking Python itself to decode the data as UTF-8 fails as well:

      bad_data.decode('utf8')
      

      I think the check in the BSON module does not validate the input correctly when the first byte is 244 (0xF4).
      According to https://en.wikipedia.org/wiki/UTF-8#Codepage_layout, not every byte combination is allowed after 244. Python checks this, but pymongo's BSON does not.
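
The constraint on bytes following 244 can be sketched in a few lines. This is a minimal illustration written for Python 3 (the report itself used Python 2.7). The helper names `is_valid_f4_sequence` and `python_accepts` are hypothetical and are not part of pymongo or bson. The sketch assumes the usual UTF-8 rule that a leading byte of 0xF4 must be followed by a second byte in 0x80..0x8F and two continuation bytes in 0x80..0xBF, because anything else is either malformed or would encode a code point above U+10FFFF.

```python
def is_valid_f4_sequence(data):
    """Return True if `data` is a well-formed 4-byte UTF-8 sequence led by 0xF4.

    Hypothetical helper for illustration only; not part of pymongo/bson.
    """
    if len(data) != 4 or data[0] != 0xF4:
        return False
    # The second byte is limited to 0x80..0x8F so the code point stays <= U+10FFFF.
    if not 0x80 <= data[1] <= 0x8F:
        return False
    # The third and fourth bytes are ordinary continuation bytes.
    return all(0x80 <= b <= 0xBF for b in data[2:])


def python_accepts(data):
    """Ask Python's own strict UTF-8 decoder whether `data` is valid."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# First four bytes of the sample from the report: 0xF4 followed by a backslash.
sample = b"\xf4\\\x89\x93"
print(is_valid_f4_sequence(sample), python_accepts(sample))
```

Both checks reject the sample, since its second byte (0x5C, a backslash) is not in 0x80..0x8F.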

      I attached an example program that tries all 2**32 possible four-byte values (it takes about 60 minutes on an 8-core machine). As the output below shows, when the first byte is 244, Python and pymongo disagree on whether the input is valid for 524288 combinations.

      python test_all.py
      .....
      range with leading byte 243 is good
      with leading byte 244 this amount differs 524288
      range with leading byte 245 is good
      .......
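
The attached test_all.py is not reproduced here. The following is a scaled-down sketch of the same kind of comparison, written for Python 3. `lax_check` is a hypothetical, deliberately incomplete validator standing in for the code under test; it is not taken from pymongo. `count_disagreements` and `python_accepts` are likewise illustrative names.

```python
from itertools import product


def python_accepts(data):
    """Ask Python's own strict UTF-8 decoder whether `data` is valid."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def lax_check(data):
    """Hypothetical, deliberately incomplete validator (NOT pymongo's code).

    For a 0xF4 lead byte it enforces only the upper bound on the second byte.
    """
    lead, second, third, fourth = data
    if lead != 0xF4:
        return python_accepts(data)
    tail_ok = 0x80 <= third <= 0xBF and 0x80 <= fourth <= 0xBF
    return tail_ok and second <= 0x8F  # missing: second >= 0x80


def count_disagreements(lead, candidate, seconds=range(256),
                        thirds=range(256), fourths=range(256)):
    """Count 4-byte inputs on which `candidate` and Python's decoder differ."""
    differing = 0
    for second, third, fourth in product(seconds, thirds, fourths):
        data = bytes([lead, second, third, fourth])
        if candidate(data) != python_accepts(data):
            differing += 1
    return differing


# Small slice only, so the example finishes quickly; the full 2**24 sweep for
# one leading byte works the same way but takes much longer.
print(count_disagreements(0xF4, lax_check,
                          thirds=range(0x80, 0x84), fourths=range(0x80, 0x84)))
```

With the full default ranges (2**24 inputs per leading byte), this stand-in disagrees with Python on 128 * 64 * 64 = 524288 inputs, the same figure as in the output above. That match is suggestive only; the report does not confirm that this is the defect in bson/encoding_helpers.c.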
      

      My tests were done with Python 2.7 and with pymongo 2.8 and 3.4. However, judging by the history of the validation code, newer versions (3.6.1) and older versions also seem to be affected:
      https://github.com/mongodb/mongo-python-driver/blame/3.6.1/bson/encoding_helpers.c

      Out of curiosity, I hacked up a small Python 2 C extension (see attached) that uses the UTF-8 validation from the MongoDB C driver:
      https://github.com/mongodb/mongo-c-driver/blob/master/src/libbson/src/bson/bson-utf8.c

      This validation agrees with Python in every case, with no differences. My proposal is therefore to replace the validation code in pymongo's bson with the one from the C driver.

        1. _isutf8.c
          4 kB
        2. test_all.py
          1 kB

            Assignee:
            bernie@mongodb.com Bernie Hackett
            Reporter:
            Stephan2018 Stephan Hof