[CDRIVER-4453] libbson is slow to JSON encode and decode large strings Created: 11/Aug/22  Updated: 05/Feb/24

Status: Backlog
Project: C Driver
Component/s: Performance
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Unknown
Reporter: Shane Harvey Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to PYTHON-3395 Make ObjectId properly convert when b... Closed
Epic Link: Improve Developer Experience
Quarter: FY25Q2

 Description   

The Python team maintains a JSON library (https://github.com/mongodb-labs/python-bsonjs) built on top of libbson for better performance, and our benchmarks show it is about 10x faster than PyMongo's built-in JSON encoder. However, a user reported, and I have reproduced (PYTHON-3395), that on documents composed of large string fields PyMongo is roughly 4x faster than libbson for both encoding and decoding.

For reference here's an example:

"""Benchmark bsonjs (libbson) vs bson.json_util (pymongo)."""
import timeit
import bsonjs
import bson
from bson import json_util
 
doc = {
    '_id': bson.ObjectId(),
    'string': 's'*20000
}
b = bson.encode(doc)
j = json_util.dumps(doc)
 
def time(fn, iterations=25):
    print('Timing: ' + fn.__name__)
    best = min(timeit.Timer(fn).repeat(5, number=iterations))
    print('{0} loops, best of 5: {1}'.format(iterations, best))
    return best
 
def compare(bsonjs_stmt, json_util_stmt):
    bsonjs_secs = time(bsonjs_stmt)
    json_util_secs = time(json_util_stmt)
    print('bsonjs is {0:.2f}x faster than json_util\n'.format(
        json_util_secs/bsonjs_secs))
 
def dumps_bsonjs():
    bsonjs.dumps(b)
 
def dumps_json_util():
    json_util.dumps(bson.decode(b))
 
def loads_bsonjs():
    bsonjs.loads(j)
 
def loads_json_util():
    bson.encode(json_util.loads(j))
 
def main():
    compare(dumps_bsonjs, dumps_json_util)
    compare(loads_bsonjs, loads_json_util)
 
if __name__ == "__main__":
    main()

And the output:

$ python3.10 benchmark_str_perf.py
Timing: dumps_bsonjs
25 loops, best of 5: 0.00783308400423266
Timing: dumps_json_util
25 loops, best of 5: 0.002030832998570986
bsonjs is 0.26x faster than json_util
 
Timing: loads_bsonjs
25 loops, best of 5: 0.001949673009221442
Timing: loads_json_util
25 loops, best of 5: 0.000629648013273254
bsonjs is 0.32x faster than json_util
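
Note that each "best of 5" figure above is the total time for 25 iterations, not the time per call. A small helper illustrating the conversion (hypothetical, not part of the script above):

```python
# Hypothetical helper, not in the original script: timeit.Timer.repeat(5, number=25)
# returns totals for 25 calls, so divide by the iteration count to get per-call time.
def per_call(best_total: float, iterations: int = 25) -> float:
    return best_total / iterations

# Applied to the first dumps_bsonjs figure above (~3.13e-4 s per bsonjs.dumps call):
print(per_call(0.00783308400423266))
```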

Removing the large string from the document restores the expected performance improvement:

doc = {
    '_id': bson.ObjectId(),
    'string': 's'*10,
    'foo': [1, 2],
    'bar': {'hello': 'world'},
    'date': datetime.datetime(2009, 12, 9, 15),  # requires adding "import datetime"
}
...
$ python3.10 benchmark_str_perf.py
Timing: dumps_bsonjs
25 loops, best of 5: 0.00018512399401515722
Timing: dumps_json_util
25 loops, best of 5: 0.001294998000958003
bsonjs is 7.00x faster than json_util
 
Timing: loads_bsonjs
25 loops, best of 5: 0.00016003800556063652
Timing: loads_json_util
25 loops, best of 5: 0.0011928190069738775
bsonjs is 7.45x faster than json_util
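
The large-string sensitivity can be sanity-checked without libbson at all. The following stdlib-only proxy uses Python's json module, which does not exercise libbson's code path, purely to illustrate that once the string is large, encoding time is dominated by escaping/copying the string rather than by document structure (the '_id' literal here is just a placeholder standing in for an ObjectId hex string):

```python
import json
import timeit

# Encode two documents that differ only in string length. For the large
# string, nearly all of json.dumps' time goes into the string field itself.
for n in (10, 20_000):
    doc = {'_id': 'aaaaaaaaaaaaaaaaaaaaaaaa', 'string': 's' * n}
    best = min(timeit.Timer(lambda: json.dumps(doc)).repeat(5, number=25))
    print(f'n={n}: 25 loops, best of 5: {best:.6f}s')
```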

CC: colby.pike@mongodb.com


Generated at Wed Feb 07 21:20:58 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.