Node.js Driver / NODE-6246

ObjectIds stored as Buffer in memory are memory inefficient and slow

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: BSON

      We have been running into memory issues while pulling large MongoDB result sets into memory, and after a lot of memory profiling the issue appears to be related to ObjectId, specifically how memory inefficient Buffer/ArrayBuffer is when storing lots of small Buffers.

      We were expecting an ObjectId to consume ~12 bytes of memory (plus some overhead), but in reality it consumes 128 bytes per ObjectId (96 bytes for the Buffer alone).

      I opened an issue with the Node.js performance team, but this appears to be working as designed: https://github.com/nodejs/performance/issues/173

      Storing a string in Node/V8 is much more memory efficient since it's a primitive. A 24-character hex string only consumes 40 bytes of memory, and it's also much faster to serialize/deserialize.
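
      The kind of measurement that surfaces this difference can be sketched roughly as follows. This is only an illustration: the script name and helper are made up, it assumes Node.js run with --expose-gc, and the exact bytes-per-item will vary by Node/V8 version.

      // Run with: node --expose-gc objectid-memory.js
      // Minimal sketch comparing retained memory of many small Buffers vs 24-char hex strings.
      'use strict';

      const N = 1_000_000;

      // heapUsed covers the JS object overhead; arrayBuffers covers the Buffer backing stores.
      const used = () => {
        const m = process.memoryUsage();
        return m.heapUsed + m.arrayBuffers;
      };

      function measure(label, makeItem) {
        global.gc();
        const before = used();
        const items = new Array(N);
        for (let i = 0; i < N; i++) items[i] = makeItem(i);
        global.gc();
        const after = used();
        console.log(`${label}: ~${((after - before) / N).toFixed(0)} bytes/item`);
        return items; // keep a reference so the items are not collected
      }

      const buffers = measure('12-byte Buffer', () => Buffer.alloc(12));
      const strings = measure('24-char hex string', (i) => i.toString(16).padStart(24, '0'));
      console.log('retained', buffers.length + strings.length, 'items');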

       

      Memory Usage

      With this change, retained memory usage is reduced by ~45% (ObjectId size decreases from 128 bytes to 72 bytes) and young-generation memory allocation is reduced by ~57% during deserialization. Currently, the hex string (40 bytes) already needs to be allocated in memory as EJSON, then deserialized into BSON (which allocates 96 bytes for the buffer), and then the string needs to be garbage collected. Since V8 uses a string interning pool, the string does not need to be reallocated if we persist it in the ObjectId.

      By reducing the amount of memory that needs to be allocated, we further improve performance, since garbage collection is quite expensive.

      Performance Tests

      I made the change to persist the ObjectId as a hex string in memory, so a Buffer is only needed to generate a new primary key. Tested on an M1 MacBook Pro.

       

      Operation               ObjectId (ops/sec)   ObjectId STRING (ops/sec)   Relative Improvement (%)
      new ObjectId()                  10,662,177                   5,842,941                    -45.18%
      new ObjectId(string)             5,316,351                  11,913,614                    118.00%
      deserialize                      3,630,156                  11,401,144                    214.06%
      serialize                       11,498,443                  82,022,643                    613.54%
      toHexString                     11,314,960                  81,086,841                    616.57%
      equals                          39,653,912                  44,413,312                     11.99%

      new ObjectId() does appear to be slower at first glance because it still uses a Buffer to generate the ObjectId and then has to convert that buffer to a hex string. However, once you consider that most code paths will serialize that ObjectId after creating a new PK, the combined operation still comes out marginally faster:

      Operation                     ObjectId (ops/sec)   ObjectId STRING (ops/sec)   Relative Improvement (%)
      new ObjectId() + serialize             5,785,814                   5,900,631                         2%
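
      For context, a rough sketch of how numbers like these could be gathered against the published bson package is below. It only reproduces the current Buffer-backed ObjectId column (the STRING column requires the proposed change), and the benchmark loop, iteration count, and variable names are assumptions rather than the actual harness used above.

      // Rough baseline sketch; results vary by hardware and Node.js version.
      'use strict';
      const { ObjectId, serialize, deserialize } = require('bson');

      function opsPerSec(label, fn, iterations = 1_000_000) {
        fn(); // warm up
        const start = process.hrtime.bigint();
        for (let i = 0; i < iterations; i++) fn();
        const seconds = Number(process.hrtime.bigint() - start) / 1e9;
        console.log(`${label}: ${Math.round(iterations / seconds).toLocaleString()} ops/sec`);
      }

      const oid = new ObjectId();
      const hex = oid.toHexString();
      const other = new ObjectId(hex);
      const doc = { _id: oid };
      const bytes = serialize(doc);

      opsPerSec('new ObjectId()', () => new ObjectId());
      opsPerSec('new ObjectId(string)', () => new ObjectId(hex));
      opsPerSec('deserialize', () => deserialize(bytes));
      opsPerSec('serialize', () => serialize(doc));
      opsPerSec('toHexString', () => oid.toHexString());
      opsPerSec('equals', () => oid.equals(other));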

      Example

      Let's say we have a MongoDB collection of user documents shaped like this, where each document points to another document:

      {
        _id: ObjectId,
        name: string,
        addressId: ObjectId
      }

      Let's say we want to grab a large number of documents, 1 million in this example. We project to just the ObjectId fields so other types don't skew our numbers:

      const docs = await collection.find({}, { limit: 1000000, projection: { _id: 1, addressId: 1 } }).toArray() // grab 1 million documents
      

      BEFORE: 471 MB used

      AFTER: 262 MB used
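
      A minimal sketch of how a before/after reading like this could be taken is below. The connection string, database, and collection names are placeholders, and heapUsed + arrayBuffers is just one reasonable proxy for retained memory.

      'use strict';
      const { MongoClient } = require('mongodb');

      async function main() {
        // Placeholder connection string, database, and collection names.
        const client = await MongoClient.connect('mongodb://localhost:27017');
        const collection = client.db('test').collection('users');

        const docs = await collection
          .find({}, { limit: 1000000, projection: { _id: 1, addressId: 1 } })
          .toArray();

        if (global.gc) global.gc(); // run with --expose-gc for a more stable reading
        const { heapUsed, arrayBuffers } = process.memoryUsage();
        console.log(`${docs.length} docs retained, ~${Math.round((heapUsed + arrayBuffers) / 1024 / 1024)} MB used`);

        await client.close();
      }

      main().catch(console.error);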

      Use Case

      The above example was a little contrived, but we do have use cases where we pull large numbers of documents into memory to build Directed Acyclic Graphs, where each document represents an edge in the graph (this is particularly necessary since MongoDB's $graphLookup is quite limited).

      This could also occur frequently on a server where many users are requesting small batches of documents. 
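
      As a purely hypothetical illustration of why toHexString throughput matters in this pattern, an in-memory edge index over the user/address shape above might look like the following, where docs is the array fetched earlier and the Map is keyed by hex string (one conversion per document):

      // Hypothetical sketch: index one edge (user -> address) per document.
      const edgesByUser = new Map();
      for (const doc of docs) {
        edgesByUser.set(doc._id.toHexString(), doc.addressId.toHexString());
      }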

      User Impact

      • I believe this issue greatly affects all users of MongoDB in JavaScript environments (Node.js/browser) regardless of use case, just to different extents. By making this change I believe we would improve performance and reduce memory usage for all users.

      Acceptance Criteria

      Implementation Requirements

      • The ObjectId class should only persist the hex string representation of the ObjectId
      • The ObjectId class API should not have any breaking changes
      • All current ObjectId functionality should behave as it does today
      • Performance for most use cases must not degrade
      • The cacheHexString property should be deprecated
      • Optional: add a cacheBuffer option to cache the buffer

      Testing Requirements

      • unit tests (e.g., round-trip invariants like the sketch below)
      • perf tests
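
      An illustrative sketch of the kind of round-trip invariants such unit tests might assert to back the "no breaking changes" requirement, using only the existing bson ObjectId API (not the actual test suite):

      'use strict';
      const assert = require('node:assert/strict');
      const { ObjectId } = require('bson');

      const oid = new ObjectId();
      const hex = oid.toHexString();

      assert.equal(hex.length, 24);                        // 24-character hex representation
      assert.ok(new ObjectId(hex).equals(oid));            // construction from string preserves equality
      assert.equal(new ObjectId(hex).toHexString(), hex);  // hex string round-trips exactly
      assert.equal(oid.id.byteLength, 12);                 // 12-byte binary form is still available
      assert.ok(ObjectId.isValid(hex));                    // validation behavior unchanged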

      Documentation Requirements

      • DOCSP ticket, API docs, etc

      Follow Up Requirements

      • additional tickets to file, required releases, etc

            Assignee: Unassigned
            Reporter: Sean Reece (sean.reece@precisely.com)
            Votes: 0
            Watchers: 5