Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: LangGraph
Python Drivers
Target repo: langchain-ai/langchain-mongodb — libs/langgraph-checkpoint-mongodb
Summary
MongoDBSaver currently serializes the entire Checkpoint object — including channel values that are plain JSON primitives (str/int/float/bool/None) — into a single opaque BSON Binary field via JsonPlusSerializer.dumps_typed. The Postgres saver (langgraph-checkpoint-postgres) already handles this case in a smarter way: JSON-primitive channel values are stored natively in the JSONB column, while only rich Python objects fall through to the msgpack-encoded blob path. I'd like the Mongo saver to do the same, storing primitive channel values as native BSON in the document and reserving the Binary envelope for values that genuinely need it.
Current behavior
In MongoDBSaver.put, the entire checkpoint is serialized in one shot and the resulting bytes go straight into the document as Binary:
```python
type_, serialized_checkpoint = self.serde.dumps_typed(checkpoint)
...
doc = {
    "parent_checkpoint_id": config["configurable"].get("checkpoint_id"),
    "type": type_,
    "checkpoint": serialized_checkpoint,  # <- entire state, as Binary
    "metadata": dumps_metadata(self.serde, metadata),
}
...
self.checkpoint_collection.update_one(upsert_query, {"$set": doc}, upsert=True)
```
put_writes makes the same trade-off for pending writes, putting each whole serialized value into a value: Binary field.
The result is that even a channel like system_prompt = "You are a deterministic echo bot." lives inside the opaque Binary blob, invisible to MongoDB's query planner, projection, aggregation, and indexing.
How langgraph-checkpoint-postgres solves this
The Postgres saver makes a per-channel decision at write time, splitting channel_values into a JSONB-inline portion and a blob portion before persisting:
```python
# inline primitive values in checkpoint table
# others are stored in blobs table
blob_values = {}
for k, v in checkpoint["channel_values"].items():
    if isinstance(v, _DeltaSnapshot):
        blob_values[k] = copy["channel_values"].pop(k)
        copy["channel_values"][k] = True
    elif v is None or isinstance(v, (str, int, float, bool)):
        pass  # stays inline as JSONB
    else:
        blob_values[k] = copy["channel_values"].pop(k)  # -> msgpack blob
```
JSON-primitive scalars stay native in the JSONB column; complex values go to checkpoint_blobs (which also gets (channel, version) deduplication, but that's a separate optimization). On read, the saver merges the two halves back together: postgres/__init__.py L574
The user-visible result is that running the same workload against both backends produces, in Postgres, a row whose checkpoint JSONB contains:
"channel_values": { "system_prompt": "You are a deterministic echo bot." }
…fully queryable via SELECT checkpoint->'channel_values'->>'system_prompt' FROM checkpoints WHERE …. In MongoDB, the same value is bytes inside Binary.
Proposal
In MongoDBSaver.put, mirror the Postgres split:
- Walk checkpoint["channel_values"]. Partition keys into native_values (anything BSON can encode losslessly — at minimum str, int, float, bool, None; potentially also list/dict of the same; datetime is a candidate too) and binary_values (everything else).
- Make a shallow copy of the checkpoint with channel_values replaced by just native_values. Serialize that via dumps_typed (or skip serialization entirely for the envelope and store versions_seen/channel_versions/etc. as native sub-documents — see "stretch" below).
- Store the rich values either:
  - (a) Minimal change: as a sibling field channel_values_binary: { <key>: Binary, ... } plus a sibling channel_values_types: { <key>: type_str, ... }. Reads merge the two on the way out.
  - (b) Mongo-native: in a sibling collection checkpoint_blobs analogous to Postgres, indexed on (thread_id, checkpoint_ns, channel, version), which would also unlock (channel, version) deduplication for the long-stable-channel case (e.g. a constant system prompt across a 10k-turn thread).
get_tuple / list need a corresponding merge step. The JsonPlusSerializer is still used for the rich values, preserving full type fidelity (Pydantic models, BaseMessage subclasses, etc.).
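A minimal sketch of the option (a) partition and merge, to make the shape concrete. It is illustrative only: pickle stands in for the saver's serde so the snippet runs without langgraph installed, and the helper names (is_native, split_channel_values, merge_channel_values) are hypothetical. The 64-bit range check reflects the fact that BSON integers are at most 64 bits wide, so larger Python ints are routed to the binary side.

```python
import pickle
from typing import Any

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1  # BSON integers are at most 64-bit


def dumps_typed(value: Any) -> tuple[str, bytes]:
    """Stand-in serde for this sketch; the real saver would call self.serde."""
    return "pickle", pickle.dumps(value)


def loads_typed(typed: tuple[str, bytes]) -> Any:
    """Inverse of the stand-in serde above."""
    type_, data = typed
    if type_ != "pickle":
        raise ValueError(f"unknown type tag: {type_}")
    return pickle.loads(data)


def is_native(value: Any) -> bool:
    """True for the minimum scalar set named in the proposal (hypothetical helper)."""
    if value is None or isinstance(value, (str, float, bool)):
        return True
    if isinstance(value, int):
        return INT64_MIN <= value <= INT64_MAX
    return False


def split_channel_values(channel_values: dict[str, Any]):
    """Partition channel values into native values, binary payloads, and type tags."""
    native: dict[str, Any] = {}
    binary: dict[str, bytes] = {}
    types: dict[str, str] = {}
    for key, value in channel_values.items():
        if is_native(value):
            native[key] = value
        else:
            types[key], binary[key] = dumps_typed(value)
    return native, binary, types


def merge_channel_values(native, binary, types) -> dict[str, Any]:
    """Rebuild the original mapping on read by decoding the binary side."""
    merged = dict(native)
    for key, data in binary.items():
        merged[key] = loads_typed((types[key], data))
    return merged
```

In a real write path the three results would map onto checkpoint.channel_values, channel_values_binary, and channel_values_types respectively; whether lists and dicts of scalars also count as native is left open here, as in the proposal.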
The same split applies to put_writes: small scalar writes become native fields; rich writes stay as value: Binary.
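For pending writes, the per-value decision could look roughly like the sketch below. The "native" type tag and the encode_write / decode_write helpers are hypothetical, the channel and type field names are assumptions about the write document, and pickle again stands in for the real serde. The 64-bit integer range check is omitted for brevity.

```python
import pickle
from typing import Any

NATIVE_TAG = "native"  # hypothetical marker; not part of the current schema


def encode_write(channel: str, value: Any) -> dict[str, Any]:
    """Build the per-write fields: scalars stay native, everything else becomes bytes."""
    if value is None or isinstance(value, (str, int, float, bool)):
        return {"channel": channel, "type": NATIVE_TAG, "value": value}
    # Stand-in for serde.dumps_typed; a real saver would wrap the bytes as Binary.
    return {"channel": channel, "type": "pickle", "value": pickle.dumps(value)}


def decode_write(doc: dict[str, Any]) -> tuple[str, Any]:
    """Return (channel, value), decoding the payload only when it is not native."""
    if doc["type"] == NATIVE_TAG:
        return doc["channel"], doc["value"]
    return doc["channel"], pickle.loads(doc["value"])
```

Keeping an explicit type tag on every write means the read path never has to guess whether value is a native scalar or an encoded payload.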
Why this matters
- Server-side query and projection. Today, filtering or projecting on a channel value requires pulling and deserializing the full envelope client-side. After this change, scalar channels would be reachable via db.checkpoints.find({"checkpoint.channel_values.system_prompt": "..."}) and projectable.
- Indexability. Native scalar fields can be indexed. A common ask is "find all threads where a given user_id channel equals X" — currently impossible without scanning every document and decoding.
- Aggregations. $group / $match over scalar channel values becomes possible.
- Storage shape parity with Postgres. Today the same workload looks dramatically different in the two backends; option (b) above would also bring dedup parity for stable channels (in our eval, 15 checkpoints produced 15 blob rows in Postgres but 15 fully-rewritten documents in Mongo).
- Mongo-idiomatic. Storing structured data as opaque Binary defeats the point of using a document database. The current shape is "blobs in lipstick" — a Mongo-first design would either go all-native (which has type-fidelity problems for langgraph state) or do the hybrid that Postgres already does.
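To illustrate the query, index, and aggregation shapes that the points above would unlock, here is a small sketch that only builds the plain Python structures involved, so it runs without a MongoDB server. The checkpoint.channel_values path is the one used in this issue and presumes channel values are stored as a native sub-document; the thread_id field in the grouping stage and all helper names are assumptions.

```python
from typing import Any

CHANNEL_PREFIX = "checkpoint.channel_values"  # path as written in this issue


def channel_filter(channel: str, value: Any) -> dict[str, Any]:
    """Filter document matching checkpoints whose scalar channel equals value."""
    return {f"{CHANNEL_PREFIX}.{channel}": value}


def channel_index_keys(channel: str) -> list[tuple[str, int]]:
    """Key specification for an ascending index on one scalar channel."""
    return [(f"{CHANNEL_PREFIX}.{channel}", 1)]


def checkpoints_per_thread_pipeline(channel: str, value: Any) -> list[dict[str, Any]]:
    """Aggregation pipeline: match on a channel value, then count checkpoints per thread."""
    return [
        {"$match": channel_filter(channel, value)},
        # Assumes each checkpoint document carries a top-level thread_id field.
        {"$group": {"_id": "$thread_id", "checkpoints": {"$sum": 1}}},
    ]
```

With PyMongo these structures would be passed to Collection.find, Collection.create_index, and Collection.aggregate respectively.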
What stays the same
- JsonPlusSerializer remains the round-trip encoder for rich values — same serde the rest of the langgraph ecosystem uses, same type-tagging guarantees.
- The class signature, from_conn_string API, and MongoClient/AsyncMongoClient story are unchanged.
- Backward compatibility on read: detect old documents by the absence of the new fields and fall back to the current full-Binary path. New writes use the new shape.
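The backward-compatible read could be sketched as follows, assuming new-shape documents always carry a channel_values_binary field (even when empty) and keep their native values in a checkpoint.channel_values sub-document. Both of those layout details, and the load_channel_values helper, are assumptions for illustration; pickle stands in for the real serde.

```python
import pickle
from typing import Any


def loads_typed(typed: tuple[str, bytes]) -> Any:
    """Stand-in for serde.loads_typed so the sketch runs without langgraph."""
    type_, data = typed
    if type_ != "pickle":
        raise ValueError(f"unknown type tag: {type_}")
    return pickle.loads(data)


def load_channel_values(doc: dict[str, Any]) -> dict[str, Any]:
    """Read channel values from either the legacy or the proposed document shape."""
    if "channel_values_binary" not in doc:
        # Legacy shape: the whole checkpoint is one serialized payload.
        checkpoint = loads_typed((doc["type"], doc["checkpoint"]))
        return checkpoint["channel_values"]
    # Proposed shape: native values inline, rich values decoded per key.
    values = dict(doc["checkpoint"]["channel_values"])
    types = doc["channel_values_types"]
    for key, data in doc["channel_values_binary"].items():
        values[key] = loads_typed((types[key], data))
    return values
```

Because detection keys off a field that old documents never have, no migration of existing data would be needed for reads to keep working.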
Stretch goals (separate PRs welcome)
- Apply the same split to the top-level checkpoint envelope so fields like versions_seen, channel_versions, id, ts, v become native sub-documents/fields. These are scalars or string-keyed dicts of scalars, which BSON encodes losslessly. This would make Mongo's stored shape essentially identical to Postgres's checkpoints.checkpoint JSONB.
- Add a checkpoint_blobs collection (option (b) above) for (channel, version) dedup parity with Postgres.
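The deduplication idea behind a checkpoint_blobs collection can be sketched with an in-memory store keyed the same way as the proposed index. The BlobStore class is hypothetical and stands in for the collection; against a real collection, one way to get the same write-once behavior might be an upsert using $setOnInsert on a unique (thread_id, checkpoint_ns, channel, version) index.

```python
BlobKey = tuple[str, str, str, str]  # (thread_id, checkpoint_ns, channel, version)


class BlobStore:
    """In-memory stand-in for a checkpoint_blobs collection (illustration only)."""

    def __init__(self) -> None:
        self._blobs: dict[BlobKey, bytes] = {}

    def put_if_absent(
        self, thread_id: str, checkpoint_ns: str, channel: str, version: str, payload: bytes
    ) -> bool:
        """Store payload unless this channel version already exists; return True if written."""
        key = (thread_id, checkpoint_ns, channel, version)
        if key in self._blobs:
            return False  # unchanged channel version: nothing is rewritten
        self._blobs[key] = payload
        return True

    def get(self, thread_id: str, checkpoint_ns: str, channel: str, version: str) -> bytes:
        """Fetch the payload for one channel version."""
        return self._blobs[(thread_id, checkpoint_ns, channel, version)]

    def __len__(self) -> int:
        return len(self._blobs)
```

A channel whose version never changes is stored once no matter how many checkpoints reference it, while a channel that changes on every step still produces one blob per version.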
Reproduction / motivating data
Running an identical 5-invoke StateGraph workload (one constant system_prompt: str channel and one growing messages: list[dict] channel) against both savers produces:
| | Postgres | MongoDB |
|---|---|---|
| Constant scalar system_prompt storage | inline in JSONB, 0 blob rows after first write | inside Binary, re-serialized into every document |
| Server-side queryable on system_prompt? | yes (->>'system_prompt') | no |
| Total channel_values blob rows for 15 checkpoints | 15 (10 messages + 5 __start__) | n/a (every document holds the full state) |
Happy to share the eval script if useful.
Related code references
- Mongo put (current behavior): saver.py L411-L428
- Mongo put_writes (same pattern for writes): saver.py L472-L477
- Postgres split logic (target behavior to mirror): postgres/__init__.py L309-L322
- Postgres merge-on-read: postgres/__init__.py L574
- Postgres migrations / schema (for checkpoint_blobs design reference if option (b) is pursued): postgres/base.py