
Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: LangGraph
Assigned Teams: Python Drivers


Target repo: langchain-ai/langchain-mongodb — libs/langgraph-checkpoint-mongodb

Summary

MongoDBSaver currently serializes the entire Checkpoint object — including channel values that are plain JSON primitives (str/int/float/bool/None) — into a single opaque BSON Binary field via JsonPlusSerializer.dumps_typed. The Postgres saver (langgraph-checkpoint-postgres) already handles this case in a smarter way: JSON-primitive channel values are stored natively in the JSONB column, while only rich Python objects fall through to the msgpack-encoded blob path. I'd like the Mongo saver to do the same, storing primitive channel values as native BSON in the document and reserving the Binary envelope for values that genuinely need it.

Current behavior

In MongoDBSaver.put, the entire checkpoint is serialized in one shot and the resulting bytes go straight into the document as Binary:

saver.py L411-L428

type_, serialized_checkpoint = self.serde.dumps_typed(checkpoint)
...
doc = {
    "parent_checkpoint_id": config["configurable"].get("checkpoint_id"),
    "type": type_,
    "checkpoint": serialized_checkpoint,   # <- entire state, as Binary
    "metadata": dumps_metadata(self.serde, metadata),
}
...
self.checkpoint_collection.update_one(upsert_query, {"$set": doc}, upsert=True)

put_writes makes the same trade for pending writes, putting the whole serialized value into a value: Binary field:

saver.py L472-L477

The result is that even a channel like system_prompt = "You are a deterministic echo bot." lives inside the opaque Binary blob, invisible to MongoDB's query planner, projection, aggregation, and indexing.

How langgraph-checkpoint-postgres solves this

The Postgres saver makes a per-channel decision at write time, splitting channel_values into a JSONB-inline portion and a blob portion before persisting:

postgres/__init__.py L309-L322

# inline primitive values in checkpoint table
# others are stored in blobs table
blob_values = {}
for k, v in checkpoint["channel_values"].items():
    if isinstance(v, _DeltaSnapshot):
        blob_values[k] = copy["channel_values"].pop(k)
        copy["channel_values"][k] = True
    elif v is None or isinstance(v, (str, int, float, bool)):
        pass                                        # stays inline as JSONB
    else:
        blob_values[k] = copy["channel_values"].pop(k)   # -> msgpack blob

JSON-primitive scalars stay native in the JSONB column; complex values go to checkpoint_blobs (which also gets (channel, version) deduplication, but that's a separate optimization). On read, the saver merges the two halves back together: postgres/__init__.py L574

The user-visible result is that running the same workload against both backends produces, in Postgres, a row whose checkpoint JSONB contains:

"channel_values": { "system_prompt": "You are a deterministic echo bot." }

…fully queryable via SELECT checkpoint->'channel_values'->>'system_prompt' FROM checkpoints WHERE …. In MongoDB, the same value is bytes inside Binary.

Proposal

In MongoDBSaver.put, mirror the Postgres split:

Walk checkpoint["channel_values"]. Partition keys into native_values (anything BSON can encode losslessly: at minimum str, int within the signed 64-bit range, float, bool, None; potentially also list/dict of the same; datetime is a candidate only where millisecond precision is acceptable, since BSON dates drop microseconds) and binary_values (everything else).
Make a shallow copy of the checkpoint with channel_values replaced by just native_values. Serialize that via dumps_typed (or skip serialization entirely for the envelope and store versions_seen/channel_versions/etc. as native sub-documents — see "stretch" below).
Store the rich values either:
- (a) Minimal change: as a sibling field channel_values_binary: { <key>: Binary, ... } plus a sibling channel_values_types: { <key>: type_str, ... }. Reads merge the two on the way out.
- (b) Mongo-native: in a sibling collection checkpoint_blobs analogous to Postgres, indexed on (thread_id, checkpoint_ns, channel, version), which would also unlock (channel, version) deduplication for the long-stable-channel case (e.g. a constant system prompt across a 10k-turn thread).

get_tuple / list need a corresponding merge step. The JsonPlusSerializer is still used for the rich values, preserving full type fidelity (Pydantic models, BaseMessage subclasses, etc.).
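
The split and merge for option (a) can be sketched as follows. This is a minimal illustration under stated assumptions, not the saver's actual code: is_native, split_checkpoint, and merge_checkpoint are hypothetical names, the envelope is kept as a plain dict so that native channel values land in a sub-document, and dumps_typed / loads_typed are passed in as callables with the same (type, bytes) shape as the serializer's methods.

```python
from typing import Any, Callable

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def is_native(value: Any) -> bool:
    """Return True if `value` can live in the document as plain BSON."""
    # Exact type checks on purpose: a str-based Enum or a tuple would
    # silently lose its Python type if stored natively.
    if value is None or type(value) in (str, bool, float):
        return True
    if type(value) is int:
        # BSON integers are at most 64 bits wide.
        return INT64_MIN <= value <= INT64_MAX
    if type(value) is list:
        return all(is_native(item) for item in value)
    if type(value) is dict:
        return all(type(k) is str and is_native(v) for k, v in value.items())
    return False


def split_checkpoint(
    checkpoint: dict, dumps_typed: Callable[[Any], tuple[str, bytes]]
) -> dict:
    """Build the option (a) document fields from a checkpoint (hypothetical shape)."""
    native, binary, types = {}, {}, {}
    for key, value in checkpoint["channel_values"].items():
        if is_native(value):
            native[key] = value
        else:
            types[key], binary[key] = dumps_typed(value)
    # Shallow copy, so the caller's checkpoint is left untouched.
    envelope = {**checkpoint, "channel_values": native}
    return {
        "checkpoint": envelope,
        "channel_values_binary": binary,
        "channel_values_types": types,
    }


def merge_checkpoint(
    doc: dict, loads_typed: Callable[[tuple[str, bytes]], Any]
) -> dict:
    """Inverse of split_checkpoint: fold the rich values back in on read."""
    envelope = dict(doc["checkpoint"])
    values = dict(envelope["channel_values"])
    for key, payload in doc["channel_values_binary"].items():
        values[key] = loads_typed((doc["channel_values_types"][key], payload))
    envelope["channel_values"] = values
    return envelope
```

The predicate is deliberately stricter than the Postgres isinstance check: anything it rejects still round-trips through the serializer, so erring toward the Binary path costs queryability but never type fidelity.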

The same split applies to put_writes: small scalar writes become native fields; rich writes stay as value: Binary.
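
For pending writes, one possible shape is sketched below. It assumes each write document carries channel, type, and value fields, and it introduces a hypothetical "native" type tag to mark values stored without the Binary envelope; neither the tag nor the helper names come from the existing saver.

```python
from typing import Any, Callable

NATIVE_TYPE = "native"  # hypothetical marker, not an existing serde type tag
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def encode_write(
    channel: str, value: Any, dumps_typed: Callable[[Any], tuple[str, bytes]]
) -> dict:
    """Return the per-write fields, keeping small scalars native."""
    is_small_int = type(value) is int and INT64_MIN <= value <= INT64_MAX
    if value is None or type(value) in (str, float, bool) or is_small_int:
        return {"channel": channel, "type": NATIVE_TYPE, "value": value}
    # Rich values keep the existing typed-bytes path.
    type_, payload = dumps_typed(value)
    return {"channel": channel, "type": type_, "value": payload}


def decode_write(
    doc: dict, loads_typed: Callable[[tuple[str, bytes]], Any]
) -> tuple[str, Any]:
    """Return (channel, value) for either shape of write document."""
    if doc["type"] == NATIVE_TYPE:
        return doc["channel"], doc["value"]
    return doc["channel"], loads_typed((doc["type"], doc["value"]))
```

Tagging native values in the existing type field, rather than adding a new field, would let old write documents decode through the unchanged second branch.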

Why this matters

Server-side query and projection. Today, filtering or projecting on a channel value requires pulling and deserializing the full envelope client-side. After this change, scalar channels would be reachable via db.checkpoints.find({"checkpoint.channel_values.system_prompt": "..."}) and projectable.
Indexability. Native scalar fields can be indexed. A common ask is "find all threads where a given user_id channel equals X" — currently impossible without scanning every document and decoding.
Aggregations. $group / $match over scalar channel values becomes possible.
Storage shape parity with Postgres. Today the same workload looks dramatically different in the two backends; option (b) above would also bring dedup parity for stable channels (in our eval, 15 checkpoints produced 15 blob rows in Postgres but 15 fully-rewritten documents in Mongo).
Mongo-idiomatic. Storing structured data as opaque Binary defeats the point of using a document database. The current shape is "blobs in lipstick" — a Mongo-first design would either go all-native (which has type-fidelity problems for langgraph state) or do the hybrid that Postgres already does.
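
To make the first three points concrete, the sketch below builds the filter, index specification, and aggregation pipeline that would become usable once scalar channels are native. The field paths assume the document shape used in the find example above, and the user_id channel and helper names are illustrative; the final paths depend on the shape the maintainers choose.

```python
CHANNEL_PREFIX = "checkpoint.channel_values"  # assumed path, see note above


def channel_equals(channel: str, value) -> dict:
    """Filter for documents whose scalar channel equals `value`."""
    return {f"{CHANNEL_PREFIX}.{channel}": value}


def channel_index(channel: str) -> list:
    """Ascending single-field index specification for a scalar channel."""
    return [(f"{CHANNEL_PREFIX}.{channel}", 1)]


def count_by_channel(channel: str) -> list:
    """Pipeline counting checkpoints per distinct value of a scalar channel."""
    path = f"{CHANNEL_PREFIX}.{channel}"
    return [
        {"$match": {path: {"$exists": True}}},
        {"$group": {"_id": f"${path}", "count": {"$sum": 1}}},
    ]


# With a PyMongo collection `coll`, these would be used as:
#   coll.distinct("thread_id", channel_equals("user_id", "u-42"))
#   coll.create_index(channel_index("user_id"))
#   coll.aggregate(count_by_channel("system_prompt"))
```

Channel names containing a dot or starting with a dollar sign would not be addressable through such paths, so a real implementation may need to route those channels to the Binary side as well.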

What stays the same

JsonPlusSerializer remains the round-trip encoder for rich values — same serde the rest of the langgraph ecosystem uses, same type-tagging guarantees.
The class signature, from_conn_string API, and MongoClient/AsyncMongoClient story are unchanged.
Backward compatibility on read: detect old documents by the absence of the new fields and fall back to the current full-Binary path. New writes use the new shape.
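
The read-side fallback could look roughly like this. It is a sketch that assumes option (a) field names and treats the absence of channel_values_binary as the marker for an old document, so new writes would need to store that field even when it is empty; load_checkpoint is a hypothetical helper.

```python
from typing import Any, Callable


def load_checkpoint(
    doc: dict, loads_typed: Callable[[tuple[str, bytes]], Any]
) -> dict:
    """Decode a checkpoint document in either the old or the new shape."""
    if "channel_values_binary" not in doc:
        # Old shape: the whole checkpoint is one typed Binary payload.
        return loads_typed((doc["type"], doc["checkpoint"]))

    # New shape: native envelope plus per-channel typed payloads.
    checkpoint = dict(doc["checkpoint"])
    values = dict(checkpoint["channel_values"])
    types = doc["channel_values_types"]
    for key, payload in doc["channel_values_binary"].items():
        values[key] = loads_typed((types[key], payload))
    checkpoint["channel_values"] = values
    return checkpoint
```

Because the branch is chosen per document, a collection can hold both shapes during a rollout without a migration step.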

Stretch goals (separate PRs welcome)

Apply the same split to the top-level checkpoint envelope so fields like versions_seen, channel_versions, id, ts, v become native sub-documents/fields. These are scalars or string-keyed dicts of scalars, all of which BSON encodes losslessly. This would make Mongo's stored shape essentially identical to Postgres's checkpoints.checkpoint JSONB.
Add a checkpoint_blobs collection (option (b) above) for (channel, version) dedup parity with Postgres.
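
The deduplication behind option (b) can be illustrated with an in-memory stand-in for the proposed checkpoint_blobs collection. The class and method names here are hypothetical; in MongoDB the same write-once behavior would presumably come from a unique compound index on the key fields combined with an upsert that uses $setOnInsert.

```python
class BlobStore:
    """In-memory stand-in for a checkpoint_blobs collection (illustrative only)."""

    def __init__(self) -> None:
        # Key mirrors the proposed index: (thread_id, checkpoint_ns, channel, version)
        self._blobs: dict = {}

    def put(self, thread_id, checkpoint_ns, channel, version, type_, payload) -> bool:
        """Store a blob once per key; return True only if it was newly written."""
        key = (thread_id, checkpoint_ns, channel, version)
        if key in self._blobs:
            return False  # unchanged channel version: nothing to rewrite
        self._blobs[key] = (type_, payload)
        return True

    def get(self, thread_id, checkpoint_ns, channel, version):
        """Return the (type, payload) pair stored under the key."""
        return self._blobs[(thread_id, checkpoint_ns, channel, version)]

    def __len__(self) -> int:
        return len(self._blobs)
```

A channel whose version does not change between checkpoints is written once and then only referenced, which is where the saving for long-stable channels would come from.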

Reproduction / motivating data

Running an identical 5-invoke StateGraph workload (one constant system_prompt: str channel and one growing messages: list[dict] channel) against both savers produces:

	Postgres	MongoDB
Constant scalar `system_prompt` storage	inline in JSONB, 0 blob rows after first write	inside `Binary`, re-serialized into every document
Server-side queryable on `system_prompt`?	yes (`->>'system_prompt'`)	no
Total `channel_values` blob rows for 15 checkpoints	15 (10 `messages` + 5 `__start__`)	n/a; every document holds the full state

Happy to share the eval script if useful.

Related code references

Mongo put (current behavior): saver.py L411-L428
Mongo put_writes (same pattern for writes): saver.py L472-L477
Postgres split logic (target behavior to mirror): postgres/__init__.py L309-L322
Postgres merge-on-read: postgres/__init__.py L574
Postgres migrations / schema (for checkpoint_blobs design reference if option (b) is pursued): postgres/base.py
