- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Replication
- ALL
- Repl 2026-03-30, Repl 2026-04-13, Repl 2026-04-27
In disaggregated storage, ReplSetHeartbeatResponse::initialize() fails on every heartbeat response received by the primary from the standby. The error is NoSuchKey: Missing expected field "durableOpTime". This silently disables all logic gated behind the initialization success check in _handleHeartbeatResponse().
Root Cause
The disagg standby's processHeartbeatV1() (in replication_coordinator_disagg_heartbeats.cpp:327-458) builds a minimal heartbeat response containing only:
{"ok": 1, "term": <N>, "opTime": {...}, "wallTime": <date>}
The standard ReplSetHeartbeatResponse::initialize() (in repl_set_heartbeat_response.cpp:132) requires fields that disagg never provides:
| Required Field | Present in Disagg Response? | Fails At |
|---|---|---|
| durableOpTime | No | Line 174 — hard failure, returns error |
| durableWallTime | No | Line 181 (never reached) |
| writtenOpTime | No | Line 212 (would fall back to applied — OK) |
| state | No | Line 236 (optional — OK) |
The parser hits durableOpTime first and returns Status(NoSuchKey) immediately.
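The fail-fast behavior can be illustrated with a small standalone sketch. It is not the real parser: `Doc`, `firstMissingRequiredField`, and `makeDisaggStyleResponse` are hypothetical names, the BSON document is modeled as a plain string map, and only the two hard-required fields from the table are checked, in the order the table lists them.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a BSON document: field name -> raw value.
using Doc = std::map<std::string, std::string>;

// Returns the first required field that is absent, or nullopt if all are present.
// The order mirrors the table above; this is an assumption, not the real parser.
std::optional<std::string> firstMissingRequiredField(const Doc& doc) {
    static const std::vector<std::string> kRequiredInParseOrder = {
        "durableOpTime", "durableWallTime"};
    for (const auto& name : kRequiredInParseOrder) {
        if (doc.find(name) == doc.end()) {
            return name;  // fail fast: later fields are never examined
        }
    }
    return std::nullopt;
}

// Builds the minimal response shape the disagg standby is described as sending.
Doc makeDisaggStyleResponse() {
    return Doc{{"ok", "1"}, {"term", "3"}, {"opTime", "{...}"}, {"wallTime", "<date>"}};
}

// Usage: the minimal response trips on durableOpTime before anything else.
void exampleFailFast() {
    auto missing = firstMissingRequiredField(makeDisaggStyleResponse());
    assert(missing.has_value() && *missing == "durableOpTime");
}
```

Adding only `durableOpTime` to the modeled response would move the failure to `durableWallTime`, which is why the table marks that line as never reached today.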
Evidence
Primary mongod log (every 2 seconds, throughout the entire test run):
{"s":"D2", "c":"REPL_HB", "id":11222004, "msg":"Encountered error when initializing heartbeat response",
"attr":{"error":{"code":4,"codeName":"NoSuchKey","errmsg":"Missing expected field \"durableOpTime\""}}}
150 occurrences observed during a 5-minute YCSB test — i.e., every single heartbeat.
Impact
1. Term Check + Step-Down Logic is Dead (Correctness Risk)
Lines 263-283 of _handleHeartbeatResponse() contain term comparison and step-down logic:
if (responseInitializationStatus.isOK()) {  // <-- always false
    auto heartbeatTerm = resp.getIntField("term");
    auto currentTerm = getTerm();
    if (heartbeatTerm > currentTerm) {
        updateTerm(heartbeatTerm);
        setCurrentPrimaryIndex(resp.getIntField("primaryId"));  // "primaryId" is also missing from response
        _stepDown(...);
    }
}
The primary can never detect a higher term from the standby's heartbeat response, so even if the standby reports a higher term, the primary will not step down via this path.
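Note: resp.getIntField("primaryId") would not yield a valid member index either, since the standby response doesn't include primaryId. BSONObj::getIntField() returns a sentinel for a missing field (INT_MIN per the comment in bsonobj.h, as far as I can tell, rather than 0), so the primary index would be set incorrectly.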
Severity assessment: In practice, term changes in disagg are likely handled through other paths. However, this path being dead is a latent correctness risk if disagg ever relies on heartbeat response term checking for step-down.
2. Any Future Code Using hbResponse Will Silently Fail
Any code added inside the if (responseInitializationStatus.isOK()) block, or any code using the hbResponse object, will be dead code.
AI Suggested Fix Options
Option A: Make the term check bypass initialize() (Minimal)
Move the term check outside the if (responseInitializationStatus.isOK()) block and read from raw resp BSON (it already does this — resp.getIntField("term")). Remove the primaryId reference (not available in disagg response) or add it to the standby's response.
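A minimal standalone sketch of the Option A idea, under stated assumptions: `RawResponse`, `readIntField`, and `shouldStepDown` are hypothetical names, and the raw BSON response is modeled as an integer map. The point is only that the decision depends on the raw "term" field and not on whether full initialization succeeded, and that it never reads "primaryId".

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for the raw heartbeat response: field name -> integer value.
using RawResponse = std::map<std::string, long long>;

// Reads an integer field, making "absent" explicit instead of relying on a default.
std::optional<long long> readIntField(const RawResponse& resp, const std::string& name) {
    auto it = resp.find(name);
    if (it == resp.end()) {
        return std::nullopt;
    }
    return it->second;
}

// Decides whether the primary should step down, using only the raw response.
// Deliberately independent of the initialization status and of "primaryId".
bool shouldStepDown(const RawResponse& resp, long long currentTerm) {
    auto heartbeatTerm = readIntField(resp, "term");
    return heartbeatTerm.has_value() && *heartbeatTerm > currentTerm;
}

// Usage: a higher term triggers step-down, an equal term does not.
void exampleTermCheck() {
    RawResponse fromStandby{{"ok", 1}, {"term", 7}};
    assert(shouldStepDown(fromStandby, 5));
    assert(!shouldStepDown(fromStandby, 7));
}
```

Treating a missing "term" as "no step-down" is a design choice in this sketch; the real code would need to decide how a malformed response should be handled.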
Option B: Add missing fields to disagg heartbeat response (Proper)
In processHeartbeatV1(), set the additional fields that initialize() requires:
response->setDurableOpTimeAndWallTime(getMyLastDurableOpTimeAndWallTime());
response->setWrittenOpTimeAndWallTime(getMyLastWrittenOpTimeAndWallTime());
This would make initialize() succeed, but requires determining what values are semantically correct for durableOpTime and writtenOpTime in disagg context (these concepts may not map cleanly).
Option C: Disagg-specific initialize() (Longer-term)
Create a lighter-weight parser that only requires the fields disagg actually provides. This avoids polluting the response with fields that have no disagg-specific meaning.
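One possible shape for Option C, as a rough standalone sketch: `DisaggHeartbeatResponse` and `parseDisaggHeartbeat` are hypothetical names, op times are simplified to single integers, and the raw response is again modeled as a map. Only the fields the standby is described as sending are mandatory; `durableOpTime` is parsed if present and left empty otherwise.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for the raw BSON heartbeat response.
using RawResponse = std::map<std::string, long long>;

// Hypothetical parsed form: only what the disagg standby sends is mandatory.
struct DisaggHeartbeatResponse {
    long long term = 0;
    long long appliedOpTime = 0;             // simplified to a single integer here
    std::optional<long long> durableOpTime;  // absent in disagg responses
};

// Returns nullopt only when a field the standby is expected to send is missing.
std::optional<DisaggHeartbeatResponse> parseDisaggHeartbeat(const RawResponse& resp) {
    auto term = resp.find("term");
    auto opTime = resp.find("opTime");
    if (term == resp.end() || opTime == resp.end()) {
        return std::nullopt;
    }

    DisaggHeartbeatResponse out;
    out.term = term->second;
    out.appliedOpTime = opTime->second;
    if (auto durable = resp.find("durableOpTime"); durable != resp.end()) {
        out.durableOpTime = durable->second;
    }
    return out;
}

// Usage: the minimal disagg-style response parses, with durableOpTime left unset.
void exampleLenientParse() {
    RawResponse fromStandby{{"ok", 1}, {"term", 3}, {"opTime", 100}};
    auto parsed = parseDisaggHeartbeat(fromStandby);
    assert(parsed.has_value());
    assert(parsed->term == 3 && !parsed->durableOpTime.has_value());
}
```

Keeping the absent fields as `std::optional` forces callers to handle the missing case explicitly instead of reading a default value.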