Type: Bug
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: None
Replication
ALL
200
Bug
The disagg OplogProvider ships oplog entries to the SLS log server. After a yield or on step-up it rebuilds its cursor and seeks back to its last read position via seekExact(lastRecordIdRead), asserting the record is found (oplog_provider.cpp#L296).
Oplog truncation runs independently of the provider's read position. If truncation removes the entry at lastRecordIdRead, seekExact returns none, the assertion fails, and the node crashes.
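The failure sequence can be illustrated with a toy model. This is only a sketch: ToyOplog, truncateUpTo, and resumeSurvivesTruncation are hypothetical names invented for this example, record ids are modeled as plain integers, and nothing here is the server's actual record store or cursor API.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Toy stand-in for the oplog: record id -> entry payload, ordered by id.
// Hypothetical model for illustration only.
class ToyOplog {
public:
    void append(std::int64_t recordId, std::string entry) {
        _records.emplace(recordId, std::move(entry));
    }

    // Removes every record with id <= upTo, the way truncation drops old entries.
    void truncateUpTo(std::int64_t upTo) {
        _records.erase(_records.begin(), _records.upper_bound(upTo));
    }

    // Returns the entry only if a record with exactly this id still exists.
    std::optional<std::string> seekExact(std::int64_t recordId) const {
        auto it = _records.find(recordId);
        if (it == _records.end()) {
            return std::nullopt;
        }
        return it->second;
    }

private:
    std::map<std::int64_t, std::string> _records;
};

// Replays the sequence from the ticket: the reader stops at lastRecordIdRead,
// truncation runs while the cursor is released, then the reader tries to resume.
// Returns true if the resume seek still finds its record.
inline bool resumeSurvivesTruncation(std::int64_t lastRecordIdRead,
                                     std::int64_t truncationPoint) {
    ToyOplog oplog;
    for (std::int64_t id = 1; id <= 5; ++id) {
        oplog.append(id, "op" + std::to_string(id));
    }
    oplog.truncateUpTo(truncationPoint);
    return oplog.seekExact(lastRecordIdRead).has_value();
}
```

In this model, the resume succeeds only while the truncation point stays strictly below the last record id read. An assertion on the seek result turns any other ordering into a crash.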
Fix
The likely direction is to prevent truncation from passing the resume point, similar to SERVER-128312 (which bounded truncation via computeTruncationBound()).
A few things to work out:
- What the resume point should anchor on. The provider's local read cursor seems insufficient: on step-up the resume LSN comes from the remote log server (log_server_manager.cpp#L1302-L1309) and is not checked against the local oplog floor. The pin may therefore need to anchor on the SLS last-written LSN and be held on every electable node.
- Whether a pin alone is enough, or whether step-up also needs to handle a missing resume point gracefully (refuse step-up or resync) rather than crash-looping. This probably depends on whether such a bound can be guaranteed on a stepping-up node.
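A rough sketch of the two ideas above, under heavy assumptions: boundTruncation, checkResumePoint, and StepUpDecision are hypothetical names, resume points and the oplog floor are modeled as a single integer id space, and this does not reproduce computeTruncationBound() or any existing step-up code path.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

enum class StepUpDecision { kProceed, kRefuseAndResync };

// Caps a proposed truncation point so the record at the pinned resume point
// survives. Truncation is assumed to remove ids <= the returned value, so the
// highest safe value is one below the pin. With no pin, the proposal is unchanged.
inline std::int64_t boundTruncation(std::int64_t proposed,
                                    std::optional<std::int64_t> resumePin) {
    if (!resumePin) {
        return proposed;
    }
    return std::min(proposed, *resumePin - 1);
}

// Step-up check: the resume id reported by the remote log server must still be
// at or above the oldest id retained locally (the local oplog floor). If it is
// not, report that instead of asserting, so the caller can refuse step-up or resync.
inline StepUpDecision checkResumePoint(std::int64_t remoteResumeId,
                                       std::int64_t localOplogFloor) {
    return remoteResumeId >= localOplogFloor ? StepUpDecision::kProceed
                                             : StepUpDecision::kRefuseAndResync;
}
```

The bound alone only helps if the pin is maintained on every node that might step up. The explicit check covers the case where that guarantee cannot be made.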
is related to: SERVER-128312 Ensure oplog truncation does not pass last metadata checkpoint timestamp (Closed)