- Type: Bug
- Resolution: Won't Fix
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Replication
- ALL
- Repl 2026-03-30
We are reporting a recovery loop issue encountered during a database engine upgrade. The node (previously a "delayed node" in a Production cluster, now a single-node replica set) remained stuck in the RECOVERING state.
The root cause was identified as a stale consistency marker in the local.replset.minvalid collection, dating back to February 2025, which conflicted with the current (2026) oplog. The instance was repurposed for an Archive cluster in October, but the metadata from its previous role was apparently not cleared or invalidated during the transition.
Steps to Reproduce
- Take a "delayed node" from Cluster A.
- Decommission it and move it to Cluster B (Archive) without a full wipe of the local database.
- Perform a database engine upgrade or a full service restart.
- On restart, the node enters an infinite recovery loop trying to satisfy the minvalid timestamp from the old cluster.
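The stuck state can be confirmed from the shell. A minimal check, assuming mongosh access to the affected node (the port is a placeholder for this environment):

```shell
# A node stuck in the recovery loop keeps reporting stateStr "RECOVERING"
# in rs.status(), and never transitions to SECONDARY or PRIMARY.
mongosh --port 27017 --quiet --eval '
  printjson(rs.status().members.map(m => ({ name: m.name, stateStr: m.stateStr })))
'
```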
Observed Behavior
The database engine fails to reconcile the minvalid marker (2025) with the current oplog (2026). Instead of identifying the marker as obsolete or triggering a fatal error with a clear cleanup instruction, the node remains indefinitely in the RECOVERING state.
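The stale marker itself can be inspected directly in the local database; a sketch, again assuming mongosh access to the affected node (port is a placeholder):

```shell
# Dump the consistency marker. On the affected node, the "ts" field
# carried a Timestamp from February 2025 while the oplog head was at 2026.
mongosh --port 27017 --quiet --eval '
  printjson(db.getSiblingDB("local").replset.minvalid.findOne())
'
```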
Workaround/Fix Applied
We successfully recovered the node by:
- Restarting the instance in standalone mode (omitting the replica set configuration).
- Dropping the local.replset.minvalid collection manually.
- Restarting with the original configuration.
Upon performing these steps, the node correctly elected itself as Primary.
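The recovery steps above can be sketched as follows; the dbpath, port, log paths, and replica set name are placeholders for this environment, not values from the affected node:

```shell
# 1. Restart mongod in standalone mode: same dbpath, but no --replSet flag.
mongod --dbpath /data/db --port 27017 --fork --logpath /data/db/standalone.log

# 2. Drop the stale consistency marker while the node is standalone.
mongosh --port 27017 --eval 'db.getSiblingDB("local").replset.minvalid.drop()'

# 3. Shut down cleanly, then restart with the original replica set configuration.
mongosh --port 27017 --eval 'db.getSiblingDB("admin").shutdownServer()'
mongod --dbpath /data/db --port 27017 --replSet archiveRS --fork --logpath /data/db/mongod.log
```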