Type: Bug
Resolution: Won't Fix
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Replication
Labels:
None

Assigned Teams:

Replication
Operating System:
ALL
Sprint:
Repl 2026-03-30
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

We are reporting a recovery loop encountered during a database engine upgrade. The node, previously a "delayed node" in a Production cluster and now a single-node replica set, remained stuck in the RECOVERING state.

The root cause was identified as a stale consistency marker in the local.replset.minvalid collection, dating back to February 2025, which conflicted with the current 2026 oplog. The instance was repurposed for an Archive cluster in October, but the metadata from its previous role does not appear to have been cleared or invalidated automatically during the transition.
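To confirm a mismatch like this on a node, the marker and the oplog bounds can be read side by side. The following is a minimal sketch, not the procedure the reporter used. It assumes a PyMongo-style client connected directly to the affected node, that the marker document carries a `ts` timestamp field, and that the oplog lives in `local.oplog.rs`. The helper name `describe_marker` is hypothetical.

```python
from datetime import datetime, timezone


def _to_utc(ts):
    """Convert a BSON-style timestamp (any object with .time in epoch seconds) to a UTC datetime."""
    return datetime.fromtimestamp(ts.time, tz=timezone.utc)


def describe_marker(client):
    """Return the minvalid timestamp and the oplog's first and last timestamps.

    `client` is assumed to behave like pymongo.MongoClient, for example
    MongoClient("mongodb://localhost:27017/?directConnection=true").
    """
    local = client["local"]
    # Assumption: the marker collection holds a single document with a `ts` field.
    marker = local["replset.minvalid"].find_one()
    # $natural order returns the oldest (1) or newest (-1) oplog entry.
    first = local["oplog.rs"].find_one(sort=[("$natural", 1)])
    last = local["oplog.rs"].find_one(sort=[("$natural", -1)])
    return {
        "minvalid": _to_utc(marker["ts"]) if marker else None,
        "oplog_first": _to_utc(first["ts"]) if first else None,
        "oplog_last": _to_utc(last["ts"]) if last else None,
    }
```

A `minvalid` value from 2025 next to `oplog_first` and `oplog_last` values from 2026 would match the situation described in this report.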

Steps to Reproduce

1. Take a "delayed node" from Cluster A.
2. Decommission it and move it to Cluster B (Archive) without a full wipe of the local database.
3. Perform a database engine upgrade or a full service restart.

Result: the node enters an infinite recovery loop trying to satisfy the minvalid timestamp from the old cluster.

Observed Behavior

The database engine fails to reconcile the minvalid marker (2025) with the current oplog (2026). Instead of identifying the marker as obsolete or raising a fatal error with a clear cleanup instruction, the node remains in the RECOVERING state indefinitely.
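As an illustration of the fail-fast behavior requested here, the sketch below raises an error with a cleanup hint when the marker is older than the oldest retained oplog entry. This criterion is an assumed heuristic for the example, not the server's actual recovery logic, and the names `StaleMarkerError` and `check_marker` are hypothetical. The dates are placeholders chosen to fall in February 2025 and in 2026.

```python
from datetime import datetime, timezone


class StaleMarkerError(RuntimeError):
    """Raised when the consistency marker predates the oldest retained oplog entry."""


def check_marker(minvalid, oplog_first):
    """Fail fast if `minvalid` is older than `oplog_first` (both timezone-aware datetimes)."""
    if minvalid < oplog_first:
        raise StaleMarkerError(
            f"minvalid marker ({minvalid:%Y-%m-%d}) predates the oldest oplog entry "
            f"({oplog_first:%Y-%m-%d}); the marker may be left over from a previous "
            "replica set and may need manual cleanup"
        )


# Illustrative values only: a marker from February 2025 against an oplog starting in 2026.
try:
    check_marker(
        datetime(2025, 2, 1, tzinfo=timezone.utc),
        datetime(2026, 1, 1, tzinfo=timezone.utc),
    )
except StaleMarkerError as exc:
    print(exc)
```

An explicit error of this kind would give an operator something to act on, where a node that stays in RECOVERING does not.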

Workaround/Fix Applied

We recovered the node by:

1. Restarting the instance in standalone mode (bypassing the replica set configuration).
2. Manually dropping the local.replset.minvalid collection.
3. Restarting with the original configuration.

After these steps, the node correctly elected itself as Primary.
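The drop step of the workaround above could be scripted roughly as follows. This is a hedged sketch rather than the exact commands the reporter ran. It assumes a PyMongo-style client connected directly to the node, and it assumes that a node started as a standalone reports no `setName` field in its `hello` reply. The helper name `drop_stale_marker` is hypothetical.

```python
def drop_stale_marker(client):
    """Drop local.replset.minvalid, but only when the node is running as a standalone.

    `client` is assumed to behave like pymongo.MongoClient connected directly to the node.
    Returns the documents that were in the collection so the caller can keep a copy.
    """
    hello = client["admin"].command("hello")
    # Assumption: a node started without its replica set configuration reports no setName.
    if "setName" in hello:
        raise RuntimeError(
            "node is still running as a replica set member; restart it standalone first"
        )
    marker = client["local"]["replset.minvalid"]
    # Read the marker before dropping so it can be archived with the incident notes.
    backup = list(marker.find())
    marker.drop()
    return backup
```

The guard is there because the workaround depends on the node being out of its replica set role when the collection is dropped. Returning the old documents keeps a record of the stale marker for later analysis.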

Assignee: Pierre Turin
Reporter: Omry Hassan
Participants: Omry Hassan, Pierre Turin
Votes: 0
Watchers: 6

Created: Jan 28 2026 09:25:31 AM UTC
Updated: Mar 17 2026 08:47:32 PM UTC
Resolved: Mar 17 2026 08:47:32 PM UTC
