MongoDB stuck in RECOVERING loop due to stale minvalid marker after role change

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Replication
    • Replication
    • ALL
    • Repl 2026-03-30

      We are reporting a recovery loop encountered during a database engine upgrade. The node, previously a "delayed node" in a Production cluster and now a single-node replica set, remained stuck in the RECOVERING state.

      The root cause was identified as a stale consistency marker in the local.replset.minvalid collection, dating back to February 2025, which conflicted with the current 2026 oplog. The instance was repurposed for an Archive cluster in October, but the metadata from its previous role was apparently not cleared or invalidated automatically during the transition.
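
To confirm a mismatch like this, the marker and the newest oplog entry can be read side by side. The following is a minimal, read-only sketch, not part of the original report. It assumes `local_db` behaves like a PyMongo `Database` handle for `local`, and that the marker document stores its timestamp under a `ts` field; `read_markers` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def _to_datetime(ts):
    """Convert a BSON Timestamp-like value (with a .time attribute in seconds) to UTC."""
    return datetime.fromtimestamp(ts.time, tz=timezone.utc)


def read_markers(local_db):
    """Return (minvalid_time, newest_oplog_time) as UTC datetimes, or None where absent.

    `local_db` is expected to behave like a PyMongo Database for `local`.
    """
    minvalid = local_db["replset.minvalid"].find_one()
    # Reverse natural order returns the most recently written oplog entry.
    newest = local_db["oplog.rs"].find_one(sort=[("$natural", -1)])
    # Assumption: the marker document keeps its timestamp under "ts".
    mv = _to_datetime(minvalid["ts"]) if minvalid and "ts" in minvalid else None
    op = _to_datetime(newest["ts"]) if newest else None
    return mv, op
```

With PyMongo, `local_db` could be obtained as `MongoClient(uri, directConnection=True)["local"]`, where `uri` points at the affected node. A gap of a year or more between the two values would match the situation described here.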

      Steps to Reproduce

      1. Take a "delayed node" from Cluster A.
      2. Decommission it and move it to Cluster B (Archive) without a full wipe of the local database.
      3. Perform a database engine upgrade or a full service restart.
      4. The node enters an infinite recovery loop trying to satisfy the minvalid timestamp from the old cluster.

      Observed Behavior

      The database engine fails to reconcile the minvalid marker (2025) with the current oplog (2026). Instead of identifying the marker as obsolete or triggering a fatal error with a clear cleanup instruction, the node remains in the RECOVERING state indefinitely.
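
As an illustration of the "fail with a clear cleanup instruction" behaviour asked for above, here is a sketch of an operator-side pre-restart check. It is not how the server behaves. The `max_gap` threshold, the `StaleMarkerError` class, and `check_marker` are all hypothetical, and an old marker is not necessarily wrong on its own, so this is only a heuristic for nodes that have changed roles.

```python
from datetime import datetime, timedelta, timezone


class StaleMarkerError(RuntimeError):
    """Raised when the minvalid marker looks like a leftover from an earlier role."""


def check_marker(minvalid_time, newest_oplog_time, max_gap=timedelta(days=30)):
    """Raise StaleMarkerError if the marker trails the newest oplog entry by more than max_gap.

    Both arguments are timezone-aware datetimes, or None when the value is missing,
    in which case the check is skipped.
    """
    if minvalid_time is None or newest_oplog_time is None:
        return
    gap = newest_oplog_time - minvalid_time
    if gap > max_gap:
        raise StaleMarkerError(
            f"minvalid marker ({minvalid_time:%Y-%m-%d}) is {gap.days} days older "
            f"than the newest oplog entry ({newest_oplog_time:%Y-%m-%d}); it may be "
            "left over from a previous role and may need manual cleanup"
        )
```

Running a check like this before an upgrade or restart would surface the mismatch up front, rather than leaving it to be discovered once the node is already stuck in RECOVERING.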

      Workaround/Fix Applied

      We successfully recovered the node by:

      1. Restarting the instance in standalone mode (bypassing the replica set configuration).
      2. Dropping the local.replset.minvalid collection manually.
      3. Restarting with the original configuration.

      Upon performing these steps, the node correctly elected itself as Primary.
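
The drop step in the workaround above can be sketched as follows. This is a cautious outline under stated assumptions, not a verified procedure: `client` is assumed to behave like a PyMongo `MongoClient` connected directly to the node, and the absence of `setName` in the `hello` reply is used as the signal that the node is running standalone. `is_standalone` and `drop_minvalid` are hypothetical helper names.

```python
def is_standalone(hello_reply):
    """Treat a node as standalone when its hello reply carries no replica set name."""
    return "setName" not in hello_reply


def drop_minvalid(client, dry_run=True):
    """Return the documents in local.replset.minvalid and drop the collection unless dry_run.

    Refuses to touch the `local` database while the node runs as a replica set member.
    """
    hello = client["admin"].command("hello")
    if not is_standalone(hello):
        raise RuntimeError(
            "refusing to modify 'local': node is running as a replica set member"
        )
    coll = client["local"]["replset.minvalid"]
    # Keep a copy of the marker documents so the caller can log or archive them.
    saved = list(coll.find())
    if not dry_run:
        coll.drop()
    return saved
```

Calling `drop_minvalid(client)` first, with the default `dry_run=True`, shows what would be removed; passing `dry_run=False` performs the drop. Restarting in standalone mode beforehand and with the original configuration afterwards remain manual steps.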

            Assignee: Pierre Turin
            Reporter: Omry Hassan
            Votes: 0
            Watchers: 6