Project: Core Server
Issue: SERVER-13573

Retry rollback FindCommonPoint before failing (and fasserting)


Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: None
    • Fix Version/s: 2.4.10, 2.6.0
    • Component/s: Replication
    • Labels: None
    • Team: Replication
    • Operating System: ALL

    Description

      We observed in production an example of a replica set node going into a FATAL state as a result of a failed oplog query against an inaccessible primary node during the rollback 2 FindCommonPoint phase.

      There may be other replication failure paths that result in a FATAL state, but this is one we have observed in production.

      FATAL node logs:

      [rsBackgroundSync] replSet rollback 2 FindCommonPoint
      [rsBackgroundSync] DBClientCursor::init call() failed
      [rsBackgroundSync] replSet remote oplog empty or unreadable
      [rsBackgroundSync] replSet error fatal, stopping replication

      Primary replica set node relinquishing its PRIMARY status:

      [rsMgr] replSet relinquishing primary state
      [rsMgr] replSet SECONDARY
      [rsMgr] replSet closing client sockets after relinquishing primary
      (fatal node tries unsuccessfully to query this node's oplog while primary is closing client connections)

      Health Poll logs on non-FATAL node in same replica set:

      [rsHealthPoll] replSet member (fatal node hostname:port) is now in state FATAL

      If this failure could be handled more gracefully, for example by retrying the oplog query before fasserting, the node might avoid going into a FATAL state.


      People

        Assignee: backlog-server-repl (Backlog - Replication Team)
        Reporter: benety.goh@mongodb.com (Benety Goh)
        Votes: 1
        Watchers: 8
