[SERVER-13573] Retry rollback FindCommonPoint before failing (and fasserting)
Created: 14/Apr/14  Updated: 06/Dec/22  Resolved: 11/Jun/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.10, 2.6.0
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Benety Goh
Assignee: Backlog - Replication Team
Resolution: Done
Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-18035 Data Replicator: Refactor Rollback Code Closed
related to SERVER-5930 rollback loop should be smarter Closed
is related to SERVER-15089 Thread applier (bgsync) through repli... Closed
Assigned Teams:
Replication
Operating System: ALL
Participants:
Case:

Description

In production, we observed a replica set node entering the FATAL state after its oplog query against an inaccessible primary node failed during the rollback 2 FindCommonPoint phase.

Other replication failures may also result in FATAL, but this is the one instance we have observed in production.

FATAL node logs:

[rsBackgroundSync] replSet rollback 2 FindCommonPoint
[rsBackgroundSync] DBClientCursor::init call() failed
[rsBackgroundSync] replSet remote oplog empty or unreadable
[rsBackgroundSync] replSet error fatal, stopping replication

Primary replica set node relinquishing its PRIMARY status:

[rsMgr] replSet relinquishing primary state
[rsMgr] replSet SECONDARY
[rsMgr] replSet closing client sockets after relinquishing primary
(The FATAL node tries unsuccessfully to query this node's oplog while the primary is closing client connections.)

Health poll logs on a non-FATAL node in the same replica set:

[rsHealthPoll] replSet member (fatal node hostname:port) is now in state FATAL

If this case can be handled more gracefully, for example by retrying the FindCommonPoint oplog query before failing, the node might avoid going into a FATAL state.
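
As a rough illustration of the retry idea in the issue title, the sketch below wraps a remote query in a bounded retry loop. It is not the server's rollback code: `TransientQueryError`, `retry_find_common_point`, and the `attempts`/`delay_secs` parameters are all hypothetical names chosen for this example, and the real fix may look quite different.

```python
import time


class TransientQueryError(Exception):
    """Hypothetical stand-in for a failed remote oplog query."""


def retry_find_common_point(query, attempts=3, delay_secs=1.0, sleep=time.sleep):
    """Call query() up to `attempts` times before giving up.

    `query` is any zero-argument callable (an assumption for this sketch).
    Only TransientQueryError is retried; the last error is re-raised if
    every attempt fails, so the caller still decides whether to go FATAL.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return query()
        except TransientQueryError as exc:
            last_error = exc
            # Wait between attempts, but not after the final one.
            if attempt < attempts:
                sleep(delay_secs)
    raise last_error
```

With a wrapper along these lines, a sync source that is briefly unreachable (for instance while it closes client sockets after stepping down) would cost a few retries rather than an immediate fatal error. A source that stays unreachable would still surface the original failure once the attempts are exhausted.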

