|
In production, we observed a replica set node entering the FATAL state after its oplog query against an inaccessible primary node failed during the rollback 2 FindCommonPoint phase.
Other replication failures may also result in FATAL; this is the one case we have observed.
FATAL node logs:
[rsBackgroundSync] replSet rollback 2 FindCommonPoint
|
[rsBackgroundSync] DBClientCursor::init call() failed
|
[rsBackgroundSync] replSet remote oplog empty or unreadable
|
[rsBackgroundSync] replSet error fatal, stopping replication
|
Primary replica set node relinquishing its PRIMARY status:
[rsMgr] replSet relinquishing primary state
|
[rsMgr] replSet SECONDARY
|
[rsMgr] replSet closing client sockets after relinquishing primary
|
(The FATAL node tries, unsuccessfully, to query this node's oplog while this node, the former primary, is closing client connections.)
|
Health poll logs on a non-FATAL node in the same replica set:
[rsHealthPoll] replSet member (fatal node hostname:port) is now in state FATAL
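For anyone trying to spot the same condition without reading logs, the following is a minimal sketch that scans a `replSetGetStatus`-style document for members reporting `FATAL`. It assumes the document has already been fetched (for example via `rs.status()`) and relies only on the `members`, `name`, and `stateStr` fields; the hostnames and the helper name `fatal_members` are hypothetical.

```python
def fatal_members(status):
    """Return the names of members whose stateStr is FATAL.

    `status` is assumed to be a dict shaped like the output of
    replSetGetStatus: {"members": [{"name": ..., "stateStr": ...}, ...]}.
    """
    return [
        member.get("name", "<unknown>")
        for member in status.get("members", [])
        if member.get("stateStr") == "FATAL"
    ]


# Hypothetical sample document; hostnames are placeholders.
sample_status = {
    "set": "rs0",
    "members": [
        {"name": "node-a.example:27017", "stateStr": "SECONDARY"},
        {"name": "node-b.example:27017", "stateStr": "FATAL"},
        {"name": "node-c.example:27017", "stateStr": "PRIMARY"},
    ],
}

print(fatal_members(sample_status))
```

Matching on `stateStr` rather than the numeric `state` avoids depending on state codes that may differ between server versions.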
|
If this case can be handled more gracefully, it may be possible to avoid entering the FATAL state.
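One possible reading of "more gracefully" is to treat a failed oplog query as transient and retry a bounded number of times before giving up. The sketch below illustrates that idea only; it is not the server's rollback code, and `RemoteOplogUnreadable`, `query_with_retry`, and the retry parameters are hypothetical names chosen for this example.

```python
import time


class RemoteOplogUnreadable(Exception):
    """Hypothetical error: the sync source's oplog could not be read."""


def query_with_retry(query, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call `query` until it succeeds or `attempts` is exhausted.

    Only RemoteOplogUnreadable is treated as transient. After the last
    failed attempt the error is re-raised, at which point the caller
    could decide to give up (the current behavior is to go FATAL).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return query()
        except RemoteOplogUnreadable as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay_seconds)  # wait before retrying
    raise last_error


# Usage: a stand-in query that fails twice (sync source still closing
# client sockets) and then succeeds.
calls = {"count": 0}


def flaky_oplog_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RemoteOplogUnreadable("remote oplog empty or unreadable")
    return "common point found"


result = query_with_retry(flaky_oplog_query, attempts=3, delay_seconds=0)
print(result, "after", calls["count"], "calls")
```

A bounded retry keeps the existing failure path intact for a sync source that stays unreachable, while tolerating the short window in which a stepping-down primary closes its client sockets.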
|