Priority: Critical - P2
Affects Version/s: 3.2.10
We had a weekend full of network issues, which caused data replication all over our topology to become stale (slaves were lagging behind masters, etc.).
In our other databases, such as MySQL and Cassandra, all lags and discrepancies were settled shortly after the network issues were addressed.
A couple of our mongo replica sets in production, however, failed to recover from the network outages, as can be seen in the enclosed log. For one of them specifically, there were several attempts at re-electing a primary, followed by attempts to connect to other members, which failed because the network was unavailable. This triggered a background sync rollback, which failed after several attempts
and eventually resulted in the above-mentioned Fatal assertion message.
This forced us to restart the primary and any other failing nodes in order to restore normal replication across the replica set.
As mentioned before, this occurred on a couple of mission-critical replica sets, which failed to recover on their own.
We suspect this behaviour may be related to the mongo release that all of our replica sets are running, which is 3.2.10.
We upgraded from 3.2.9 to 3.2.10 after hitting some severe performance-related bugs.
Should we go ahead and upgrade our sets to 3.2.11 to resolve this?
Is there any further analysis you can provide, along with recommended steps to follow?
We really need these replica sets to run stably on mongo, just like the other resilient database solutions we have, which recovered from this outage completely.
Thanks for your prompt advice!
Also enclosed is the diagnostics directory for the primary member on which the erroneous behaviour was spotted.