Type: Bug
Resolution: Duplicate
Priority: Critical - P2
Fix Version/s: None
Affects Version/s: 3.2.10
Component/s: Replication
Labels: None
Operating System: ALL
Case:
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

We had a weekend full of network issues, which caused data replication across our topology to become stale (slaves lagging behind masters, etc.).

All lag and discrepancies were settled shortly after the network issues were addressed (in databases such as MySQL and Cassandra).

A couple of our MongoDB replica sets in production, however, failed to recover from the network outages, as can be seen in the enclosed log. On one of them specifically, there were several attempts at primary re-election, followed by attempts to connect to other members that failed because the network was unavailable. This triggered a background sync rollback, which failed after several attempts and eventually resulted in the fatal assertion message shown below.
This forced us to restart the primary and any other failing nodes in order to restore normal replication across the replica set.

2016-12-02T21:37:35.681+0000 I -        [rsBackgroundSync] Fatal assertion 18750 UnrecoverableRollbackError: need to rollback, but in inconsistent state. minvalid: (term: 21, timestamp: Dec  2 21:37:33:2) > our last optime: (term: 20, timestamp: Dec  2 21:37:32:1)
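
For readers triaging similar log lines, the comparison reported in the assertion can be reproduced with a small script. This is only a sketch: the helper names (`OpTime`, `parse_optimes`) are hypothetical, the regular expression is tailored to the single line quoted above, and the ordering (term first, then timestamp and increment) is an assumption about how the two optimes are compared, not a statement of the server's implementation. The log text carries no year, so the comparison is only meaningful within one calendar year.

```python
import re
from typing import List, NamedTuple

# Month abbreviations as they appear in the log text.
MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


class OpTime(NamedTuple):
    """Hypothetical container; tuple ordering compares term first (assumption)."""
    term: int
    month: int
    day: int
    hour: int
    minute: int
    second: int
    increment: int


# Matches e.g. "(term: 21, timestamp: Dec  2 21:37:33:2)".
OPTIME_RE = re.compile(
    r"\(term: (\d+), timestamp: (\w{3})\s+(\d+) (\d+):(\d+):(\d+):(\d+)\)")


def parse_optimes(log_line: str) -> List[OpTime]:
    """Return every optime found in the line, in order of appearance."""
    found = []
    for m in OPTIME_RE.finditer(log_line):
        term, mon, day, hh, mm, ss, inc = m.groups()
        found.append(OpTime(int(term), MONTHS[mon], int(day),
                            int(hh), int(mm), int(ss), int(inc)))
    return found


LINE = ("2016-12-02T21:37:35.681+0000 I -        [rsBackgroundSync] "
        "Fatal assertion 18750 UnrecoverableRollbackError: need to rollback, "
        "but in inconsistent state. minvalid: (term: 21, timestamp: "
        "Dec  2 21:37:33:2) > our last optime: (term: 20, timestamp: "
        "Dec  2 21:37:32:1)")

minvalid, last_optime = parse_optimes(LINE)
# True here: minvalid is ahead of the last optime, as the message reports.
minvalid_ahead = minvalid > last_optime
```

In this line, `minvalid` is ahead of the node's last optime in both term (21 vs 20) and timestamp, which matches the `>` relation printed in the message.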

As mentioned, this occurred on a couple of mission-critical replica sets, which failed to recover from it.
We suspect this behaviour may be related to the MongoDB release that all of our replica sets run, which is 3.2.10.

We upgraded from 3.2.9 to 3.2.10 after hitting some severe performance-related bugs.
Should we go ahead and upgrade our sets to 3.2.11 to resolve this?
Is there any other analysis you can provide, along with recommended steps to follow?

We really need these replica sets to run stably on MongoDB, like the other resilient database solutions we have, which recovered from this completely.

Thanks for your prompt advice!
Also enclosed is the diagnostics directory for the primary member on which the erroneous behaviour was spotted.

Avi Kalvo
WiX DBA


Attachments:
mongodb_shoutout_subs_diagnostic_data.tar.gz (16.47 MB), Dec 05 2016 01:17:41 PM UTC, Avraham Kalvo
mongodb_shoutout_subs.tar.gz (9.97 MB), Dec 05 2016 01:14:09 PM UTC, Avraham Kalvo

Duplicates: SERVER-25145 During rollback (or w/minvalid invalid) select sync source based on end OpTime (Closed)

Assignee: Unassigned
Reporter: Avraham Kalvo
Participants: Avraham Kalvo, Kelsey Schubert, Ramon Fernandez
Votes: 0
Watchers: 6

Created: Dec 05 2016 01:19:18 PM UTC
Updated: Jul 17 2017 03:40:30 PM UTC
Resolved: Dec 06 2016 10:08:41 PM UTC
