[SERVER-27277] [rsBackgroundSync] Fatal assertion 18750 UnrecoverableRollbackError on numerous 3.2.10 replica sets Created: 05/Dec/16  Updated: 17/Jul/17  Resolved: 06/Dec/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.10
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Avraham Kalvo Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongodb_shoutout_subs.tar.gz     File mongodb_shoutout_subs_diagnostic_data.tar.gz    
Issue Links:
Duplicate
duplicates SERVER-25145 During rollback (or w/minvalid invali... Closed
Related
Operating System: ALL
Participants:
Case:

 Description   

We had a weekend full of network issues which caused data replication across our topology to become stale (slaves were lagging behind masters, etc.)

All lags and discrepancies settled shortly after the network issues were addressed (in databases such as MySQL and Cassandra).

A couple of our MongoDB replica sets in production, however, failed to recover from the network outages. As can be seen in the enclosed log for one of them, there were several attempts to re-elect a primary, followed by attempts to reconnect to the other members, which failed due to network unavailability and triggered a background sync rollback that itself failed after several attempts.
This eventually resulted in the fatal assertion message shown below, forcing us to restart the primary and any other failing nodes in order to restore normal replication across the replica set.

2016-12-02T21:37:35.681+0000 I -        [rsBackgroundSync] Fatal assertion 18750 UnrecoverableRollbackError: need to rollback, but in inconsistent state. minvalid: (term: 21, timestamp: Dec  2 21:37:33:2) > our last optime: (term: 20, timestamp: Dec  2 21:37:32:1)
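
For reference (an editor's addition, not part of the original report): a minimal sketch, assuming a recent pymongo and direct access to the affected member, of how an operator could compare the minvalid document against the node's last applied oplog entry (the two optimes this assertion compares).

    # Sketch only: compares local.replset.minvalid to the last applied oplog entry.
    # Host/port are placeholders; requires a recent pymongo (directConnection kwarg).
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017", directConnection=True)
    local = client.local

    # In 3.2 the minvalid document records the optime ("ts"/"t") the node must
    # reach before its data is considered consistent.
    minvalid = local["replset.minvalid"].find_one()

    # The newest oplog entry is the node's last applied optime.
    last_op = local["oplog.rs"].find_one(sort=[("$natural", -1)])

    print("minvalid optime:     ts=%s t=%s" % (minvalid.get("ts"), minvalid.get("t")))
    print("last applied optime: ts=%s t=%s" % (last_op["ts"], last_op["t"]))

    # If minvalid is ahead of the last applied optime when a rollback is required,
    # a 3.2 node cannot roll back and raises fatal assertion 18750.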

As mentioned above, this occurred on a couple of mission-critical replica sets, which failed to recover on their own.
We suspect this behaviour may be related to the MongoDB release that all of our replica sets are running, 3.2.10.

We upgraded from 3.2.9 to 3.2.10 after hitting some harsh performance-related bugs.
Shall we go ahead and upgrade our sets to 3.2.11 to resolve this?
Is there any further analysis you can provide, and recommended steps for us to follow?

We really need these replica sets to be stable on MongoDB, just like the other resilient database solutions we run, which recovered from this incident completely.

Thanks for your prompt advice!
Enclosed as well is the diagnostics directory for that primary member in which the erroneous behaviour was spotted.

Avi Kalvo
WiX DBA



 Comments   
Comment by Kelsey Schubert [ 06/Dec/16 ]

Hi avrahamk,

While there have been a number of performance improvements in 3.2.11 that may impact the behavior you are observing, we cannot point to a particular ticket that would resolve the issue described in SERVER-27132.

I've examined the logs in more detail and am confident that you are hitting SERVER-25145, so I will be closing this ticket as a duplicate. If you encounter this issue again after upgrading, please open a new ticket and we will continue to investigate.

Thank you,
Thomas

Comment by Avraham Kalvo [ 06/Dec/16 ]

Thanks Ramon,

Can you also confirm that 3.2.11 will resolve the performance issues we reported on another ticket with you, SERVER-27132?

Thanks,
Avi K

Comment by Ramon Fernandez Marina [ 05/Dec/16 ]

avrahamk, the log line you put in the description seems to indicate that you're running into SERVER-25145, which was fixed in 3.2.11, so I'd recommend upgrading to 3.2.11 and reporting back.

Thanks,
Ramón.
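
For reference (an editor's addition, not part of the comment thread): a minimal sketch, assuming pymongo and placeholder host names, of how to confirm that every member reports the expected binary version after upgrading to 3.2.11 as recommended above.

    # Sketch only: host names, replica set name, and the target version are placeholders.
    from pymongo import MongoClient

    EXPECTED = "3.2.11"

    # Ask any member for the replica set status to enumerate the members.
    client = MongoClient("mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0")
    status = client.admin.command("replSetGetStatus")

    for member in status["members"]:
        host = member["name"]
        # Connect to each member directly and read its build version.
        node = MongoClient("mongodb://%s" % host, directConnection=True)
        version = node.admin.command("buildInfo")["version"]
        print(host, version, "OK" if version == EXPECTED else "MISMATCH")
        node.close()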
