[SERVER-27980] Secondary tries to rollback when it lagged too much behind primary Created: 13/Feb/17  Updated: 22/Feb/17  Resolved: 21/Feb/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.4.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Igor Solodovnikov Assignee: Mark Agarunov
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-27403 Consider term and rbid when validatin... Closed
Related
is related to SERVER-28068 Do not go into rollback due to fallin... Closed
Operating System: ALL
Steps To Reproduce:

To reproduce this situation, I used a modified version of the zbigMapReduce.js test in which I changed the specified oplog size from 50MB to 2MB. Then I ran the test this way:

python buildscripts/resmoke.py --dbpathPrefix=~/ramd --executor=sharding --jobs=4 --storageEngine=wiredTiger jstests/sharding/zbigMapReduce.js

where ~/ramd is a 4GB tmpfs partition.
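
As a rough illustration of why shrinking the oplog makes the problem easier to hit: the oplog is a capped collection, so the time a secondary can lag before the entries it needs are overwritten is approximately the oplog size divided by the write rate. The sketch below assumes a made-up insert rate of 1 MB/s; the helper name and the rate are hypothetical and were not measured in this test.

```python
def oplog_window_seconds(oplog_size_mb: float, write_rate_mb_per_s: float) -> float:
    """Approximate how long an oplog entry survives before being overwritten.

    Hypothetical helper: ignores per-entry overhead and assumes a constant
    write rate, so treat the result as an order-of-magnitude estimate.
    """
    if write_rate_mb_per_s <= 0:
        raise ValueError("write rate must be positive")
    return oplog_size_mb / write_rate_mb_per_s


ASSUMED_RATE_MB_PER_S = 1.0  # assumption for illustration only

print(oplog_window_seconds(50, ASSUMED_RATE_MB_PER_S))  # original test setting
print(oplog_window_seconds(2, ASSUMED_RATE_MB_PER_S))   # modified test setting
```

Under that assumed rate, the 2MB oplog leaves the secondary about 25 times less time to catch up than the 50MB one.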

Participants:

 Description   

In a replication scenario with a heavy insert load on the primary node, a secondary is not always able to replicate all inserts in time. The documentation on the RECOVERING state clearly says that in this situation the secondary should transition to RECOVERING and that manual intervention is required:

Due to overload, a secondary may fall far enough behind the other members of the replica set such that it may need to resync with the rest of the set. When this happens, the member enters the RECOVERING state and requires manual intervention.

But in reality, the secondary node sometimes tries to roll back in this situation, which leads to a failed rollback and a wrong node state.

I believe the problem is in the BackgroundSync::_produce function in src/mongo/db/repl/bgsync.cpp. When the OplogStartMissing status code is returned by _syncSourceResolver, this function correctly transitions the node to the RECOVERING state. But when the same status code is returned by oplogFetcher, the code executes a rollback without going to the RECOVERING state. I think the oplog fetcher can return OplogStartMissing either when a rollback is necessary or when the secondary has fallen far behind the primary. So there should be an additional check to decide whether a rollback is necessary or the secondary should go to the RECOVERING state.
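
The additional check proposed above could look roughly like the following. This is a minimal sketch of the decision only, written in Python rather than the server's C++, and it is not the actual bgsync.cpp code. The function, the enum, and the idea of comparing the secondary's last fetched timestamp with the oldest timestamp still present in the sync source's oplog are all assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    ROLLBACK = "rollback"
    RECOVERING = "recovering"


def on_oplog_start_missing(last_fetched_ts: int, source_oldest_ts: int) -> Action:
    """Choose a reaction to OplogStartMissing from the oplog fetcher.

    Hypothetical helper, not a MongoDB API.
    last_fetched_ts: timestamp of the last oplog entry this secondary fetched.
    source_oldest_ts: timestamp of the oldest entry left in the source's oplog.
    """
    if last_fetched_ts < source_oldest_ts:
        # The source has already overwritten the entry we need to resume
        # from, so the node is too stale and rollback cannot help.
        return Action.RECOVERING
    # The source's oplog still covers our position, so a missing start
    # entry suggests diverged history, which is the rollback case.
    return Action.ROLLBACK


# A secondary whose last entry predates the source's oldest entry is too stale.
print(on_oplog_start_missing(last_fetched_ts=100, source_oldest_ts=500).value)
# A secondary inside the source's oplog window is a rollback candidate.
print(on_oplog_start_missing(last_fetched_ts=500, source_oldest_ts=100).value)
```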



 Comments   
Comment by Igor Solodovnikov [ 22/Feb/17 ]

Hello @Mark Agarunov,

Thank you for your comment.
Sure, I will watch SERVER-27403 and recheck this issue when SERVER-27403 gets fixed.

Comment by Mark Agarunov [ 21/Feb/17 ]

Hello igorsol,

Thank you for the report. Looking over the provided output, the behavior you've described appears to be due to the bug in SERVER-27403. The behavior is described in more detail in this comment. Please watch SERVER-27403 if you would like to follow updates on this issue.

Thanks,
Mark
