Type: Bug
Resolution: Duplicate
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 3.4.1
Component/s: Replication
Labels:
None

Operating System:
ALL
Steps To Reproduce:

To reproduce this, I used a modified version of the zbigMapReduce.js test in which the specified oplog size is reduced from 50MB to 2MB. I then ran the test as follows:

python buildscripts/resmoke.py --dbpathPrefix=~/ramd --executor=sharding --jobs=4 --storageEngine=wiredTiger jstests/sharding/zbigMapReduce.js

where ~/ramd is a 4GB tmpfs partition.
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

In a replication scenario with a high volume of inserts on the primary, a secondary is not always able to replicate all of them in time. The documentation on the RECOVERING state clearly says that in this situation the secondary should transition to RECOVERING and that manual intervention is required:

Due to overload, a secondary may fall far enough behind the other members of the replica set such that it may need to resync with the rest of the set. When this happens, the member enters the RECOVERING state and requires manual intervention.

In practice, however, the secondary sometimes attempts a rollback in this situation, which leads to a failed rollback and an incorrect node state.

I believe the problem is in the BackgroundSync::_produce function in src/mongo/db/repl/bgsync.cpp. When the OplogStartMissing status code is returned by _syncSourceResolver, this function correctly transitions the node to the RECOVERING state. But when the same status code is returned by oplogFetcher, the code executes a rollback without going to the RECOVERING state. I think the oplog fetcher can return OplogStartMissing either when a rollback is necessary or when the secondary has fallen too far behind the primary. So there should be an additional check to decide whether a rollback is necessary or the secondary should go to the RECOVERING state.
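The additional check proposed above could look roughly like the following. This is a minimal, self-contained sketch and not the actual bgsync.cpp code: the `Timestamp` alias, the `FetchFailure` struct, the `NextAction` enum, and `decideAfterFetcherError` are all hypothetical names introduced for illustration. It also assumes that the secondary can learn the oldest entry still present in the sync source's oplog, and that a last fetched entry older than that one means "too stale" rather than "diverged".

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for an oplog position. Real optimes
// also carry a term, which this sketch ignores.
using Timestamp = std::uint64_t;

// Hypothetical outcomes for the background sync loop.
enum class NextAction { kContinueFetching, kEnterRecovering, kRollback };

// Hypothetical summary of what the secondary knows after a fetch attempt.
struct FetchFailure {
    bool oplogStartMissing;       // the fetcher reported OplogStartMissing
    Timestamp lastFetched;        // last oplog entry the secondary fetched
    Timestamp sourceOldestEntry;  // oldest entry still in the source's oplog
};

// Decide what to do when the fetcher cannot find the secondary's last
// fetched entry in the sync source's oplog.
NextAction decideAfterFetcherError(const FetchFailure& f) {
    if (!f.oplogStartMissing) {
        return NextAction::kContinueFetching;
    }
    // Assumption: if our last entry is older than anything the source
    // still has, we fell off the back of its oplog and need a resync.
    if (f.lastFetched < f.sourceOldestEntry) {
        return NextAction::kEnterRecovering;
    }
    // Otherwise the entry should have been there, so the histories have
    // presumably diverged and a rollback is the plausible next step.
    return NextAction::kRollback;
}
```

With a 2MB oplog under heavy inserts, the first branch is the one expected to fire: the source's oldest entry moves past the secondary's last fetched entry, so the sketch returns `kEnterRecovering` instead of `kRollback`.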

duplicates

SERVER-27403 Consider term and rbid when validating the proposed sync source

Closed

is related to

SERVER-28068 Do not go into rollback due to falling off the back of your sync source's oplog

Closed

Assignee: Mark Agarunov (Inactive)
Reporter: Igor Solodovnikov
Participants: Igor Solodovnikov, Mark Agarunov
Votes: 0
Watchers: 10

Created: Feb 13 2017 12:33:31 PM UTC
Updated: Feb 22 2017 09:37:37 AM UTC
Resolved: Feb 21 2017 08:10:40 PM UTC
