[SERVER-27980] Secondary tries to rollback when it lagged too much behind primary Created: 13/Feb/17 Updated: 22/Feb/17 Resolved: 21/Feb/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Igor Solodovnikov | Assignee: | Mark Agarunov |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Operating System: | ALL |
| Steps To Reproduce: | To reproduce this situation I used a modified version of the zbigMapReduce.js test: I changed the specified oplog size from 50MB to 2MB. Then I executed the test this way:
where ~/ramd is a 4GB tmpfs partition. |
| Participants: | |
| Description |
|
In a replication scenario with a lot of inserts on the primary node, the secondary is not always able to replicate all of the inserts in time. The documentation on the RECOVERING state clearly says that in this situation the secondary should transition to RECOVERING mode and that manual intervention is required:
But in reality the secondary node sometimes tries to roll back in this situation, leading to a failed rollback and a wrong node state. I believe the problem is in the BackgroundSync::_produce function in src/mongo/db/repl/bgsync.cpp. When the OplogStartMissing status code is returned by _syncSourceResolver, this function correctly transitions the node to the RECOVERING state. But when the same status code is returned by the oplogFetcher, the code executes a rollback without going to the RECOVERING state. I think the oplog fetcher can return OplogStartMissing either when a rollback is really necessary or when the secondary has simply fallen far behind the primary. So there should be an additional check to decide whether a rollback is necessary or whether the secondary should go to the RECOVERING state (a rough sketch of such a check follows below). |
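The following is a hypothetical, simplified sketch of the kind of check described above, added for illustration only: it is not the actual bgsync.cpp code, the type and function names are invented for this example, and plain integer timestamps stand in for OpTimes. The idea is that when the fetcher reports OplogStartMissing, the node compares its last applied entry against the oldest entry the sync source still retains: if the secondary's entry is older than everything the source has kept, it has simply fallen too far behind and should go to RECOVERING; only if its entry falls inside the source's retained window have the histories diverged, so a rollback is warranted.

```cpp
// Hypothetical sketch only; names and types are invented for this example
// and do not come from src/mongo/db/repl/bgsync.cpp.
#include <cstdint>
#include <iostream>

enum class MemberState { Secondary, Recovering, Rollback };

struct OplogInfo {
    std::int64_t lastAppliedTs;      // newest oplog entry applied on this node
    std::int64_t oldestRemainingTs;  // oldest entry still present in the sync source's oplog
};

// Decide how to react when the oplog fetcher reports OplogStartMissing.
MemberState onOplogStartMissing(const OplogInfo& info) {
    if (info.lastAppliedTs < info.oldestRemainingTs) {
        // Our last applied entry has already been truncated from the sync
        // source's oplog: we are too stale to find a common point, so a
        // rollback cannot succeed and the node should enter RECOVERING
        // (manual resync required).
        return MemberState::Recovering;
    }
    // Our last applied entry lies within the window the sync source still
    // retains, yet the fetcher could not find it there: the histories have
    // diverged, so a rollback is the right response.
    return MemberState::Rollback;
}

int main() {
    // Secondary lagged past the (deliberately small, 2MB) oplog window:
    // expected decision is RECOVERING.
    std::cout << (onOplogStartMissing({/*lastAppliedTs=*/100, /*oldestRemainingTs=*/250}) ==
                          MemberState::Recovering
                      ? "go to RECOVERING\n"
                      : "roll back\n");

    // Divergent history inside the retained window: expected decision is rollback.
    std::cout << (onOplogStartMissing({/*lastAppliedTs=*/300, /*oldestRemainingTs=*/250}) ==
                          MemberState::Recovering
                      ? "go to RECOVERING\n"
                      : "roll back\n");
    return 0;
}
```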
| Comments |
| Comment by Igor Solodovnikov [ 22/Feb/17 ] |
|
Hello @Mark Agarunov, thank you for your comment. |
| Comment by Mark Agarunov [ 21/Feb/17 ] |
|
Hello igorsol,

Thank you for the report. Looking over the provided output, the behavior you've described appears to be due to a known bug tracked in another ticket, which this issue has been closed as a duplicate of.

Thanks, |