[SERVER-34768] Rollback can fail if run against a lagged node that catches up
Created: 01/May/18 | Updated: 29/Oct/23 | Resolved: 15/Jan/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.2.4, 4.3.3, 4.0.17 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Vesselina Ratcheva (Inactive) | Assignee: | Siyuan Zhou |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.2, v4.0 |
| Sprint: | Repl 2018-06-18, Repl 2018-07-02, Repl 2018-07-16, Repl 2018-07-30, Repl 2019-11-18, Repl 2019-12-02, Repl 2020-01-27 |
| Participants: | |
| Linked BF Score: | 52 |
| Description |
|
It is possible to decide to roll back against a sync source that is behind the rollback node (due to receiving an empty batch), then resolve the common point when that same source is ahead. This leads to the rollback node crashing during oplog truncation, as there are no entries after the common point. |
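The race described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: oplog entries are modeled as hypothetical `(term, timestamp)` tuples, and `find_common_point` and `entries_to_truncate` are illustrative helpers, not functions from the server code.

```python
from typing import List, Tuple

# Hypothetical simplification: an oplog entry is identified by (term, timestamp).
OpTime = Tuple[int, int]


def find_common_point(local: List[OpTime], remote: List[OpTime]) -> OpTime:
    """Return the newest local entry that also exists on the sync source."""
    remote_entries = set(remote)
    for entry in reversed(local):
        if entry in remote_entries:
            return entry
    raise RuntimeError("no common point found")


def entries_to_truncate(local: List[OpTime], common_point: OpTime) -> List[OpTime]:
    """Return the local entries that come after the common point."""
    return local[local.index(common_point) + 1:]


local_oplog = [(1, 1), (1, 2), (1, 3)]

# Step 1: the sync source lags, returns an empty batch, and the node
# decides to roll back against it.
source_at_decision = [(1, 1), (1, 2)]
source_is_behind = source_at_decision[-1] < local_oplog[-1]

# Step 2: by the time the common point is resolved, the same source
# has caught up and moved ahead on the same branch of history.
source_at_resolution = [(1, 1), (1, 2), (1, 3), (1, 4)]

common_point = find_common_point(local_oplog, source_at_resolution)
to_truncate = entries_to_truncate(local_oplog, common_point)
```

In this model the common point is the rollback node's own last entry, so `to_truncate` is empty. That corresponds to the condition the description identifies as triggering the crash during oplog truncation.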
| Comments |
| Comment by Githook User [ 10/Feb/20 ] |
|
Author: {'username': 'visualzhou', 'name': 'Siyuan Zhou', 'email': 'siyuan.zhou@mongodb.com'}
Message: (cherry picked from commit 319757ebb72611fb91044a2a81d1b77a6f3729c1)
|
| Comment by Githook User [ 07/Feb/20 ] |
|
Author: {'username': 'visualzhou', 'name': 'Siyuan Zhou', 'email': 'siyuan.zhou@mongodb.com'}
Message: (cherry picked from commit 319757ebb72611fb91044a2a81d1b77a6f3729c1) |
| Comment by Githook User [ 14/Jan/20 ] |
|
Author: {'name': 'Siyuan Zhou', 'email': 'siyuan.zhou@mongodb.com', 'username': 'visualzhou'}
Message: |
| Comment by Tess Avitabile (Inactive) [ 09/Jan/20 ] |
|
Feel free to work on this on BF Friday. |
| Comment by Siyuan Zhou [ 15/Oct/19 ] |
|
Re-opening this ticket since To answer tess.avitabile's question above, Will pointed out in BF-14623:
|
| Comment by William Schultz (Inactive) [ 23/Jul/18 ] |
|
Fixed by |
| Comment by Tess Avitabile (Inactive) [ 23/Jul/18 ] |
|
That makes sense to me. Thank you for investigating. I think it's fine to close this ticket. |
| Comment by Judah Schvimer [ 01/May/18 ] |
|
I think it is definitely a bug if a node chooses a sync source that is behind it. It doesn't look like the SyncSourceResolver checks that the sync source candidate is ahead of it, which means that if chaining is disallowed, nothing prevents nodes from choosing a sync source behind them. If chaining is not disabled, the fact that we compare our lastAppliedOpTime to a potential candidate's (and not just the timestamp) likely prevents us from syncing from a node behind us. |
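To make the distinction between comparing full OpTimes and comparing timestamps concrete, here is a minimal sketch. It assumes an OpTime orders by term first and timestamp second; the `OpTime` class and `candidate_is_ahead` helper are hypothetical and are not the SyncSourceResolver's actual interface.

```python
from dataclasses import dataclass


# Hypothetical model of an OpTime. With order=True the fields are compared
# in declaration order: term first, then timestamp.
@dataclass(frozen=True, order=True)
class OpTime:
    term: int
    timestamp: int


def candidate_is_ahead(my_last_applied: OpTime, candidate_last_applied: OpTime) -> bool:
    """Accept a sync source candidate only if it is strictly ahead of us."""
    return candidate_last_applied > my_last_applied


mine = OpTime(term=2, timestamp=5)

# A newer timestamp in an older term does not count as ahead.
stale_candidate = OpTime(term=1, timestamp=9)
timestamp_only_says_ahead = stale_candidate.timestamp > mine.timestamp
full_optime_says_ahead = candidate_is_ahead(mine, stale_candidate)

# A candidate in the same term with a later timestamp is ahead.
fresh_candidate = OpTime(term=2, timestamp=6)
fresh_is_ahead = candidate_is_ahead(mine, fresh_candidate)
```

A check of this shape rejects candidates that are equal to or behind the node, which is the property the comment above says is missing when chaining is disallowed.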
| Comment by Spencer Brody (Inactive) [ 01/May/18 ] |
|
We probably need to do more to make sure that we don't go into rollback against a node that's just behind us but not on a divergent branch of history. It's possible this can only happen if chaining is disallowed. We also should probably take a look at catchup_takeover_two_nodes_ahead.js and make sure there's no issue with the test. |