[SERVER-34728] Heartbeats are used to advance replication commit point

| Created: | 27/Apr/18 | Updated: | 27/Oct/23 | Resolved: | 09/May/18 |
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.4.14, 3.6.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Judah Schvimer | Assignee: | Tess Avitabile (Inactive) |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Operating System: | ALL |
| Backport Requested: | v3.6, v3.4 |
| Sprint: | Repl 2018-05-21 |
| Participants: | |
| Description |
|
The TopologyCoordinator uses _memberData.getLastAppliedOpTime() to advance the commit point on primaries. _memberData.getLastAppliedOpTime() returns _lastAppliedOpTime, which is set in advanceLastAppliedOpTime(); that is called from setUpValues, which is called on heartbeat responses.

This is a problem. Imagine we have three nodes A, B, and C. A starts as the primary and commits OpTime(Timestamp(1,1), 1) to all nodes. A writes OpTime(Timestamp(2,1), 1), which replicates to B, but A never receives the acknowledgement and never commits it. A also writes OpTime(Timestamp(3,1), 1). B then runs for election in term 2, and C votes for it since it is ahead. A then steps down and runs for election again in term 3; C votes for it and it wins. B then takes a write at OpTime(Timestamp(4,1), 2) and A takes a write at OpTime(Timestamp(5,1), 3). A then gets a heartbeat from B, hears that B is at OpTime(Timestamp(4,1), 2), and commits all operations less than that, including OpTime(Timestamp(3,1), 1), which is only on A itself. If B then runs for election again in term 4 and C votes for it, A can begin syncing from B and roll back its majority-committed write.

It's possible something will prevent the above from happening exactly as stated, and it may be easier to reproduce in a 5-node set. That said, it is definitely a problem (and currently possible) for a node to commit operations on its branch of history based on oplog entries with higher optimes than the commit point but lower terms than its current term (which would not cause a step down). |
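To make the mechanism at issue concrete, the sketch below (not the actual server code; OpTime, the member list, and the majority size are simplified stand-ins) shows a primary deriving a candidate commit point from the per-member lastApplied optimes it has cached. Nothing in this calculation distinguishes whether a cached value arrived via replSetUpdatePosition or via a heartbeat response.

```cpp
// Illustrative sketch only: the candidate commit point is the
// writeMajority-th highest cached lastApplied optime, whatever its source.
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

struct OpTime {
    uint64_t timestamp;  // stands in for Timestamp(secs, inc)
    int64_t term;
    bool operator<(const OpTime& rhs) const {
        return std::tie(term, timestamp) < std::tie(rhs.term, rhs.timestamp);
    }
};

// lastAppliedPerMember holds one cached optime per replica set member.
OpTime calculateCandidateCommitPoint(std::vector<OpTime> lastAppliedPerMember,
                                     size_t writeMajority) {
    // Sort descending; the writeMajority-th entry is an optime that at least
    // a majority of members have applied.
    std::sort(lastAppliedPerMember.begin(), lastAppliedPerMember.end(),
              [](const OpTime& a, const OpTime& b) { return b < a; });
    return lastAppliedPerMember[writeMajority - 1];
}
```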
| Comments |
| Comment by Tess Avitabile (Inactive) [ 09/May/18 ] |
|
It is safe to use heartbeats to advance the commit point, since the primary checks that it does not advance the commit point to a value lower than the first optime of its term. |
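A hedged sketch of that check follows; the name firstOpTimeOfMyTerm mirrors _firstOpTimeOfMyTerm in TopologyCoordinator, but this is an illustration of the rule, not the server's advanceLastCommittedOpTime().

```cpp
// Sketch of the guard: a candidate commit point below the first optime
// written in the current primary term is ignored, per the Raft commit rule.
#include <cstdint>
#include <tuple>

struct OpTime {
    uint64_t timestamp;
    int64_t term;
    bool operator<(const OpTime& rhs) const {
        return std::tie(term, timestamp) < std::tie(rhs.term, rhs.timestamp);
    }
};

// Returns the (possibly unchanged) commit point after considering `candidate`.
OpTime advanceCommitPoint(OpTime currentCommitPoint,
                          OpTime candidate,
                          OpTime firstOpTimeOfMyTerm) {
    if (candidate < firstOpTimeOfMyTerm) {
        // Nothing may be declared committed until an entry from our own term
        // is majority-replicated.
        return currentCommitPoint;
    }
    return currentCommitPoint < candidate ? candidate : currentCommitPoint;
}
```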
| Comment by Spencer Brody (Inactive) [ 09/May/18 ] |
|
So I think we've all agreed that there's no problem with w:majority write concerns or with committing ops (and thus showing them in readConcern:majority reads) incorrectly with the current implementation (nice reasoning siyuan.zhou!). As for the question of w:<number> write concerns and the 'writtenTo' field, I think we're also safe from those: for a stale primary to incorrectly use information from another primary to satisfy a write concern, it must hear about an optime newer than the one it's waiting to confirm, and thus it must hear about a new term. We check for term changes when waiting for write concern and never complete a writeConcern request if our term changes midway through. So I actually think there's no bug here at all and we can just close this ticket! |
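A sketch of that argument only (getCurrentTerm and getConfirmedCount are invented probes; the real server keeps this state in the replication coordinator and blocks on a condition variable rather than polling):

```cpp
// Sketch: a write-concern wait that aborts when the replication term changes,
// so acknowledgements implied by a newer primary's term can never satisfy it.
#include <cstdint>
#include <functional>

enum class WaitResult { Satisfied, InterruptedByTermChange };

WaitResult waitForWriteConcern(int w,
                               int64_t termAtStartOfWait,
                               const std::function<int64_t()>& getCurrentTerm,
                               const std::function<int()>& getConfirmedCount) {
    while (getConfirmedCount() < w) {
        // A newer optime implies a newer term; seeing a new term aborts the
        // wait instead of letting stale confirmations leak through.
        if (getCurrentTerm() != termAtStartOfWait)
            return WaitResult::InterruptedByTermChange;
    }
    return WaitResult::Satisfied;
}
```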
| Comment by Tess Avitabile (Inactive) [ 09/May/18 ] |
|
The check is done in advanceLastCommittedOpTime(), which is called by that function. |
| Comment by Matthew Russotto [ 09/May/18 ] |
|
Tess, if we need to fix this, I think your plan is a good one, with the caveat that arbiters still need to set that data from heartbeats (because it's the only way they get it). The problem scenario was something like this:

Primary N1 is happily running along in Term 1, but there's some sort of intermittent network partition.
Primary N2 is running along in Term 2 (N1 didn't hear anything of the election which caused this).
Secondary N3 is in Term 2.

N1 does a few writes, and then sees a heartbeat response from N3 which makes it think those writes are majority committed when they are not. If Siyuan is right and we only advance the commit point based on optimes in our own term, this isn't an issue, I think. However, I don't see that test here:
|
| Comment by Tess Avitabile (Inactive) [ 09/May/18 ] |
|
I think we only need to be concerned about using heartbeat data for filling out replSetUpdatePosition if we are concerned about using heartbeat data for anything the optimes in replSetUpdatePosition data are used for. If we are not concerned about w:<number>, custom write concerns, or writtenTo, then I think it is harmless to use heartbeat data to fill out replSetUpdatePosition. Unless we anticipate future bugs by programmers who assume the optimes in replSetUpdatePosition all came from the spanning tree. |
| Comment by Judah Schvimer [ 09/May/18 ] |
|
I agree with tess.avitabile that there is no such thing as a "wrong commit point". The problem I can think of with hearing about a commit point via a heartbeat is thinking operations below the commit point are committed, even though they may still be rolled back. siyuan.zhou, you make a good point about not committing operations until you commit one in your term. I think you're right that this protects us from committing operations incorrectly. I agree with Tess though that using heartbeat data for the other purposes (I'm definitely concerned about filling out updatePosition commands with them) is concerning. |
| Comment by Tess Avitabile (Inactive) [ 09/May/18 ] |
|
That is a very good point. That is my understanding of the Raft protocol as well, and we do perform the check that the primary does not advance the commit point to a value lower than the first optime of its term. So perhaps there is no bug? I'm still concerned about the w:<number> write concerns, custom write concerns, and the writtenTo field in the write concern result, but I'm not sure how much we care about those. Maybe we only care about majority write concern. siyuan.zhou, I don't understand the issue with hearing a wrong commit point from the old primary. I thought that if any primary thought an operation was committed, then we are guaranteed that operation will never roll back. |
| Comment by Siyuan Zhou [ 08/May/18 ] |
|
I don't think it's wrong to have the primary learn of the positions of secondaries from heartbeats, because we only allow the primary to advance the commit point to an optime in its own term. The definition of "commit" for an oplog entry in the primary's term is:

- The entry is in the primary's current term, and
- The entry has been replicated to a majority of the voting members.

Oplog entries with previous terms get "committed" by applying the Log Matching rule after the first optime in the primary's term gets committed.

I don't think any of these conditions are affected by learning the oplog positions via heartbeats. Secondaries' learning the commit point from the primary is different, because secondaries may hear of a wrong commit point from the old primary before or during the old primary's rollback. Instead of our current rule of propagating commit points via the spanning tree, I think an alternative rule could be to only accept commit points that are in the secondary's current term, which is implied by our current rule. |
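The alternative rule for secondaries could look roughly like the sketch below; SecondaryState and maybeAcceptCommitPoint are invented names used only to illustrate the term check.

```cpp
// Sketch: a secondary accepts a learned commit point only if it is in the
// secondary's current term; a stale primary in an old term may be advertising
// a commit point it is about to roll back.
#include <cstdint>
#include <tuple>

struct OpTime {
    uint64_t timestamp;
    int64_t term;
    bool operator<(const OpTime& rhs) const {
        return std::tie(term, timestamp) < std::tie(rhs.term, rhs.timestamp);
    }
};

struct SecondaryState {
    int64_t currentTerm = 0;
    OpTime committedOpTime{0, -1};  // term -1 means "none yet"

    // Returns true only if the commit point actually advanced.
    bool maybeAcceptCommitPoint(const OpTime& learned) {
        if (learned.term != currentTerm)
            return false;
        if (committedOpTime < learned) {
            committedOpTime = learned;
            return true;
        }
        return false;
    }
};
```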
| Comment by Siyuan Zhou [ 08/May/18 ] |
|
I don't think the problem can happen. To make the 3-node replset scenario in the description clearer, I'm giving the steps numbers.

1. A starts as primary in term 1 and commits OpTime(Timestamp(1,1), 1) to all nodes.
2. A writes OpTime(Timestamp(2,1), 1); it replicates to B, but A never receives the acknowledgement and never commits it.
3. A writes OpTime(Timestamp(3,1), 1).
4. B runs for election in term 2, and C votes for it since B is ahead.
5. A steps down and runs for election again in term 3; C votes for it and A wins.
6. B takes a write at OpTime(Timestamp(4,1), 2).
7. A takes a write at OpTime(Timestamp(5,1), 3).
8. A gets a heartbeat from B, hears that B is at OpTime(Timestamp(4,1), 2), and commits all operations less than that, including OpTime(Timestamp(3,1), 1), which is only on A.

The problem seems to happen at step 8, but actually A won't advance the commit point after hearing of OpTime(Timestamp(4,1), 2), because according to Raft, the primary cannot advance the commit point until the first optime in its term, OpTime(Timestamp(5,1), 3), gets committed. That's the _firstOpTimeOfMyTerm in TopologyCoordinator. Did I miss anything? |
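As a quick numeric check of step 8 under that rule (ordering OpTimes by term and then timestamp; the outcome is the same for these values under either field ordering):

```cpp
// Worked check: A's candidate commit point OpTime(Timestamp(4,1), 2) compares
// below _firstOpTimeOfMyTerm = OpTime(Timestamp(5,1), 3), so A does not
// advance the commit point on B's heartbeat.
#include <cstdint>
#include <cstdio>
#include <tuple>

struct OpTime {
    uint32_t secs, inc;
    int64_t term;
};

static bool lessThan(const OpTime& a, const OpTime& b) {
    return std::tie(a.term, a.secs, a.inc) < std::tie(b.term, b.secs, b.inc);
}

int main() {
    OpTime candidate{4, 1, 2};            // learned from B's heartbeat
    OpTime firstOpTimeOfMyTerm{5, 1, 3};  // A's first write in term 3
    std::printf("advance commit point? %s\n",
                lessThan(candidate, firstOpTimeOfMyTerm) ? "no" : "yes");
    return 0;
}
```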
| Comment by Spencer Brody (Inactive) [ 08/May/18 ] |
|
Would be interesting to understand why the read_committed_stale_history.js test (which was added when we fixed this problem the first time in |
| Comment by Spencer Brody (Inactive) [ 08/May/18 ] |
|
Plan sounds good to me. For catchup takeover, I think it would be fine to look at the more recent of the heartbeat or spanning tree optimes. Catchup takeover doesn't affect correctness, but you'd like to avoid doing the takeover if you don't need to whenever possible, so if you learn via a heartbeat that the primary has already caught up before you learn that by the spanning tree, it'd be nice to take that knowledge into account and avoid doing the takeover. |
| Comment by Judah Schvimer [ 08/May/18 ] |
|
siyuan.zhou, do you have any thoughts on catchup takeover? |
| Comment by Tess Avitabile (Inactive) [ 08/May/18 ] |
|
There are several consequences of using optimes from heartbeats to set MemberData::_lastDurableOpTime and MemberData::_lastAppliedOpTime. As in the description, this means that optimes from heartbeats are used to advance the replication commit point. Additionally, it means that optimes from heartbeats are used for:

- satisfying w:<number> write concerns, custom write concerns, and the writtenTo field in write concern results
- deciding whether to initiate catchup takeover
- filling out the optimes in replSetUpdatePosition commands we send to our sync source

I believe that all of these are errors, except possibly catchup takeover, since I'm not knowledgeable enough about when we should initiate catchup takeover. Fortunately, we do not use MemberData::_lastDurableOpTime or MemberData::_lastAppliedOpTime for sync source selection. Instead we use MemberData::getHeartbeatAppliedOpTime(), which uses MemberData::_lastResponse. I think the solution then is to not use heartbeats to update MemberData::_lastDurableOpTime and MemberData::_lastAppliedOpTime. This should have no effect on sync source selection. It should lead to correct behavior for advancing the commit point and all of the above bullet points, except possibly catchup takeover, which I would like advice on. judah.schvimer, matthew.russotto, spencer, how does this plan sound to you? And can you advise about catchup takeover? |
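A minimal sketch of the shape of that plan follows; MemberDataSketch and UpdateSource are invented names, not MemberData's real interface. Heartbeat responses would stop touching the optimes that feed the commit point, while sync source selection would keep reading a heartbeat-side value, as getHeartbeatAppliedOpTime() does today.

```cpp
// Sketch: only replSetUpdatePosition data moves the optime used for the
// commit point; heartbeats only refresh the heartbeat-side view used for
// sync source selection.
#include <cstdint>
#include <tuple>

struct OpTime {
    uint64_t timestamp;
    int64_t term;
    bool operator<(const OpTime& rhs) const {
        return std::tie(term, timestamp) < std::tie(rhs.term, rhs.timestamp);
    }
};

enum class UpdateSource { kHeartbeat, kUpdatePosition };

class MemberDataSketch {
public:
    void advanceLastAppliedOpTime(const OpTime& opTime, UpdateSource source) {
        if (source == UpdateSource::kHeartbeat) {
            if (_heartbeatAppliedOpTime < opTime)
                _heartbeatAppliedOpTime = opTime;
            return;  // never feeds the commit point
        }
        if (_lastAppliedOpTime < opTime)
            _lastAppliedOpTime = opTime;
    }

    // Input to commit point calculation.
    OpTime lastAppliedOpTime() const { return _lastAppliedOpTime; }
    // Input to sync source selection, as today.
    OpTime heartbeatAppliedOpTime() const { return _heartbeatAppliedOpTime; }

private:
    OpTime _lastAppliedOpTime{0, -1};
    OpTime _heartbeatAppliedOpTime{0, -1};
};
```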
| Comment by Judah Schvimer [ 27/Apr/18 ] |
|
One proposed fix (in a throwback to 3.2) is to maintain separate cached optimes for each replica set member for optimes heard from heartbeats and optimes heard from update position commands. The update position optimes could be used to move the commit point, and the greater of the two could be used to choose a sync source candidate. An alternative would be to not cache the optimes in the heartbeats at all. If no sync source is ahead of you, you'd just query the top of each oplog in turn until you found one that's acceptable. This could have a lot of unforeseen negative impacts on spanning tree maintenance. |
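A sketch of that 3.2-style alternative, with invented names (CachedMemberOpTimes and its accessors are illustrative only): keep both cached optimes per member, let only the updatePosition-sourced value feed the commit point, and use whichever is greater when evaluating sync source candidates.

```cpp
// Sketch: two cached optimes per member with different consumers.
#include <algorithm>
#include <cstdint>
#include <tuple>

struct OpTime {
    uint64_t timestamp;
    int64_t term;
    bool operator<(const OpTime& rhs) const {
        return std::tie(term, timestamp) < std::tie(rhs.term, rhs.timestamp);
    }
};

struct CachedMemberOpTimes {
    OpTime fromHeartbeat{0, -1};
    OpTime fromUpdatePosition{0, -1};

    // Commit point input: spanning-tree (replSetUpdatePosition) data only.
    OpTime commitPointOpTime() const { return fromUpdatePosition; }

    // Sync source candidacy: the fresher of the two views.
    OpTime syncSourceOpTime() const {
        return std::max(fromHeartbeat, fromUpdatePosition);
    }
};
```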