[SERVER-34728] Heartbeats are used to advance replication commit point Created: 27/Apr/18  Updated: 27/Oct/23  Resolved: 09/May/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.4.14, 3.6.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Judah Schvimer Assignee: Tess Avitabile (Inactive)
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
related to SERVER-27123 Only update commit point via spanning... Closed
related to SERVER-29076 Replace all usage of heartbeat op tim... Backlog
related to SERVER-29079 Unify liveness information between sp... Backlog
related to SERVER-26990 Unify tracking of secondary state bet... Closed
related to SERVER-29078 Eliminate use of memberHeartbeatData ... Closed
Operating System: ALL
Backport Requested:
v3.6, v3.4
Sprint: Repl 2018-05-21
Participants:

 Description   

The TopologyCoordinator uses _memberData.getLastAppliedOpTime() to advance the commit point on primaries. _memberData.getLastAppliedOpTime() returns _lastAppliedOpTime, which is set in advanceLastAppliedOpTime(); that in turn is called from setUpValues(), which runs on heartbeat responses.
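
For readers unfamiliar with this code path, here is a heavily simplified, self-contained sketch of the flow just described. The types are illustrative stand-ins for MemberData and the commit-point calculation in TopologyCoordinator, not the actual mongod classes, and the majority math is reduced to a sort.

{code:cpp}
// Simplified model of the path described above: a heartbeat response updates a
// member's last-applied optime, and the commit point is then computed from
// those per-member optimes. Illustrative only; not the real mongod classes.
#include <algorithm>
#include <cstdint>
#include <vector>

struct OpTime {
    OpTime() = default;
    OpTime(std::uint64_t ts, std::int64_t t) : timestamp(ts), term(t) {}

    std::uint64_t timestamp = 0;  // stand-in for Timestamp(secs, inc)
    std::int64_t term = -1;

    bool operator<(const OpTime& rhs) const {
        return term < rhs.term || (term == rhs.term && timestamp < rhs.timestamp);
    }
};

// Stand-in for repl::MemberData.
class MemberData {
public:
    // In the real code this is reached from setUpValues(), which runs on every
    // heartbeat response.
    void advanceLastAppliedOpTime(const OpTime& opTime) {
        if (_lastAppliedOpTime < opTime) {
            _lastAppliedOpTime = opTime;
        }
    }

    OpTime getLastAppliedOpTime() const {
        return _lastAppliedOpTime;
    }

private:
    OpTime _lastAppliedOpTime;
};

// Stand-in for the commit-point calculation: the highest optime that a
// majority of members have applied (after an ascending sort, that is the
// entry at index size - majoritySize).
OpTime calculateCommitPoint(const std::vector<MemberData>& members) {
    std::vector<OpTime> optimes;
    for (const auto& m : members) {
        optimes.push_back(m.getLastAppliedOpTime());
    }
    std::sort(optimes.begin(), optimes.end());
    const std::size_t majoritySize = optimes.size() / 2 + 1;
    return optimes[optimes.size() - majoritySize];
}

int main() {
    // Three members; two of them (a majority) have applied timestamp 2.
    std::vector<MemberData> members(3);
    members[0].advanceLastAppliedOpTime(OpTime(3, 1));  // the primary itself
    members[1].advanceLastAppliedOpTime(OpTime(2, 1));  // learned via a heartbeat
    members[2].advanceLastAppliedOpTime(OpTime(1, 1));
    return calculateCommitPoint(members).timestamp == 2 ? 0 : 1;
}
{code}

The concern in this ticket is that, with heartbeat responses feeding advanceLastAppliedOpTime(), the inputs to that calculation can reflect writes from another primary's term.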

This is a problem. Imagine we have three nodes A, B, and C. A starts as the primary and commits OpTime(Timestamp(1,1), 1) to all nodes. A writes OpTime(Timestamp(2,1), 1) and it replicates to B, but A never receives the acknowledgement and never commits it. A also writes OpTime(Timestamp(3,1), 1). B then runs for election in term 2 and C votes for it since B is ahead. A then steps down and runs for election again in term 3; C votes for it and it wins. B then takes a write at OpTime(Timestamp(4,1), 2) and A takes a write at OpTime(Timestamp(5,1), 3). A then gets a heartbeat from B, hears that B is at OpTime(Timestamp(4,1), 2), and commits all operations less than that, including OpTime(Timestamp(3,1), 1), which exists only on A. If B then runs for election again in term 4 and C votes for it, then A can begin syncing from B and roll back its majority-committed write.

It's possible something will prevent the above from happening exactly as stated, and it may be easier to reproduce in a 5-node set. That said, it is definitely a problem (and currently possible) for a node to commit operations on its branch of history based on oplog entries with higher optimes than the commit point but lower terms than its current term (which would not cause a stepdown).



 Comments   
Comment by Tess Avitabile (Inactive) [ 09/May/18 ]

It is safe to use heartbeats to advance the commit point, since the primary checks that it does not advance the commit point to a value lower than the first optime of its term.

Comment by Spencer Brody (Inactive) [ 09/May/18 ]

So I think we've all agreed that with the current implementation there's no problem with w:majority write concerns or with incorrectly committing ops (and thus showing them in readConcern:majority reads). Nice reasoning, siyuan.zhou!

As for the question of w:<number> write concerns and the 'writtenTo' field, I think we're also safe from those. For a stale primary to incorrectly use information from another primary to satisfy a write concern, it must hear about an optime newer than the one it's waiting on, and thus about a new term; we check for term changes while waiting for write concern and never complete a writeConcern request if our term changes mid-wait (sketched below).

So I actually think there's no bug here at all and we can just close this ticket!
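
As an illustration of the term-change argument above, here is a minimal sketch assuming a simplified model; checkWriteConcern and its parameters are hypothetical names, not the server's actual API.

{code:cpp}
// Illustration of the argument above: a primary waiting on a w:<number> write
// concern re-checks its term and never acknowledges the write if the term has
// changed mid-wait. Hypothetical names; not mongod code.
#include <cassert>
#include <cstdint>

enum class WriteConcernOutcome { kSatisfied, kStillWaiting, kFailedTermChanged };

WriteConcernOutcome checkWriteConcern(std::int64_t termAtStart,
                                      std::int64_t currentTerm,
                                      int nodesAtOrBeyondOpTime,
                                      int nodesRequired) {
    if (currentTerm != termAtStart) {
        // A newer term was observed mid-wait, so any optimes we heard about may
        // belong to another primary's branch of history: fail the request.
        return WriteConcernOutcome::kFailedTermChanged;
    }
    if (nodesAtOrBeyondOpTime >= nodesRequired) {
        return WriteConcernOutcome::kSatisfied;
    }
    return WriteConcernOutcome::kStillWaiting;
}

int main() {
    // w:2 satisfied while the term is unchanged.
    assert(checkWriteConcern(3, 3, 2, 2) == WriteConcernOutcome::kSatisfied);
    // Term changed mid-wait: never complete the request, even if enough nodes
    // appear to have the optime.
    assert(checkWriteConcern(3, 4, 2, 2) == WriteConcernOutcome::kFailedTermChanged);
    return 0;
}
{code}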

Comment by Tess Avitabile (Inactive) [ 09/May/18 ]

It is done in advanceLastCommittedOpTime(), which is called by that function.

Comment by Matthew Russotto [ 09/May/18 ]

Tess, if we need to fix this, I think your plan is a good one, with the caveat that arbiters still need to set that data from heartbeats (because it's the only way they get it).

The problem scenario was something like this:

Primary N1 is happily running along in Term 1, but there's some sort of intermittent network partition.

Primary N2 is running along in Term 2 (N1 didn't hear anything of the election which caused this).

Secondary N3 is in Term 2.

N1 does a few writes, and then sees a heartbeat response from N3 which makes it think those writes are majority committed when they are not.

If Siyuan is right and we only advance the commit point based on optimes in our own term, this isn't an issue, I think.  However, I don't see that test here:

https://github.com/mongodb/mongo/blob/afb535dfea19036530adee8ddc940e1bcb9794cc/src/mongo/db/repl/topology_coordinator.cpp#L3001

 

Comment by Tess Avitabile (Inactive) [ 09/May/18 ]

I think we only need to be concerned about using heartbeat data for filling out replSetUpdatePosition if we are concerned about using heartbeat data for anything the optimes in replSetUpdatePosition are used for. If we are not concerned about w:<number>, custom write concerns, or writtenTo, then I think it is harmless to use heartbeat data to fill out replSetUpdatePosition. Unless, that is, we anticipate future bugs from programmers who assume the optimes in replSetUpdatePosition all came from the spanning tree.

Comment by Judah Schvimer [ 09/May/18 ]

I agree with tess.avitabile that there is no such thing as a "wrong commit point". The problem I can think of with hearing about a commit point via a heartbeat is thinking operations below the commit point are committed, even though they may still be rolled back. 

siyuan.zhou, you make a good point about not committing operations until you commit one in your term. I think you're right that this protects us from committing operations incorrectly. I agree with Tess though that using heartbeat data for the other purposes (I'm definitely concerned about filling out updatePosition commands with them) is concerning.

Comment by Tess Avitabile (Inactive) [ 09/May/18 ]

That is a very good point. That is my understanding of the Raft protocol as well, and we do perform the check that the primary does not advance the commit point to a value lower than the first optime of its term. So perhaps there is no bug?

I'm still concerned about the w:<number> write concerns, custom write concerns, and the writtenTo field in the write concern result, but I'm not sure how much we care about those. Maybe we only care about majority write concern.

siyuan.zhou, I don't understand the issue with hearing a wrong commit point from the old primary. I thought that if any primary thought an operation was committed, then we are guaranteed that operation will never roll back.

Comment by Siyuan Zhou [ 08/May/18 ]

I don't think it's wrong to have the primary learn of the positions of secondaries from heartbeats, because we only allow the primary to advance the commit point to an optime in its own term. The definition of "committed" for an oplog entry in the primary's term is:

  1. The oplog entry has been replicated to a majority
  2. The primary learns of the above fact in its term. This rule is our extension of the original Raft protocol due to SERVER-22136.

Oplog entries with previous terms get "committed" by applying the Log Matching rule after the first optime in the primary's term gets committed.

If two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index. 

I don't think any of these conditions are affected by learning the oplog positions via heartbeats.

Secondaries' learning the commit point from the primary is different, because secondaries may hear of a wrong commit point from the old primary before or during the old primary's rollback. Instead of our current rule of propagating commit points via spanning tree, I think an alternative rule could be to only accept commit points that are in the secondary's current term, which is implied by our current rule.

Comment by Siyuan Zhou [ 08/May/18 ]

I don't think the problem can happen. To make the 3-node replset scenario in the description clearer, I'm numbering the steps.

  1. A starts as the primary and commits OpTime(Timestamp(1,1), 1) to all nodes.
  2. A writes OpTime(Timestamp(2,1), 1) and it replicates to B, but A never receives the acknowledgement and never commits it.
  3. A also writes OpTime(Timestamp(3,1), 1).
  4. B then runs for election in term 2 and C votes for it since it's ahead.
  5. A then steps down and runs for election again in term 3. C votes for it and it wins.
  6. B then takes a write at OpTime(Timestamp(4,1), 2)
  7. A takes a write at OpTime(Timestamp(5,1), 3).
  8. A then gets a heartbeat from B and hears that it is at OpTime(Timestamp(4,1), 2) and commits all operations less than that, including OpTime(Timestamp(3,1), 1), which is only on itself.
  9. If B then runs for election again in term 4 and C votes for it, then A can begin syncing from B and roll back its majority-committed write.

The problem seems to happen at step 8, but actually A won't advance the commit point after hearing of OpTime(Timestamp(4,1), 2), because according to Raft, the primary cannot advance the commit point until the first optime in its term, OpTime(Timestamp(5,1), 3), gets committed. That's the _firstOpTimeOfMyTerm in TopologyCoordinator. Did I miss anything?
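
To make the step-8 argument concrete, here is a tiny self-contained sketch of that guard using the optimes from the scenario; the free function is an illustration, not the actual TopologyCoordinator code.

{code:cpp}
// Why step 8 does not advance the commit point: a simplified model of the
// _firstOpTimeOfMyTerm check. Illustrative only; not the real TopologyCoordinator.
#include <cassert>
#include <cstdint>

struct OpTime {
    std::uint64_t ts;   // the Timestamp, simplified to its seconds field
    std::int64_t term;

    bool operator<(const OpTime& rhs) const {
        return term < rhs.term || (term == rhs.term && ts < rhs.ts);
    }
};

// The primary refuses to move the commit point to anything earlier than the
// first optime it wrote in its own term.
bool mayAdvanceCommitPointTo(const OpTime& candidate, const OpTime& firstOpTimeOfMyTerm) {
    return !(candidate < firstOpTimeOfMyTerm);
}

int main() {
    // A is primary in term 3; its first write in term 3 is OpTime(Timestamp(5,1), 3).
    const OpTime firstOpTimeOfMyTerm{5, 3};

    // Step 8: a heartbeat from B reports OpTime(Timestamp(4,1), 2). That is below
    // the first optime of A's term, so A must not advance the commit point, and
    // OpTime(Timestamp(3,1), 1) is not treated as committed.
    assert(!mayAdvanceCommitPointTo(OpTime{4, 2}, firstOpTimeOfMyTerm));

    // Only once A's own OpTime(Timestamp(5,1), 3) is majority replicated may the
    // commit point advance, and Log Matching then covers the earlier entries.
    assert(mayAdvanceCommitPointTo(OpTime{5, 3}, firstOpTimeOfMyTerm));
    return 0;
}
{code}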

Comment by Spencer Brody (Inactive) [ 08/May/18 ]

Would be interesting to understand why the read_committed_stale_history.js test (which was added when we fixed this problem the first time in SERVER-27123) isn't failing now.

Comment by Spencer Brody (Inactive) [ 08/May/18 ]

Plan sounds good to me.

For catchup takeover, I think it would be fine to look at the more recent of the heartbeat or spanning tree optimes. Catchup takeover doesn't affect correctness, but you'd like to avoid an unnecessary takeover whenever possible, so if you learn via a heartbeat that the primary has already caught up before you learn that via the spanning tree, it'd be nice to take that knowledge into account and avoid the takeover.
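
A small sketch of this suggestion, assuming a simplified decision function; the names are illustrative, not the actual catchup-takeover code.

{code:cpp}
// Sketch of the suggestion above: before scheduling a catchup takeover, judge
// the primary by the more recent of the heartbeat and spanning-tree optimes we
// have for it, so fresher heartbeat knowledge can avoid an unnecessary
// takeover. Hypothetical names; not the actual server code.
#include <algorithm>
#include <cstdint>

struct OpTime {
    std::uint64_t timestamp;
    std::int64_t term;

    bool operator<(const OpTime& rhs) const {
        return term < rhs.term || (term == rhs.term && timestamp < rhs.timestamp);
    }
};

bool shouldConsiderCatchupTakeover(const OpTime& primaryOpTimeFromHeartbeat,
                                   const OpTime& primaryOpTimeFromSpanningTree,
                                   const OpTime& myLastApplied) {
    // Take whichever view of the primary is more recent.
    const OpTime primaryOpTime =
        std::max(primaryOpTimeFromHeartbeat, primaryOpTimeFromSpanningTree);
    // Only consider a takeover if the primary still appears to be behind us.
    return primaryOpTime < myLastApplied;
}

int main() {
    // A heartbeat already shows the primary caught up to (10, term 2), even though
    // the spanning tree still reports (8, term 2): no takeover needed.
    return shouldConsiderCatchupTakeover({10, 2}, {8, 2}, {10, 2}) ? 1 : 0;
}
{code}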

Comment by Judah Schvimer [ 08/May/18 ]

siyuan.zhou, do you have any thoughts on catchup takeover?

Comment by Tess Avitabile (Inactive) [ 08/May/18 ]

There are several consequences of using optimes from heartbeats to set MemberData::_lastDurableOpTime and MemberData::_lastAppliedOpTime. As in the description, this means that optimes from heartbeats are used to advance the replication commit point. Additionally, it means that optimes from heartbeats feed the other consumers of these fields, including w:<number> and custom write concerns, the writtenTo field, filling out replSetUpdatePosition commands, and deciding whether to initiate catchup takeover.

I believe that all of these are errors, except possibly catchup takeover, since I'm not knowledgeable enough about when we should initiate catchup takeover.

Fortunately, we do not use MemberData::_lastDurableOpTime or MemberData::_lastAppliedOpTime for sync source selection. Instead we use MemberData::getHeartbeatAppliedOpTime(), which uses MemberData::_lastResponse. I think the solution then is to not use heartbeats to update MemberData::_lastDurableOpTime and MemberData::_lastAppliedOpTime. This should have no effect on sync source selection, and it should lead to correct behavior for advancing the commit point and all of the consequences listed above, except possibly catchup takeover, which I would like advice on. (A rough sketch of this separation follows below.)

judah.schvimer, matthew.russotto, spencer, how does this plan sound to you? And can you advise about catchup takeover?
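
Here is a simplified, self-contained sketch of the separation this plan implies; the field and method names mirror the ones mentioned above, but the class is an illustrative stand-in, not the real MemberData.

{code:cpp}
// Sketch of the proposed separation: heartbeat responses only update the
// heartbeat-side optime (which sync source selection already reads), while
// _lastAppliedOpTime (which the commit-point calculation reads) is updated only
// from spanning-tree replSetUpdatePosition data. Illustrative stand-in only.
#include <cstdint>

struct OpTime {
    OpTime() = default;
    OpTime(std::uint64_t ts, std::int64_t t) : timestamp(ts), term(t) {}

    std::uint64_t timestamp = 0;
    std::int64_t term = -1;

    bool operator<(const OpTime& rhs) const {
        return term < rhs.term || (term == rhs.term && timestamp < rhs.timestamp);
    }
};

class MemberDataSketch {
public:
    // Heartbeat response path: under the plan, only the heartbeat-side optime
    // moves (in the real class this information lives in _lastResponse).
    void setUpValuesFromHeartbeat(const OpTime& appliedOpTime) {
        if (_heartbeatAppliedOpTime < appliedOpTime) {
            _heartbeatAppliedOpTime = appliedOpTime;
        }
    }

    // replSetUpdatePosition path (spanning tree): the only path that moves the
    // optime consumed by the commit-point calculation.
    void advanceLastAppliedOpTime(const OpTime& appliedOpTime) {
        if (_lastAppliedOpTime < appliedOpTime) {
            _lastAppliedOpTime = appliedOpTime;
        }
    }

    // Used for sync source selection (unchanged by the plan).
    OpTime getHeartbeatAppliedOpTime() const { return _heartbeatAppliedOpTime; }

    // Used for advancing the commit point, w:<number> write concerns, etc.
    OpTime getLastAppliedOpTime() const { return _lastAppliedOpTime; }

private:
    OpTime _heartbeatAppliedOpTime;
    OpTime _lastAppliedOpTime;
};

int main() {
    MemberDataSketch m;
    m.setUpValuesFromHeartbeat(OpTime(10, 2));  // knowledge from a heartbeat only
    m.advanceLastAppliedOpTime(OpTime(8, 2));   // knowledge from the spanning tree
    // The commit-point calculation would see (8, term 2) here, not the
    // heartbeat-only (10, term 2).
    return m.getLastAppliedOpTime().timestamp == 8 ? 0 : 1;
}
{code}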

Comment by Judah Schvimer [ 27/Apr/18 ]

One proposed fix (a throwback to 3.2) is to maintain, for each replica set member, separate cached optimes: one heard from heartbeats and one heard from updatePosition commands. The updatePosition optimes could be used to move the commit point, and the greater of the two could be used to choose a sync source candidate.

An alternative would be to not cache the optimes in the heartbeats at all. If no sync source is ahead of you, you'd just query the top of each oplog in turn until you found one that's acceptable. This could have a lot of unforeseen negative impacts on spanning tree maintenance.
