[SERVER-39367] lastOpCommitted being reset on restart can cause sync source cycle Created: 04/Feb/19  Updated: 29/Oct/23  Resolved: 27/Feb/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.1.9

Type: Bug Priority: Major - P3
Reporter: Judah Schvimer Assignee: Siyuan Zhou
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File sync_source_cycle.js    
Issue Links:
Backports
Depends
Duplicate
is duplicated by SERVER-39497 Revert SERVER-33248 Closed
Related
related to SERVER-40193 Do not propagate commit point through... Closed
related to SERVER-40194 Revert SERVER-33248 on 4.0 Closed
is related to SERVER-27123 Only update commit point via spanning... Closed
is related to SERVER-39353 disable chaining in rollback_test.js Closed
is related to SERVER-39497 Revert SERVER-33248 Closed
is related to SERVER-39626 Majority committed oplog entries may ... Closed
is related to SERVER-39831 Never update commit point beyond last... Closed
is related to SERVER-33248 Allow choosing a sync source that we ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0
Sprint: Repl 2019-03-11
Participants:
Linked BF Score: 15

 Description   

Consider two nodes in a replica set, A and B. Node A restarts (resetting its lastOpCommitted to 0) while its oplog is ahead of node B's. Node B syncs from A. Since A's lastOpCommitted is zero, A chooses to sync from B, which has a non-zero lastOpCommitted, even though B is already syncing from A.
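The two-node scenario can be simulated with a minimal sketch. The selection rule below is a deliberate simplification of the SERVER-33248 behavior (a candidate with a more advanced commit point is eligible even when its log is not ahead); `Node`, `choose_sync_source`, and the integer optimes are all hypothetical, not the real server types:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    last_applied: int      # models lastApplied
    last_committed: int    # models lastOpCommitted; reset to 0 on restart
    sync_source: Optional[str] = None

def choose_sync_source(me: Node, candidate: Node) -> bool:
    # Simplified SERVER-33248 rule: eligible if the candidate's log is ahead,
    # or if its commit point is more advanced than ours.
    return (candidate.last_applied > me.last_applied
            or candidate.last_committed > me.last_committed)

# Node A restarts: its log is ahead of B's, but lastOpCommitted is reset to 0.
a = Node("A", last_applied=5, last_committed=0)
b = Node("B", last_applied=3, last_committed=3, sync_source="A")

assert choose_sync_source(b, a)  # B keeps syncing from A: A's log is ahead
assert choose_sync_source(a, b)  # A also picks B: B's commit point (3) beats A's (0)
a.sync_source = "B"

# Result: a two-node sync source cycle A -> B -> A.
print(a.sync_source, b.sync_source)  # B A
```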



 Comments   
Comment by William Schultz (Inactive) [ 15/Mar/19 ]

To clarify, the multi node cycle repro referenced above is based on the original protocol where we propagate commit points only via the sync source spanning tree, and where we allow choosing a sync source with a higher commit point (SERVER-33248). It does not use the technique of propagating commit points via heartbeats, which is the new fix proposal implemented in this ticket. The intent was to demonstrate a more general case of the original bug (a 2 node cycle) brought up in this ticket.

Comment by William Schultz (Inactive) [ 12/Mar/19 ]

Here is a link to the raw trace. To summarize its essence, consider the following progression in a 4 node replica set (each row shows which log indexes a node has filled, along with its commitPoint and syncSource):

State 1 (Initial State)

index 1 2       
n1 1 1   commitPoint=2 syncSource=none
n2 1 1   commitPoint=0 syncSource=n1 
n3       commitPoint=0 syncSource=n1 
n4 1 1   commitPoint=0 syncSource=n2 

State 2

index 1 2      
n1 1 1    commitPoint=2 syncSource=none
n2 1 1   commitPoint=0 syncSource=n1
n3 1     commitPoint=2 syncSource=n1
n4 1 1   commitPoint=0 syncSource=n2

State 3

index 1 2       
n1 1 1   commitPoint=2 syncSource=none
n2 1 1   commitPoint=0 syncSource=n1
n3 1     commitPoint=2 syncSource=n4
n4 1 1   commitPoint=0 syncSource=n2

State 4

index 1 2       
n1 1 1   commitPoint=2 syncSource=none
n2 1 1   commitPoint=0 syncSource=n1
n3 1 1   commitPoint=2 syncSource=n4
n4 1 1   commitPoint=0 syncSource=n2

State 5

index 1 2       
n1 1 1   commitPoint=2 syncSource=none
n2 1 1   commitPoint=0 syncSource=n3
n3 1 1   commitPoint=2 syncSource=n4
n4 1 1   commitPoint=0 syncSource=n2

It should be clear that State 1 is an easy state to end up in through simple steady-state operation of the protocol. In State 2, n3 syncs an entry from n1 and advances its commit point. In State 3, n3 then decides to switch sync sources and sync from n4, which is legitimate, since n4's log is newer than n2's. In State 4, n3 then replicates a new log entry from n4, so that the logs of all nodes are now the same. Then, in State 5, n2 decides to switch sync sources and sync from n3, which is legitimate, since the commit point of n3 is higher than its own, even though their logs are the same. This creates the sync source cycle n2->n3->n4->n2. I was not able to reproduce a similar case in a 3 node replica set.
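The cycle in State 5 can be verified mechanically. `find_cycle` is a hypothetical helper that just follows syncSource pointers; the node names and assignments match the tables above:

```python
def find_cycle(sync_source):
    """Follow syncSource pointers from each node; return the first cycle found,
    as the list of nodes participating in it, or None if the graph is acyclic."""
    for start in sync_source:
        seen = []
        node = start
        while node is not None and node not in seen:
            seen.append(node)
            node = sync_source.get(node)
        if node is not None:
            return seen[seen.index(node):]
    return None

# syncSource assignments from State 5 above.
state5 = {"n1": None, "n2": "n3", "n3": "n4", "n4": "n2"}
print(find_cycle(state5))  # ['n2', 'n3', 'n4']
```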

Comment by Siyuan Zhou [ 12/Mar/19 ]

Using the TLC model checker, william.schultz found that it's possible to form a sync source cycle with more than 2 nodes. william.schultz, could you please post the TLC trace here? It can only happen in a replset of at least 4 nodes, right?

I'm asking because detecting a 2-node cycle in shouldChangeSyncSource() is trivial, since we know the sync source of our sync source. This might be a potential workaround in a backport to avoid cycles in mixed-version replsets.

Never mind about the backport; a multi-node cycle could happen more frequently in mixed-version replsets.
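The trivial 2-node check mentioned above can be sketched in a couple of lines; `would_form_two_node_cycle` is a hypothetical name, standing in for a check inside shouldChangeSyncSource():

```python
def would_form_two_node_cycle(my_name, candidates_sync_source):
    # Since a node learns its candidate's own sync source (e.g. via heartbeat
    # metadata), it can refuse a candidate that is already syncing from it.
    return candidates_sync_source == my_name

# A considering B, while B reports that it syncs from A: refuse.
print(would_form_two_node_cycle("A", candidates_sync_source="A"))  # True
# A considering B, while B syncs from some third node C: allowed.
print(would_form_two_node_cycle("A", candidates_sync_source="C"))  # False
```

As the follow-up notes, this only catches 2-node cycles; longer cycles like the 4-node one above would slip through.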

Comment by Tess Avitabile (Inactive) [ 27/Feb/19 ]

As I recall, we agreed that we would not re-enable chaining in the rollback test fixture. Although it was valuable to catch this sync source cycle bug, the intention of that fixture is not to test for sync source cycles, and enabling chaining in the fixture requires us to be very careful about when replication is enabled on the tiebreaker node. I think we can close SERVER-39497.

Comment by Siyuan Zhou [ 27/Feb/19 ]

judah.schvimer, my patch didn't revert SERVER-39353 to re-enable chaining in the rollback test. Would you mind if I repurpose SERVER-39497 to revert SERVER-39353? SERVER-39497 has been done as part of my patch anyway.

Comment by Githook User [ 27/Feb/19 ]

Author:

{'name': 'Siyuan Zhou', 'username': 'visualzhou', 'email': 'siyuan.zhou@mongodb.com'}

Message: SERVER-39367 Advance commit point when it has the same term as the last applied
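The rule in this commit message can be sketched as follows. `OpTime` and `advance_commit_point` are hypothetical simplifications (an optime modeled as a (term, timestamp) pair), not the real replication coordinator code:

```python
from collections import namedtuple

# Simplified optime: compares lexicographically by (term, ts), like the real one.
OpTime = namedtuple("OpTime", ["term", "ts"])

def advance_commit_point(my_last_applied, my_commit_point, learned_commit_point):
    # Only advance the commit point when its term matches our lastApplied's
    # term, which guarantees the commit point lies on our own branch of history.
    if learned_commit_point.term == my_last_applied.term:
        return max(my_commit_point, learned_commit_point)
    return my_commit_point

# Same term as our lastApplied: the learned commit point is accepted.
print(advance_commit_point(OpTime(3, 10), OpTime(3, 5), OpTime(3, 9)))
# Different term (we may be on a diverged branch): it is ignored.
print(advance_commit_point(OpTime(2, 10), OpTime(1, 5), OpTime(3, 9)))
```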

Comment by Judah Schvimer [ 26/Feb/19 ]

This problem and SERVER-33248 are both liveness problems and can be solved by restart or a replSetSyncFrom command

Are you saying the "minimal backport" will not fix the sync source cycle problem completely?

take the minimum of commit point and last applied on learning via spanning tree

One side effect of this will be that it will be slightly slower to advance the commit point on secondaries since the commit point cannot be ahead of lastApplied even if they're in the same term. This could cause some cache pressure, but I don't think a significant amount.

Comment by Tess Avitabile (Inactive) [ 25/Feb/19 ]

siyuan.zhou, yes, that solution sounds right to me.

Comment by Siyuan Zhou [ 25/Feb/19 ]

judah.schvimer, yes. Per our discussion at the standup, we'll seek a minimal backport to 3.6 (and perhaps 3.4) by adding the term check. This problem and SERVER-33248 are both liveness problems and can be resolved by a restart or a replSetSyncFrom command. A new write will resolve SERVER-33248, and maxSyncSourceLagSecs (default: 30 seconds) can break the sync source cycle.

Mistaking the diverged branch as committed is more of a correctness issue, though. The above proposal to take the minimum of the commit point and the last applied on learning via the spanning tree seems to fix the correctness issue as well. We may be able to just backport this fix and leave the liveness issues out for now.

Comment by Siyuan Zhou [ 25/Feb/19 ]

tess.avitabile has a concern that if a secondary is lagged, then it won't be able to update its commit point and throw away history.

Actually, when a secondary is lagged, it can still learn of the latest commit point far ahead of it as long as the commit point's term is the same as its last applied's, so the gap should be the same as that on other up-to-date nodes, even though the stale node cannot update its commit point after the failover until it reaches the latest term. The only problem I can think of is that after restarting the stale node, it forgets the commit point in the old term and is not able to update its commit point.

One solution is to relax the term check when learning from the sync source, but only update the commit point to min(commit point, my last applied). Requiring the term check when learning the commit point ensures that the commit point is always on the node's own branch; the spanning tree ensures the syncing node is on the same branch as its sync source, so the syncing node knows it is on the same branch as the commit point even if their terms differ. CC tess.avitabile: if this sounds right to you, I can file another ticket for it.
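The relaxed rule proposed here could look roughly like the sketch below; the function name and the (term, timestamp) `OpTime` model are hypothetical simplifications:

```python
from collections import namedtuple

# Simplified optime: compares lexicographically by (term, ts).
OpTime = namedtuple("OpTime", ["term", "ts"])

def learn_commit_point_from_sync_source(my_last_applied, my_commit_point, learned):
    # Accept a commit point from the sync source regardless of term, but clamp
    # it to our lastApplied: the spanning tree already keeps us on the same
    # branch as the sync source, so anything we have applied is safe to commit.
    candidate = min(learned, my_last_applied)
    return max(my_commit_point, candidate)

# A lagged node whose lastApplied is in an old term can still advance its
# commit point up to what it has applied, even after the commit point's term
# has moved on (e.g. after a restart wiped the in-memory commit point).
print(learn_commit_point_from_sync_source(OpTime(1, 7), OpTime(1, 3), OpTime(3, 20)))
```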

Comment by Judah Schvimer [ 25/Feb/19 ]

siyuan.zhou, does the above describe a bug in previous versions? If so, I think we need to fix that on 3.4 and 3.6 in addition to 4.0 and 4.2 which will likely get this sync source cycle fix. This seems like a bug we could easily miss since I think it would only occur doing majority reads on secondaries. Is that correct?

Comment by Siyuan Zhou [ 24/Feb/19 ]

The current way of learning commit point from spanning tree can still learn of a commit point in higher terms from its sync source, then switch to a stale branch and mark the stale branch as committed by mistake. Here's a concrete example with 5 nodes. Initially Node A is the primary in term 1. Node E is delayed.

A: [1] [1]
B: [1] [1]
C: [1] [1]
D: [1] [1]
E: [1]

Node B steps up in term 2, writes an entry.

A: [1] [1]
B: [1] [1] [2]
C: [1] [1]
D: [1] [1]
E: [1]

Node A steps up again in term 3 with votes from A, C, and D, and makes its new oplog entry in term 3 majority committed.

A: [1] [1] [3]
B: [1] [1] [2]
C: [1] [1] [3]
D: [1] [1] [3]
E: [1]

Node E syncs from A and learns of the new commit point in term 3.

A: [1] [1] [3]
B: [1] [1] [2]
C: [1] [1] [3]
D: [1] [1] [3]
E: [1] [1]           // It knows of the latest commit point in term 3

Then Node E switches its sync source to B, replicates the stale branch of term 2 and mistakes that branch as committed.

A: [1] [1] [3]
B: [1] [1] [2]
C: [1] [1] [3]
D: [1] [1] [3]
E: [1] [1] [2]   // It mistakes the entry in term 2 as committed.

This problem can be solved by only learning the commit point if it's in the same branch as mine, by comparing the terms of my lastApplied and the commit point.
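The effect of that term comparison on node E in the scenario above can be sketched as follows; `may_learn_commit_point` is a hypothetical predicate, with optimes modeled as (term, timestamp) pairs:

```python
from collections import namedtuple

OpTime = namedtuple("OpTime", ["term", "ts"])

def may_learn_commit_point(my_last_applied, commit_point, term_check):
    # With the term check, a commit point is only accepted when its term
    # matches my lastApplied's term, i.e. it provably lies on my branch.
    return (not term_check) or commit_point.term == my_last_applied.term

# E has applied only the two term-1 entries; the commit point it hears of
# from A is in term 3.
e_last_applied = OpTime(1, 2)
commit_point = OpTime(3, 3)

# Without the check E accepts it, later replicates B's stale term-2 branch,
# and wrongly treats that branch as committed; with the check it refuses.
print(may_learn_commit_point(e_last_applied, commit_point, term_check=False))  # True
print(may_learn_commit_point(e_last_applied, commit_point, term_check=True))   # False
```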

Comment by William Schultz (Inactive) [ 07/Feb/19 ]

I'm on board with the heartbeat propagation solution proposed by siyuan.zhou as well. If it's easier to implement than the alternative, it seems to safely achieve the desired goal.

Comment by Judah Schvimer [ 07/Feb/19 ]

tess.avitabile, if we wanted to go with updating lastOpCommitted based on a validated sync source candidate (we would still have to do some of the OplogFetcher::checkRemoteOplogStart checks), I would do that on top of SERVER-33248 rather than reverting it, given the pains we had across the board before SERVER-33248.

siyuan.zhou, I think your solution regarding heartbeats is safe and I really like it. It seems clean, easy, and accomplishes the same goal. I would propose potentially doing this in conjunction with a partial revert of SERVER-33248.

Comment by Siyuan Zhou [ 07/Feb/19 ]

Will made a good point. In general, imagine two branches of the spanning tree sharing the same root (the primary): both branches can propagate their oplog entries and timestamps at their own paces. When the nodes from the two branches check with each other, the problem in the description becomes possible. Will gave a concrete example.

The problem with propagating the commit point via heartbeats on secondaries is that the commit point can come from a diverged history. We force the commit point to flow through the spanning tree to make sure the commit point is on the same branch of history. Alternatively, we could update the commit point via heartbeat on secondaries only if the commit point's term is the same as my last applied's term; then we are guaranteed to be on the same branch as the commit point. We could then revert SERVER-33248 and have the commit point propagate via heartbeats.

Comment by Tess Avitabile (Inactive) [ 05/Feb/19 ]

I like the idea of allowing a node to update its lastOpCommitted based on a sync source candidate, even if it cannot select that sync source. I wonder if it's better to revert SERVER-33248 first and then try this change, or try to do them at the same time, so that we don't affect tests. What do you think?

Comment by Judah Schvimer [ 05/Feb/19 ]

I think reverting SERVER-33248 is probably the best way forward, though not having SERVER-33248 caused many headaches, so I want to consider alternatives first.

I've been thinking about two ideas, though:

  1. Can OplogFetcher::checkRemoteOplogStart say that the sync source is invalid if the lastApplied optimes are equal but the lastOpCommitted optimes are not, but still allow the OplogFetcher to process the metadata it receives? This could allow us to say a sync source is invalid (preventing a cycle) but still advance our lastOpCommitted safely (since we have acknowledged that in this state it is safe to trust the sync source's metadata; it just may cause a cycle).
  2. When chaining is disabled, we attempt to sync from the primary no matter what, and then OplogFetcher::checkRemoteOplogStart says it is invalid if it is not ahead. If there are no valid sync sources, and we implemented (1) above, we could choose our knowledge of the primary as our sync source candidate and only use it for its metadata. I do not think it would be safe to sync from nodes we think are the primary if they are only ahead of us by lastOpCommitted, since there could still be a cycle of outdated information about the current primary. However, there is little harm in sending an extra find command to a node if the alternative is doing nothing. I'm also not 100% sure that this solves the problems of SERVER-33248. Are there circumstances where the node we think is primary does not have the most up-to-date lastOpCommitted? I think that if the cluster is healthy (which is the state SERVER-33248 is trying to solve), this is a fair assumption, at least eventually (which is all that matters for SERVER-33248).

It seems william.schultz concurrently had a similar idea.
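Idea (1) amounts to splitting one decision into two. The sketch below uses hypothetical helper names (`trust_metadata`, `valid_sync_source`) and deliberately simplified predicates, not the real checkRemoteOplogStart logic:

```python
from collections import namedtuple

Meta = namedtuple("Meta", ["last_applied", "last_committed"])

def trust_metadata(mine: Meta, theirs: Meta) -> bool:
    # Safe to adopt a higher commit point from a node whose log is at least
    # as new as ours: its metadata describes a history we share.
    return theirs.last_applied >= mine.last_applied

def valid_sync_source(mine: Meta, theirs: Meta) -> bool:
    # Invalid when the logs are equal: syncing then would only chase the
    # commit point, which is exactly the configuration that lets a cycle form.
    return theirs.last_applied > mine.last_applied

mine, theirs = Meta(last_applied=4, last_committed=1), Meta(last_applied=4, last_committed=2)
# We may advance lastOpCommitted from this candidate, but not sync from it.
print(trust_metadata(mine, theirs), valid_sync_source(mine, theirs))  # True False
```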

Comment by William Schultz (Inactive) [ 05/Feb/19 ]

One other thought: For two nodes A, B where:

  1. A.lastApplied = B.lastApplied
  2. A.lastCommitted < B.lastCommitted

we could consider allowing node A to update its lastCommittedOpTime from B without choosing it as a sync source. It should be safe for A to learn of a new commit point from a node that has the same oplog as itself, but we wouldn't allow A to actually start syncing from B. This may address the original issue raised in SERVER-33248 about secondary majority reads not progressing, but avoid the sync source cycle formation scenario outlined above.

Comment by William Schultz (Inactive) [ 05/Feb/19 ]

judah.schvimer Sorry, there was a typo in step 7 of my previous comment. The full sequence with step 7 fixed is instead:

  1. All nodes initially have lastApplied=1, lastCommitted=1. n0 is primary.
  2. n0 applies operations at timestamps 2, 3, and 4, updating its lastApplied to 4.
  3. n0 replicates all of its operations to n1, updating n1.lastApplied=4.
  4. n1 gets partitioned from all other nodes.
  5. n0 replicates its operation at timestamp 2 to nodes n2 and n3 and it becomes majority committed.
  6. n2 and n3 replicate the operation at timestamp 3 from the primary and learn of the new commit point (timestamp 2) at the same time. At this point we have n0.lastCommitted=2, n2.lastCommitted=2 and n3.lastCommitted=2.
  7. We now have n2.lastApplied=3, n2.lastCommitted=2 and n1.lastApplied=4, n1.lastCommitted=1.
  8. n2 chooses to sync from n1 because n1.lastApplied=4 > n2.lastApplied=3
  9. n2 catches up with n1 so that n2.lastApplied=n1.lastApplied=4. We still have n2.lastCommitted=2
  10. n1 chooses to sync from n2 because n2.lastCommitted=2 > n1.lastCommitted=1.

I wrote "n0" instead of "n1" in step 7 in the original comment. Step 7 is what sets the stage for a possible cycle. When I said "pre-condition", I meant that it was a state that could allow for a cycle to subsequently form based solely on the sync source selection rules, even though it might not. The last 3 steps I added produce the actual cycle. It looks like tess.avitabile already corrected my errors and demonstrated an analogous cycle case above.

Comment by Tess Avitabile (Inactive) [ 05/Feb/19 ]

Once we have state 7, we can get to a sync source cycle:

8. n2 syncs from n0 because n2.lastApplied=3<4=n0.lastApplied.

9. n2 catches up to n0, so n2.lastApplied=4=n0.lastApplied.

10. n0 syncs from n2 because n0.lastApplied=4=n2.lastApplied and n0.lastCommitted=1<2=n2.lastCommitted.

I think we may need to revert SERVER-33248. I think the fallacy of that work was the assumption that if A is syncing from B, then A.lastCommitted<=B.lastCommitted. Another option is to add the requirement for sync source selection that source.lastCommitted>=self.lastCommitted. However, this may have unintended consequences.
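The additional requirement floated here can be sketched against the scenario in the description; `valid_with_requirement` and the `Meta` tuple are hypothetical simplifications:

```python
from collections import namedtuple

Meta = namedtuple("Meta", ["last_applied", "last_committed"])

def valid_with_requirement(mine: Meta, candidate: Meta) -> bool:
    # Candidate must be ahead in its log AND its commit point must be at
    # least ours (the proposed extra requirement).
    return (candidate.last_applied > mine.last_applied
            and candidate.last_committed >= mine.last_committed)

a = Meta(last_applied=5, last_committed=0)  # restarted; lastOpCommitted reset
b = Meta(last_applied=3, last_committed=3)

# Under the requirement, B may no longer pick the freshly restarted A, since
# A's commit point (0) fell behind B's (3); the two-node cycle cannot close.
print(valid_with_requirement(b, a))  # False
```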

Comment by Judah Schvimer [ 05/Feb/19 ]

n2 and n3 replicate the operation at timestamp 3 from the primary and learn of the new commit point (timestamp 2) at the same time. At this point we have n0.lastCommitted=2, n2.lastCommitted=2 and n3.lastCommitted=2.
We now have n2.lastApplied=3, n2.lastCommitted=2 and n0.lastApplied=4, n0.lastCommitted=1.

william.schultz, these two lines seem to disagree on the state of n0.

Also, just being at state 7 does not mean we'll develop a cycle. At step 7, the node with the higher last applied should not choose the other node as a sync source because it has a lower lastApplied, even though its lastCommitted is greater.

Comment by William Schultz (Inactive) [ 05/Feb/19 ]

I do think that it is possible in general for two replica set nodes, A, B to satisfy:

  1. A.lastApplied > B.lastApplied
  2. A.lastCommitted < B.lastCommitted
  3. A.lastCommitted != 0

The first two conditions seem to be the key pre-condition for sync source cycle formation. Consider the following scenario in a 4 node replica set with nodes n0, n1, n2, n3:

  1. All nodes initially have lastApplied=1, lastCommitted=1. n0 is primary.
  2. n0 applies operations at timestamps 2, 3, and 4, updating its lastApplied to 4.
  3. n0 replicates all of its operations to n1, updating n1.lastApplied=4.
  4. n1 gets partitioned from all other nodes.
  5. n0 replicates its operation at timestamp 2 to nodes n2 and n3 and it becomes majority committed.
  6. n2 and n3 replicate the operation at timestamp 3 from the primary and learn of the new commit point (timestamp 2) at the same time. At this point we have n0.lastCommitted=2, n2.lastCommitted=2 and n3.lastCommitted=2.
  7. We now have n2.lastApplied=3, n2.lastCommitted=2 and n0.lastApplied=4, n0.lastCommitted=1.

This establishes the pre-condition referenced above, and should be sufficient for a sync source cycle to subsequently form. If n2 and n1 were to be re-connected directly to each other, they would be able to form a 2 node cycle. This repro (sync_source_cycle.js) demonstrates a roughly analogous case.

Comment by Judah Schvimer [ 04/Feb/19 ]

This fix should revert SERVER-39353 (18dde98af61a71b1bc5ec990c2184c5725d9b0c0) to re-enable chaining in rollback_test.js.

Comment by Judah Schvimer [ 04/Feb/19 ]

A key question that william.schultz raised is: "is it impossible for two nodes A, B to have A.lastApplied > B.lastApplied but A.lastCommitted < B.lastCommitted?", and if it's possible, is it only possible when A.lastCommitted = 0?

Generated at Thu Feb 08 04:51:49 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.