[SERVER-9946] Rollback may not happen to primary when resumed from sleep state Created: 17/Jun/13  Updated: 10/Dec/14  Resolved: 06/Dec/13

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Henrik Ingo (Inactive) Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 12.04 on a Dell XPS13 laptop


Attachments: File rollbacks-when-old-primary-comes-back-take2.log     File rollbacks-when-old-primary-comes-back.log     File rstest.sh    
Issue Links:
Related
Operating System: ALL
Steps To Reproduce:

1) A script is inserting one document per second. (The documents happen to contain output of unix date command.) See rstest.sh.

2) The primary node (27001 in the log) is running in foreground in the shell and is stopped with Ctrl-Z.

3) Another node (27002) becomes primary as designed.

3) More inserts are directed to the new primary (via mongos node).

4) Inserts are stopped. No writes are happening anywhere.

5) A few minutes later the original primary continues (type fg in the shell). It now:

6) proceeds to write to oplog one pending write that it was handling when stopped (this is probably not reproducible 100% of the time, but with some statistical certainty. You can remove the sleep 1 from rstest.sh to increase probability.)

7) the oplog write gets the timestamp of current time, not the time when the insert was sent from client and this node was still primary.

8) the old primary sends heartbeats to other members claiming to be primary and also claiming to have most recent OpTime in its oplog.

9) the new/current primary freaks out, yields, and rolls back all the inserts it had written in (3) above. Joins the old primary as a secondary.

Note: in the syslog there are also repeated calls to FindCommonPoint for he next 2 minutes or so, complaining about a 10 seconds gap in oplogs. I didn't troubleshoot what that is about, but probably there may be another bug lurking somewhere?

Participants:

 Description   

A replica set primary is stopped with Ctrl-Z. After it is resumed, it completes one final write operation, returns to the replica set claiming to be primary and claiming to have a more recent state of the oplog. This causes the current (real) primary to yield and to roll back everything it has written in the meantime.

Proposed fix: The real primary node should be able to know that it has been elected with majorty support, and that writes accepted after that election are valid, and the returning primary should be rejected.



 Comments   
Comment by Eliot Horowitz (Inactive) [ 06/Dec/13 ]

Same issue as SERVER-10768

Comment by Henrik Ingo (Inactive) [ 17/Jun/13 ]

Here's another log file with a slightly different sequence of events that still results in significant rollbacks.

This is the same steps as above, except that node 1 (27001) was given priority 2.

When it wakes up, it actually receives information from node 2 being the new primary before sending any heartbeats itself. Still, a stray insert has found it's way to its oplog and maybe due to the higher priority it now claims the primary role, again forcing node 2 (27002) to roll back everything it has committed while primary. (I can see 2 minutes missing with db.collectionname.find())

Comment by Henrik Ingo (Inactive) [ 17/Jun/13 ]

I wanted to simulate a case were primary stops responding without triggering a close of the tcp/ip socket. Suspending the process was a way to do that, alternatively I could enclose wit a firewall that silently drops packets, and in real life of course this could happen for whatever network interruptions.

@Scott: To reproduce it is of course necessary that primary is stopped exactly at a time when it has received an insert but not yet written it to an oplog. Then it should succeed in writing to oplog before it sends out the first post-resume heartbeat, so that the first heartbeat will include the new advanced oplog state / optime. So yes, it's very much a race, but I hope the attached log makes it clear what happened (maybe I was lucky).

@Eric: I think the real bug is not whether primary was stopped with Ctrl-Z or something else. The problem is that the new primary (27002) is willing to rollback an arbitrarily long history of operations simply because a single node enters the replica set and erroneously claims that it should do so. At least from what I see here, there's nothing that would prevent losing hours of data due to a glitch like this:

  • node 1 is primary at time T0
  • node 1 is lost
  • node 2 assumes primary at time T1
  • node 2 runs as primary for hours, committing millions of operations to oplog
  • node 1 wakes up, commits a single operation with a optime higher than anything on node 2
  • node 1 re-joins replica set, sends heartbeats to everyone else claiming to be primary and claiming to have the most up to date oplog
  • (note that the above 2 points could be due to any bug or malfunction, the point is that node 2 needs to be empowered to protect against these invalid claims)
  • election protocol favors the most recent oplog so node 1 wins and node 2 has to rollback everything that happened for as long as it was primary

What should happen is that node 2 has recorded the fact that it won an election at time T1. Since node 1 claims to have committed operations for optime > T1, it should reject the claim of node 1 to be primary. Now node 2 and node 1 need to compare the times when they have been elected as primaries. node 1 will submit T0 and node 2 will submit T1. Seeing that T1 > T0 and it follows that anything node 2 has committed after T1 is valid and anything node 1 has committed after T1 is invalid. Hence node 1 needs to step down and roll back and node 2 may continue as primary.

Note that instead of comparing T0 and T1 as a separate step, more natural would be to add that information to the current heartbeat ping sent by the primary so that this comparison can be done immediately as part of the existing algorithm.

Note that with our election protocol it is never possible (except for short overlapping times of less than 2 seconds, but even this seems to be guarded against) for more than 1 node to actually win majority support in an election, so referring back to election results is a good way to decide between competing claims for being in primary state at a given point in time.

Comment by Eric Milkie [ 17/Jun/13 ]

To help me prioritize this, what's the use case here? I believe suspending the process is simulating something that might happen in real life, but I would like to know specifics.

Comment by Scott Hernandez (Inactive) [ 17/Jun/13 ]

In a reproduction I found it went the other way every time.

I'm not sure what you are suggesting is a good behavior or even possible to determine conclusively.

Generated at Thu Feb 08 03:21:52 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.