[SERVER-9946] Rollback may not happen to primary when resumed from sleep state Created: 17/Jun/13 Updated: 10/Dec/14 Resolved: 06/Dec/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.4.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Henrik Ingo (Inactive) | Assignee: | Unassigned |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Ubuntu 12.04 on a Dell XPS13 laptop |
||
| Attachments: |
|
||||
| Issue Links: |
|
||||
| Operating System: | ALL | ||||
| Steps To Reproduce: | 1) A script inserts one document per second. (The documents happen to contain the output of the unix date command.) See rstest.sh (a rough sketch of an equivalent insert loop follows this table).
2) The primary node (27001 in the log) is running in the foreground in the shell and is stopped with Ctrl-Z.
3) Another node (27002) becomes primary, as designed.
4) More inserts are directed to the new primary (via the mongos node).
5) Inserts are stopped. No writes are happening anywhere.
6) A few minutes later the original primary is resumed (type fg in the shell). It then:
7) proceeds to write to its oplog one pending write that it was handling when stopped. (This is probably not reproducible 100% of the time, but with some statistical certainty; you can remove the sleep 1 from rstest.sh to increase the probability.)
8) The oplog write gets the current time as its timestamp, not the time when the insert was sent from the client and this node was still primary.
9) The old primary sends heartbeats to the other members, claiming to be primary and also claiming to have the most recent OpTime in its oplog.
10) The new/current primary freaks out, yields, and rolls back all the inserts it had written in step (4) above, then joins the old primary as a secondary.
Note: in the syslog there are also repeated calls to FindCommonPoint for the next 2 minutes or so, complaining about a 10-second gap in the oplogs. I didn't troubleshoot what that is about, but there may be another bug lurking somewhere. |
||||
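The attached rstest.sh is not shown in this ticket. Purely as an illustration of the write pattern described in step 1, here is a minimal Python/pymongo equivalent; the mongos address (localhost:27017) and the test.rstest namespace are assumptions, not details taken from the ticket or the attachment.

```python
# Hypothetical stand-in for the attached rstest.sh: insert one document per
# second, each containing the current date, through the mongos router.
# The mongos port (27017) and the test.rstest namespace are assumptions.
import time
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # mongos, not a replica set member
coll = client.test.rstest

while True:
    # Equivalent of inserting the output of `date`; removing the sleep increases
    # the chance of suspending the primary while it is handling a pending insert.
    coll.insert_one({"date": datetime.now(timezone.utc).isoformat()})
    time.sleep(1)
```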
| Participants: | |||||
| Description |
|
A replica set primary is stopped with Ctrl-Z. After it is resumed, it completes one final write operation and returns to the replica set claiming to be primary and to have the most recent oplog state. This causes the current (real) primary to yield and to roll back everything it has written in the meantime. Proposed fix: the real primary node should know that it was elected with majority support and that writes accepted after that election are valid, so the returning primary's claim should be rejected. |
| Comments |
| Comment by Eliot Horowitz (Inactive) [ 06/Dec/13 ] |
|
Same issue as |
| Comment by Henrik Ingo (Inactive) [ 17/Jun/13 ] |
|
Here's another log file with a slightly different sequence of events that still results in significant rollbacks. These are the same steps as above, except that node 1 (27001) was given priority 2. When it wakes up, it actually learns that node 2 is the new primary before sending any heartbeats itself. Still, a stray insert has found its way into its oplog, and perhaps because of the higher priority it now claims the primary role, again forcing node 2 (27002) to roll back everything it had committed while primary. (I can see 2 minutes of data missing with db.collectionname.find().) |
| Comment by Henrik Ingo (Inactive) [ 17/Jun/13 ] |
|
I wanted to simulate a case where the primary stops responding without triggering a close of the TCP/IP socket. Suspending the process was one way to do that; alternatively I could have fenced it off with a firewall that silently drops packets, and in real life of course this could happen due to any number of network interruptions. @Scott: To reproduce it, it is of course necessary that the primary is stopped exactly at a time when it has received an insert but not yet written it to the oplog. It must then succeed in writing to the oplog before it sends out its first post-resume heartbeat, so that the first heartbeat already includes the newly advanced oplog state / optime. So yes, it's very much a race, but I hope the attached log makes it clear what happened (maybe I was lucky). @Eric: I think the real bug is not whether the primary was stopped with Ctrl-Z or something else. The problem is that the new primary (27002) is willing to roll back an arbitrarily long history of operations simply because a single node enters the replica set and erroneously claims that it should do so. As far as I can see, there's nothing that would prevent losing hours of data due to a glitch like this.
What should happen is that node 2 has recorded the fact that it won an election at time T1. Since node 1 claims to have committed operations with optime > T1, node 2 should reject node 1's claim to be primary. Node 2 and node 1 then need to compare the times at which they were elected primary: node 1 will submit T0 and node 2 will submit T1. Seeing that T1 > T0, it follows that anything node 2 has committed after T1 is valid and anything node 1 has committed after T1 is invalid. Hence node 1 needs to step down and roll back, and node 2 may continue as primary. Note that instead of comparing T0 and T1 as a separate step, it would be more natural to add that information to the heartbeat ping already sent by the primary, so that the comparison can be done immediately as part of the existing algorithm. Note also that with our election protocol it is never possible (except for short overlaps of less than 2 seconds, and even this seems to be guarded against) for more than one node to actually win majority support in an election, so referring back to election results is a good way to decide between competing claims to be primary at a given point in time. |
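As a rough illustration of the comparison proposed in the comment above (this is not MongoDB's actual heartbeat handling; the field names and values here are assumptions made for the sketch), the decision rule might look like this:

```python
# Hedged sketch of the proposed rule: resolve competing primary claims by
# comparing when each node won a majority election, not just oplog optimes.
# Field names ("optime", "election_time") are illustrative, not a real wire format.
from dataclasses import dataclass

@dataclass
class PrimaryClaim:
    node_id: int
    optime: float         # timestamp of the node's latest oplog entry
    election_time: float  # time the node last won a majority election

def resolve_primary_conflict(local: PrimaryClaim, remote: PrimaryClaim) -> str:
    """What the local (currently primary) node should do when a remote
    node's heartbeat also claims to be primary."""
    if remote.election_time > local.election_time:
        # The remote node won a later majority election, so its claim wins;
        # the local node steps down and rolls back writes made after that election.
        return "step down and roll back"
    # The local election is more recent: writes accepted since it are valid,
    # even if the returning node advertises a higher optime (the stray write).
    return "reject remote claim; remote must step down and roll back"

# In the ticket's scenario: node 2 was elected at T1 > T0, so node 1's
# post-resume claim is rejected despite its stray oplog entry's newer optime.
node1 = PrimaryClaim(node_id=1, optime=105.0, election_time=10.0)  # elected at T0
node2 = PrimaryClaim(node_id=2, optime=100.0, election_time=50.0)  # elected at T1
print(resolve_primary_conflict(local=node2, remote=node1))
```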
| Comment by Eric Milkie [ 17/Jun/13 ] |
|
To help me prioritize this, what's the use case here? I believe suspending the process is simulating something that might happen in real life, but I would like to know specifics. |
| Comment by Scott Hernandez (Inactive) [ 17/Jun/13 ] |
|
In a reproduction I found that it went the other way every time. I'm not sure that what you are suggesting is good behavior, or even possible to determine conclusively. |