[SERVER-13995] on stepdown/election, SECONDARY should consume all available oplog Created: 19/May/14 Updated: 17/Feb/15 Resolved: 17/Feb/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kenny Gorman | Assignee: | Eric Milkie |
| Resolution: | Duplicate | Votes: | 8 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
If the user performs a stepDown on an otherwise healthy PRIMARY while the SECONDARY has N seconds of lag, that N seconds of activity is lost (it is rolled back when the old primary rejoins the set). Since the PRIMARY is still healthy, the SECONDARY should try to consume all available oplog while the election takes place (there is no write activity to the cluster during this time). This would allow users to call stepDown on busy sets that may always have some lag, and ensure they aren't dropping data on the floor when they don't need to. |
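As a rough illustration of what this asks to automate, an operator can approximate it by hand in the mongo shell: check secondary lag and only step down once the secondaries have drained the oplog. This is only a sketch; the 30-second threshold is an arbitrary assumption, not a server default.

```javascript
// In the mongo shell, connected to the current PRIMARY.
// Show how far each secondary is behind the primary's last oplog entry.
rs.printSlaveReplicationInfo()

// Only step down once every SECONDARY is within an acceptable lag window.
// The 30-second threshold here is an arbitrary example value.
var status = rs.status();
var primary = status.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
var lags = status.members
    .filter(function (m) { return m.stateStr === "SECONDARY"; })
    .map(function (m) { return (primary.optimeDate - m.optimeDate) / 1000; });

if (Math.max.apply(null, lags) <= 30) {
    rs.stepDown(60);   // do not stand for re-election for 60 seconds
} else {
    print("Secondaries are still lagging; not stepping down.");
}
```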
| Comments |
| Comment by Eric Milkie [ 17/Feb/15 ] |
|
The replSetStepDown command now takes a timeout period to allow an admin to avoid rollbacks. |
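For reference, the option as it shipped in 3.0 is the secondaryCatchUpPeriodSecs argument to replSetStepDown; the values below are illustrative.

```javascript
// Wait up to 10 seconds for an electable secondary to catch up before
// stepping down; if none catches up, the command fails rather than
// triggering a stepdown that would force a rollback.
db.adminCommand({ replSetStepDown: 60, secondaryCatchUpPeriodSecs: 10 })

// Shell helper equivalent: rs.stepDown(stepDownSecs, secondaryCatchUpPeriodSecs)
rs.stepDown(60, 10)
```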
| Comment by Andy Schwerin [ 12/Feb/15 ] |
|
kennygorman, in 3.0 we've changed the stepDown behavior as part of [a linked ticket]. I'm tempted to resolve this issue as fixed by [that ticket]. |
| Comment by Kenny Gorman [ 24/Jul/14 ] |
|
Any more thoughts about implementing a fix for this condition? |
| Comment by charity majors [ 21/May/14 ] |
|
Something very similar bit us recently. We had two secondaries doing foreground index builds, one ~3 hours behind but priority 0, and the other ~6 hours behind. Heartbeat flapping forced an election, and it rolled back to the secondary that was ~6 hours behind. The old primary entered ROLLBACK state (but couldn't roll back because it had more than 300 MB of ops), and the other secondary entered FATAL state because it was ahead of the new primary. This was pretty terrible. Not sure what the correct solution is here; maybe back off forcing an election when heartbeats are flapping, or when the secondary is very far behind? Replaying the ops on the secondary when it is not too far behind seems like possibly a good idea, though I'm not sure what other terrible failure scenarios that could cause. |
| Comment by David Murphy [ 21/May/14 ] |
|
Hi Andy, I agree, but in one case we saw the following: the primary was healthy, a heartbeat issue caused an election, and because of that error the old primary was not selected; the secondary that was behind was elected instead. This resulted in 3 hours of w:1 data being removed, which should not have happened. If w:majority had been used, you are correct that this could not happen; however, I think w:1 is still a very common write concern and we should protect it as well. Additionally, there is a case where one secondary is catching up and something happens to another around the time you issue a stepDown, so only one viable candidate is found other than the old primary. In that case I think we should re-elect the old primary, since the remaining secondary is lagged and would trigger a rollback, even though the machine that went down was more up to date. My thought is that, in today's model with no other change, the election should hold for 5 minutes (like the veto case for a primary sometimes does), waiting either for the third secondary, which is more up to date, to come back online and catch up, or for the timeout to hit so the old primary can be re-elected; the lagged secondary should not be considered a valid candidate. I would ask which is more dangerous: blocking writes for 5 minutes at w:0, or deleting 3 hours of data at both w:0 and w:1? |
| Comment by Andy Schwerin [ 21/May/14 ] |
|
To be clear, dmurphy, when a client gets a response for a "w:majority" write, only forced reconfigurations should cause that write to roll back. Kenny's case and the proposal I derived from it really only apply to writes confirmed with less than "w:majority", and maybe to giving "w:majority" writes an opportunity to complete and respond before voluntary stepdown. In those cases, while the application should be prepared for those writes to roll back due to failover, it's a favor to the operator not to roll them back during planned step-down. |
| Comment by David Murphy [ 21/May/14 ] |
|
To Kenny's point about a parameter, I think it would be great if nodes could veto themselves when they are too old and still catching up. If we did that, I wonder whether we would also need a setting somewhere (conf or config.settings) controlling whether mongod should allow or disallow rollbacks. We as a community don't like adding configuration options, but this is a major tunable for how HA should work: it is a trade-off between a prolonged lack of a primary and potential data loss / logical corruption if a rollback is performed. Andy, to your point, I think such a setting would help remediate the issue with fire-and-forget writes, since you could plan your logic around it. I would also assert that, because a fire-and-forget write carries no guarantee that the database saved the data, a rollback is not "critical" for those writes in the way it is for higher write concerns, where the database reported success in saving the data to 1+N nodes and/or journals and no such data removal is expected. |
| Comment by Andy Schwerin [ 21/May/14 ] |
|
Kenny, I believe you're proposing approximately the following behavior during operator-driven step downs. In addition to specifying the duration of the demurral period, the operator specifies the duration of the "catch-up period", during which time the set will not accept writes but the original node remains primary. When that parameter is specified, the primary waits for secondaries to catch up to the last accepted write. Once a majority of members have oplogged that write, or when the period expires, the primary actually steps down and an election takes place. If the demurral period is shorter than the catch-up period, or a "do not step down if secondaries not caught up" flag is set, and sufficient secondaries have not caught up by the end of the catch-up period, then the primary would simply not step down, or would stand for reelection. During the catch-up period, writes would be rejected. There might be some sticky issues around that for fire-and-forget writes, but I haven't yet found a flaw with the core of the algorithm. |
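A rough operator-side approximation of that catch-up loop, sketched in the mongo shell. Unlike the server-internal design described above, it cannot reject writes while it runs, and the 30-second period, 1-second poll, and majority check are assumptions for illustration only.

```javascript
// Poll until a majority of members have the primary's last write, or until
// the catch-up period expires, then step down (or give up).
var catchUpPeriodMillis = 30 * 1000;   // illustrative catch-up period
var deadline = Date.now() + catchUpPeriodMillis;
var caughtUp = false;

while (Date.now() < deadline && !caughtUp) {
    var status = rs.status();
    var primary = status.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
    // Count members (including the primary itself) that have the last write.
    var upToDate = status.members.filter(function (m) {
        return m.optimeDate && m.optimeDate.getTime() >= primary.optimeDate.getTime();
    }).length;
    caughtUp = upToDate > status.members.length / 2;   // strict majority
    if (!caughtUp) {
        sleep(1000);   // poll once per second
    }
}

if (caughtUp) {
    rs.stepDown(60);   // a majority has the last write; step down now
} else {
    print("Catch-up period expired without a majority catching up; not stepping down.");
}
```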
| Comment by Kenny Gorman [ 21/May/14 ] |
|
Another possibility would be to provide a parameter to stepDown() asking for the election not to finish until N seconds have elapsed, so the slaves can catch up. If they aren't caught up, the old primary is re-elected after that period. Like a "no, for serious, don't rollback" mode. |
| Comment by Kenny Gorman [ 21/May/14 ] |
|
Eric, thanks for the reply. Is there a SERVER ticket for the design? It sounds like the design still won't guarantee we won't roll back. The use case here is that we sometimes call stepDown() on healthy primaries in order to move workloads around. When we do this, we would like to instruct MongoDB not to tolerate data loss (it shouldn't need to lose data). We need a way to communicate that to MongoDB. The problem is that:
It would be OK to have the stepDown take longer while the soon-to-be PRIMARY consumes 100% of the available oplog, to ensure it's as up to date as possible. Something roughly like: if stepDown() && applyAllOplogMode, then apply all remaining oplog before demoting, where applyAllOplogMode is set via the command line at startup. |
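A sketch of how that proposal might look to an operator. applyAllOplogMode is the hypothetical flag from this comment; it does not exist in mongod, and the behavior in the comments below is the proposal, not shipped functionality.

```javascript
// HYPOTHETICAL -- applyAllOplogMode is the startup flag proposed in this
// comment; it is not a real mongod parameter.
//
//   mongod --replSet rs0 --setParameter applyAllOplogMode=true
//
// With the flag set, stepDown() would block until the secondaries have
// consumed all available oplog before the primary actually demotes itself:
rs.stepDown(60)   // proposed: does not complete until secondaries are fully caught up
```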
| Comment by Eric Milkie [ 21/May/14 ] |
|
We are planning to change the order in which the oplog is written relative to when the writes are applied on secondary nodes. This will go a long way toward reducing the amount of data that could be rolled back after a primary demotion. |
| Comment by Kenny Gorman [ 20/May/14 ] |
|
Any thoughts? |