[SERVER-16570] writeConcerns could be erroneously satisfied after a rollback occurs Created: 16/Dec/14 Updated: 26/Aug/15 Resolved: 26/Aug/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.8.0-rc2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matt Dannenberg | Assignee: | Matt Dannenberg |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
After a rollback, a node's replication progress could be behind where other nodes believe it to be, because the writeConcern code has no support for replication progress being undone. As a result, getLastError for ops that occurred between the point of divergence and the rolled-back node's latest optime could be satisfied with one too few nodes.

No test has been written for this case yet, but I believe one could be written as a copy of replsets/rollback2.js in which each node is replaced with a pair of nodes:

1. Once the to-be-rolled-back nodes have been isolated with the PRIMARY, have one of them stop applying the operations it has copied from the PRIMARY (failpoint: rsSyncApplyStop). This node will report the replication progress of the to-be-rolled-back PRIMARY but will not have the ops in its own oplog.
2. Remove the PRIMARY, bring back the two not-going-to-roll-back nodes, and wait for them to receive the replication progress of the to-be-rolled-back PRIMARY.
3. Remove the other to-be-rolled-back node.
4. After a new PRIMARY is elected, do one new write.

We should then be able to erroneously satisfy writeConcern w:3 for any optime between the point of divergence and the to-be-rolled-back nodes' last optime. |
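A rough shell sketch of the proposed test follows. It is only a sketch: the set size, node indexes, test name, and the way the partitions are orchestrated are illustrative assumptions, not details taken from replsets/rollback2.js.

```javascript
// Sketch only: outlines the steps above with the jstest ReplSetTest harness.
// The test name, node indexes, and partitioning steps are illustrative.
var replTest = new ReplSetTest({name: "rollbackWriteConcern", nodes: 4});
var nodes = replTest.startSet();
replTest.initiate();

var primary = replTest.getPrimary();

// One of the to-be-rolled-back nodes copies ops from the PRIMARY but does not
// apply them, so it reports the PRIMARY's progress without having the ops in
// its own oplog.
assert.commandWorked(nodes[1].adminCommand(
    {configureFailPoint: "rsSyncApplyStop", mode: "alwaysOn"}));

// 1. Isolate the PRIMARY and nodes[1] from the other pair, then perform writes
//    that only the to-be-rolled-back side will ever see.
// 2. Remove the PRIMARY, reconnect the two surviving nodes, and wait for them
//    to learn the to-be-rolled-back PRIMARY's reported replication progress.
// 3. Remove nodes[1], let a new PRIMARY be elected, and perform one new write.

// A getLastError with w:3 for an optime between the point of divergence and
// the to-be-rolled-back nodes' last optime could then be erroneously satisfied.
var newPrimary = replTest.getPrimary();
printjson(newPrimary.getDB("test").runCommand(
    {getLastError: 1, w: 3, wtimeout: 30000}));
```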
| Comments |
| Comment by Matt Dannenberg [ 26/Aug/15 ] |
|
This can no longer happen as a result of the j:true writeConcern work. |
| Comment by Charlie Page [ 15/Apr/15 ] |
|
asya yes, I think what we are looking for is a write as defined by consensus theory, which means a commit to durable storage. |
| Comment by Asya Kamsky [ 15/Apr/15 ] |
|
charlie.page@10gen.com it looks like … So this ticket then is about being able to differentiate timelines, which I think may be subsumed by … |
| Comment by Charlie Page [ 18/Dec/14 ] |
|
Let us consider a 3-node replica set where this has happened during a network partition. If node 2 (the rollback node) isn't allowed to participate in elections (which it can do by not responding to election RPCs) until it has caught up with a primary in its current term or a later term, then this won't happen without manual intervention. (If the max oplog time is written to disk prior to ack'ing the write, then even without journaling we can know on restart that we've rolled back state. If nothing durable is noted, then we silently roll back.)

Another (longer-term) solution is to convert the oplog to a journal format so that writes are a) only performed once, b) only compressed on disk, and c) only acknowledged when they are made durable (such that they can be applied to the database on startup).

The problem with having rollback at all (i.e. in GLE, etc.) is that there is always an issue. Rollback means reverting a state we've previously acknowledged being in (which may or may not include an actual rollback). Allowing the status to roll back still doesn't account for the situation where the client is informed between 2a and 3. The system needs to be able to make only forward progress without human intervention (the assumption being that humans then understand they are potentially causing data loss). |
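A minimal sketch of the "persist the max acked optime before ack'ing" idea above, to illustrate the restart check only; the local.ackedOpTime collection and its fields are hypothetical, not an existing server mechanism.

```javascript
// Illustration only: local.ackedOpTime and its fields are hypothetical.
function recordAckedOpTime(conn, ts) {
    // Record the highest optime we are about to acknowledge, before acking it.
    conn.getDB("local").ackedOpTime.update(
        {_id: "maxAcked"}, {$max: {ts: ts}}, {upsert: true});
}

function rolledBackSinceLastAck(conn) {
    var acked = conn.getDB("local").ackedOpTime.findOne({_id: "maxAcked"});
    if (acked === null) {
        // Nothing durable was noted, so a rollback would be silent.
        return false;
    }
    // Newest entry currently in the oplog.
    var newest = conn.getDB("local").oplog.rs
                     .find().sort({$natural: -1}).limit(1).next();
    // If the oplog now ends before something we already acknowledged, state
    // has been rolled back and the node should not proceed silently.
    return bsonWoCompare({ts: newest.ts}, {ts: acked.ts}) < 0;
}
```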
| Comment by Andy Schwerin [ 17/Dec/14 ] |
|
For this to occur, a newly elected primary must receive a request to wait for replication to reach an optime on the new timeline that falls before the newest optime on the old timeline. I wonder if composing optimes of a term and an offset could be used to distinguish this case and make the problem go away... |
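A minimal sketch of that idea, with illustrative field names: if optimes carry the election term as well as a timestamp, a stale report from the old timeline no longer compares as at-or-beyond a target optime on the new timeline.

```javascript
// Illustration only: these field names are assumptions, not the server's
// actual OpTime representation.
function compareOpTimes(a, b) {
    // Order first by election term, then by timestamp within the term, so an
    // optime from a newer term always sorts after any optime from an older one.
    if (a.term !== b.term) {
        return a.term < b.term ? -1 : 1;
    }
    return a.ts < b.ts ? -1 : (a.ts > b.ts ? 1 : 0);
}

// The rolled-back node's stale report from the old timeline (term 1) no longer
// satisfies a wait for an optime on the new timeline (term 2), even though its
// bare timestamp is larger.
var staleReport = {term: 1, ts: 100};
var waitTarget = {term: 2, ts: 50};
assert(compareOpTimes(staleReport, waitTarget) < 0);
```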