[SERVER-38925] Rollback via refetch can cause _id duplication when enableMajorityReadConcern:false Created: 10/Jan/19 Updated: 29/Oct/23 Resolved: 13/Mar/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Storage |
| Affects Version/s: | 4.1.6 |
| Fix Version/s: | 4.1.9 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | William Schultz (Inactive) | Assignee: | Tess Avitabile (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Repl 2019-02-11, Repl 2019-02-25, Repl 2019-03-11, Repl 2019-03-25 |
| Participants: | |
| Linked BF Score: | 66 |
| Description |
|
When enableMajorityReadConcern:false, we disable journaling of replicated tables and use stable checkpoints and the oplog for crash recovery. Stable checkpoints in this case are not guaranteed to be behind the majority commit point, so we still use the rollbackViaRefetch algorithm. Under this configuration, it is possible for us to create a stable checkpoint whose collection data has two documents with the same _id. Consider the following behavior. Assume all operations are done on the same collection, and that no checkpoints are taken other than the one explicitly forced after rollback:

1. A document is inserted at timestamp 1 and is stored at RecordId(1). The stable timestamp is 1.
2. The document is deleted at timestamp 2.
3. The node goes into rollback. The delete at timestamp 2 is after the common point (timestamp 1), so rollbackViaRefetch refetches the document from the sync source and re-inserts it with an untimestamped write, which is stored at a new RecordId(2).
4. Rollback forces a checkpoint, which is taken at the stable timestamp, T=1.

Because the checkpoint was taken at T=1, it does not include the delete at timestamp 2, so the storage engine includes the document at RecordId(1) in the checkpoint. This document is a duplicate of the document at RecordId(2), so when we crash and recover from this checkpoint, we will have two documents with the same _id in the collection. |
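For illustration, here is a minimal WiredTiger-level sketch of that sequence (this is not MongoDB code; the table name, key values, and document value are made up, and error handling is omitted). It shows how an untimestamped re-insert plus a checkpoint taken at stable_timestamp=1 persists two records carrying the same _id:

```c
/* Build against the WiredTiger library; assumes the directory ./wt_demo exists. */
#include <wiredtiger.h>

int main(void) {
    WT_CONNECTION *conn;
    WT_SESSION *session;
    WT_CURSOR *cursor;

    /* WT logging (journaling) is off by default here, mirroring replicated
     * tables when enableMajorityReadConcern:false. */
    wiredtiger_open("wt_demo", NULL, "create", &conn);
    conn->open_session(conn, NULL, NULL, &session);
    session->create(session, "table:coll", "key_format=q,value_format=S");
    session->open_cursor(session, "table:coll", NULL, NULL, &cursor);

    /* Timestamp 1: the original insert, stored at "RecordId(1)". */
    session->begin_transaction(session, NULL);
    cursor->set_key(cursor, (int64_t)1);
    cursor->set_value(cursor, "{_id: X}");
    cursor->insert(cursor);
    session->commit_transaction(session, "commit_timestamp=1");

    /* The stable timestamp is 1 (the common point). */
    conn->set_timestamp(conn, "oldest_timestamp=1,stable_timestamp=1");

    /* Timestamp 2: the delete of RecordId(1), ahead of the stable timestamp. */
    session->begin_transaction(session, NULL);
    cursor->set_key(cursor, (int64_t)1);
    cursor->remove(cursor);
    session->commit_transaction(session, "commit_timestamp=2");

    /* Rollback-via-refetch re-inserts the refetched document with an
     * untimestamped write, stored at a new "RecordId(2)". */
    session->begin_transaction(session, NULL);
    cursor->set_key(cursor, (int64_t)2);
    cursor->set_value(cursor, "{_id: X}");
    cursor->insert(cursor);
    session->commit_transaction(session, NULL);

    /* The forced checkpoint is taken at the stable timestamp (1): it excludes
     * the delete at timestamp 2 but includes the untimestamped insert, so both
     * records carrying _id X are persisted. A crash and restart recovers both. */
    session->checkpoint(session, "use_timestamp=true");

    return conn->close(conn, NULL);
}
```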
| Comments |
| Comment by William Schultz (Inactive) [ 13/Jan/20 ] |
|
Documenting the high level steps of the algorithm implemented in this ticket for future reference:
|
| Comment by Githook User [ 13/Mar/19 ] |
|
Author: Tess Avitabile (tessavitabile) <tess.avitabile@mongodb.com>
Message: |
| Comment by Michael Cahill (Inactive) [ 27/Feb/19 ] |
My reason for asking is that anywhere this is problematic in MongoDB is likely to be problematic for a new WT method that has the same effect as close + reopen, regardless of how that method is implemented. For example: should cached sessions/cursors survive this operation? What are the expected semantics if some other WT method is called while it is in progress? And so on. |
| Comment by Daniel Gottlieb (Inactive) [ 27/Feb/19 ] |
My recollection from looking at that is that the WT glue layer uses unowned pointers, so we'd need to be able to safely refresh all of them, which can be done.
That seems more complex than I would have expected. I myself was comparing it to just |
| Comment by Michael Cahill (Inactive) [ 27/Feb/19 ] |
|
tess.avitabile, I think the thing you're describing above – taking an unstable checkpoint and forgetting the timestamps that were used locally – is already available in WiredTiger by closing and reopening the WT_CONNECTION handle (which causes all other WT handles to be closed). Since we require exclusive access to WiredTiger for rollback_to_stable when EMRC=true, are there dangling references to WiredTiger handles that would prevent a close/reopen when it is false? Building this as online functionality in WiredTiger is likely to be comparable in complexity to reimplementing rollback_to_stable (i.e., PM-715). |
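As a rough sketch of the close/reopen approach described here, assuming every cached session and cursor has been released by the layers above (the helper name and config string are illustrative, not an existing MongoDB or WiredTiger function):

```c
#include <wiredtiger.h>

/* Hypothetical helper: discard WiredTiger's in-memory state, including the
 * oldest/stable timestamps set via WT_CONNECTION::set_timestamp, by closing
 * and reopening the connection. Closing also closes any sessions and cursors
 * still open on it, so any pointers to them held above the storage layer
 * become dangling; that is the concern raised in this comment. */
static int wt_close_and_reopen(WT_CONNECTION **connp, const char *home) {
    int ret;

    if ((ret = (*connp)->close(*connp, NULL)) != 0)
        return ret;
    *connp = NULL;

    /* Reopening runs normal startup recovery from the last checkpoint. */
    return wiredtiger_open(home, NULL, "create", connp);
}
```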
| Comment by Tess Avitabile (Inactive) [ 26/Feb/19 ] |
|
The setting enableMajorityReadConcern:false indicates that the server should not keep history back to the majority commit point. This setting is recommended in replica sets with arbiters, since we allow a primary to continue accepting writes when it cannot majority commit those writes, so the replica set can experience arbitrary majority commit point lag. |
| Comment by Alexander Gorrod [ 26/Feb/19 ] |
|
I'm trying to catch up here, and my first naive question relates to:
Could you remind me why we made that choice? |
| Comment by William Schultz (Inactive) [ 20/Feb/19 ] |
|
Thanks for investigating this. Solution 1 is not my favorite, but if that were the only option, hopefully we could make it work. Solution 2 seems a bit harsh in terms of requiring a resync if a node crashes in the recovering window, and would certainly be a regression in behavior. Solutions 3 and 4 seem more promising, but as you pointed out, we would need to determine their feasibility in terms of storage layer support.

In thinking about your solutions, one other possibility came to mind, which may also have been touched on in the past. You brought up the issue of how to recover if a crash occurs during RECOVERING. What if, before we exit ROLLBACK, we force an unstable checkpoint, but also persist a document that indicates we are entering rollback recovery? I believe this document would need to include the common point timestamp. If, after entering RECOVERING post ROLLBACK, we crash and restart, we will recover from the unstable checkpoint we took before oplog truncation. If we see that this document is present, it tells us that it should be safe to replay all oplog entries forward from the common point timestamp to our current top of oplog. This would occur as part of replication recovery. After this, we would re-enter the RECOVERING state and apply until we hit minValid. We would also disable the taking of any checkpoints until we reach a timestamp at which it is safe to take stable checkpoints again. I'm not sure whether we are averse to persisting more information to disk, but this is a thought that came up. I think something similar may have been mentioned previously by Dan G.

As I wrote the above proposal, I also started to wonder why that final pre-truncation checkpoint needs to be an unstable checkpoint instead of a stable one. We want it to be unstable so that it includes all previous writes regardless of their timestamp, but would it be possible to forcibly tell the storage engine to take a stable checkpoint at some maximum possible timestamp? The checkpoint would read at timestamp "infinity", but perhaps we could somehow tell the system to label the checkpoint as if it were taken at the common point timestamp. This way, when entering RECOVERING, any crash and restart would recover from that stable checkpoint and play oplog entries forward from the common point. |
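For reference, at the WiredTiger API level the stable and unstable checkpoints discussed above differ only in the checkpoint configuration string; a minimal sketch, assuming an already-open WT_SESSION:

```c
#include <wiredtiger.h>

/* Sketch: the two checkpoint flavors discussed in this thread. */
static void take_checkpoints(WT_SESSION *session) {
    /* Stable checkpoint: persist data as of the current stable timestamp. */
    session->checkpoint(session, "use_timestamp=true");

    /* Unstable checkpoint: ignore the stable timestamp and persist all
     * committed writes, regardless of their commit timestamps. */
    session->checkpoint(session, "use_timestamp=false");
}
```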
| Comment by Tess Avitabile (Inactive) [ 20/Feb/19 ] |
The above solution did not make sense because we do not apply oplog entries from the source before truncating oplog and exiting ROLLBACK. Our sequence of operations for rollback with EMRC=false in master is currently as follows (with notes on what happens if we crash and restart at each point):
It is essential that restarts during RECOVERING recover from a stable checkpoint so that they can reapply oplog entries, since data writes during RECOVERING are not logged. Given this, I can think of the following solutions, some of which we have discussed:
|
| Comment by Tess Avitabile (Inactive) [ 13/Feb/19 ] |
|
daniel.gottlieb, william.schultz, I've started working on this ticket, and I've encountered a problem with the design. william.schultz proposed the following algorithm:
The problem I have is this set of constraints:
Can you please weigh in on this issue? The solution that comes to mind is to require that SyncSourceTopOfOplog>=RollbackTopOfOplog before leaving rollback, so that we can set stable_timestamp=initialDataTimestamp. The downside is that we will stay in rollback for longer. |
| Comment by Daniel Gottlieb (Inactive) [ 11/Jan/19 ] |
I think I was being a little silly/paranoid. The rollback via refetch algorithm already forces a checkpoint before coming out of rollback (truncating the oplog entries after the common point/removing the evidence the node has to go through rollback). I suspect the following sequence is sufficient to alleviate the problem. Let T be the timestamp of the sync source the rollback node had to apply up to in order to get a consistent view of data:
|
| Comment by Daniel Gottlieb (Inactive) [ 10/Jan/19 ] |
|
I see the rollback process as similar to initial sync. In initial sync, the algorithm fetches a bunch of documents from a sync source and writes them locally. The versions of those documents (with respect to the oplog history) are otherwise unrelated. Thus, there's a second phase in initial sync, where the syncing node replays the oplog up to the sync source's current optime. In the case of initial sync, the storage engine is taking "unstable checkpoints". After the node comes out of initial sync, there's a lastApplied that denotes when the data is consistent. I think making the analogous calls to setInitialDataTimestamp around rollback via refetch will solve the data inconsistency described in this ticket.

There is still a concern that william.schultz and I talked about, though. If a node crashes after completing initial sync, but before taking its first "stable checkpoint", its state when restarting will include oplog entries that are not applied to the data, because the oplog is logged in WT, but the data is not. For nodes that come up on data from an "unstable checkpoint", the implied state is that the oplog and the data match. This assumption is made for the upgrade code path between 3.6 and 4.0 binaries. I believe that assumption is also made by the 4.0 -> 4.2 upgrade when the 4.0 node was running with majority reads off.

In the case of crashing after initial sync, the minValid document will show that the "initial sync flag" is still set, because clearing that flag is a timestamped write at the last applied time. With regards to rollback via refetch, in the absence of the existing "min valid" logic sparing us here, we may need some corollary field or alternative algorithm for knowing that there are unapplied oplog entries. |
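A small sketch of the logging split described here, assuming connection-level logging is enabled and using made-up table names and formats: the oplog table is write-ahead logged while replicated collection tables are not, which is why recovered collection data can be older than the recovered oplog after a crash.

```c
#include <wiredtiger.h>

int main(void) {
    WT_CONNECTION *conn;
    WT_SESSION *session;

    /* Connection-level logging must be on for any table to be logged. */
    wiredtiger_open("wt_home", NULL, "create,log=(enabled=true)", &conn);
    conn->open_session(conn, NULL, NULL, &session);

    /* The oplog is logged: its writes are replayed from the log on restart. */
    session->create(session, "table:oplog", "key_format=q,value_format=u,log=(enabled=true)");

    /* Replicated collections are not logged: on restart they come back only
     * as of the last checkpoint, which may be behind the recovered oplog. */
    session->create(session, "table:coll", "key_format=q,value_format=u,log=(enabled=false)");

    return conn->close(conn, NULL);
}
```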
| Comment by William Schultz (Inactive) [ 10/Jan/19 ] |
|
tess.avitabile, while that may fix this case, the tricky part is that at the time of doing the rollback, it's unclear to me how we would actually find the document stored at RecordId=1. It has been deleted, so there should be no index entry for it. Presumably there is some way to scan the record store directly to look for a document with the same _id and clean it up if found. The record store thinks the record has been deleted, though. The storage engine should have history back to the original insert, so perhaps if we read at the common point timestamp to look for records that have since been deleted, we would be able to recover the proper RecordId. |
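A minimal sketch of that last idea at the WiredTiger level, assuming a table created with key_format=q,value_format=u; the table name, helper name, and matching logic are illustrative, and this is not the server's RecordStore API:

```c
#include <inttypes.h>
#include <stdio.h>
#include <wiredtiger.h>

/* Hypothetical scan: visit every record visible as of the common point, so
 * records deleted after that point (e.g. RecordId 1) are still visible. */
static int scan_at_common_point(WT_SESSION *session, uint64_t common_point) {
    WT_CURSOR *cursor;
    char cfg[64];
    int ret;

    /* WiredTiger timestamp configuration values are hexadecimal strings. */
    snprintf(cfg, sizeof(cfg), "read_timestamp=%" PRIx64, common_point);

    if ((ret = session->open_cursor(session, "table:coll", NULL, NULL, &cursor)) != 0)
        return ret;
    if ((ret = session->begin_transaction(session, cfg)) != 0) {
        cursor->close(cursor);
        return ret;
    }

    int64_t record_id;
    WT_ITEM doc;
    while ((ret = cursor->next(cursor)) == 0) {
        cursor->get_key(cursor, &record_id);
        cursor->get_value(cursor, &doc);
        /* ... decode the document in doc and compare its _id with the
         * refetched document; a match recovers the "hidden" RecordId ... */
    }

    session->rollback_transaction(session, NULL);
    cursor->close(cursor);
    return ret == WT_NOTFOUND ? 0 : ret;
}
```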
| Comment by Tess Avitabile (Inactive) [ 10/Jan/19 ] |
|
Maybe we need an untimestamped write to the document with RecordId=1 as part of the rollback-via-refetch, so that the delete gets included in the checkpoint. |
| Comment by William Schultz (Inactive) [ 10/Jan/19 ] |
|
The writes done in rollbackViaRefetch should be untimestamped. Even if we did timestamp them with some "appropriate" timestamp, though, I don't think that would be enough to fix this problem. In the scenario described above, if we gave the "fix up" insert write a timestamp of 1 (the common point), for example, that write would still be included in the checkpoint taken after rollback, so we'd still have the _id duplication issue. I feel that the issue is more related to the fact that we write to two distinct record ids than to how we timestamp writes. I have not settled on a conclusive way to fix this problem yet, though. |
| Comment by Judah Schvimer [ 10/Jan/19 ] |
|
Are the writes in rollback via refetch given a timestamp? I'm not exactly sure what timestamp we'd want to give them; maybe the common point? |