[SERVER-38925] Rollback via refetch can cause _id duplication when enableMajorityReadConcern:false Created: 10/Jan/19  Updated: 29/Oct/23  Resolved: 13/Mar/19

Status: Closed
Project: Core Server
Component/s: Replication, Storage
Affects Version/s: 4.1.6
Fix Version/s: 4.1.9

Type: Task Priority: Major - P3
Reporter: William Schultz (Inactive) Assignee: Tess Avitabile (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File rollback_dup_ids_repro.js    
Issue Links:
Depends
Related
related to SERVER-37897 Disable table logging for data files ... Closed
is related to SERVER-42366 When EMRC=false we may set the stable... Closed
is related to SERVER-43356 May fail to recover after a rollbackV... Closed
is related to SERVER-45010 Clean shutdown after rollbackViaRefet... Closed
is related to SERVER-47219 Correct downgrade_after_rollback_via_... Closed
is related to SERVER-48082 WT clean shutdown should do a quick e... Closed
is related to SERVER-48518 Rollback via refetch (EMRC = false) c... Closed
is related to SERVER-45181 Rollback via refetch should set initi... Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2019-02-11, Repl 2019-02-25, Repl 2019-03-11, Repl 2019-03-25
Participants:
Linked BF Score: 66

 Description   

When enableMajorityReadConcern:false, we disable journaling of replicated tables and use stable checkpoints and the oplog for crash recovery. Stable checkpoints in this case are not guaranteed to be behind the majority commit point, so we still use the rollbackViaRefetch algorithm. Under this configuration, it is possible for us to create a stable checkpoint whose collection data has two documents with the same _id. Consider the following behavior. Assume all operations are done on the same collection, and that no checkpoints are taken other than the one explicitly forced after rollback:

  1. Insert a document {_id:1} at timestamp T=1, with RecordId=1. Let this write majority commit.
  2. Delete document {_id:1} at timestamp T=2 and assume this write doesn't majority commit.
  3. Enter rollback, where the delete operation at T=2 is the only op necessary to roll back.
  4. To undo the delete operation, rollback refetches the document and inserts a new document {_id:1} with RecordId=2.
  5. Complete rollback and set the stable timestamp to T=1, the rollback common point.
  6. Force a new stable checkpoint to be taken. This checkpoint, call it C1, is taken at timestamp T=1.
  7. Shut down uncleanly.
  8. Start up and recover from the most recent stable checkpoint, C1.

Because the checkpoint was taken at T=1, it does not include the delete at timestamp 2, so the storage engine includes the document at RecordId(1) in the checkpoint. This document is a duplicate of the document at RecordId(2), so when we crash and recover from this checkpoint, we will have two documents with the same _id in the collection.
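
The failure mode can be illustrated with a toy model. The following is a minimal Python sketch, not MongoDB code: the stable_checkpoint() helper and the op/record representation are invented, but they capture the rule that a stable checkpoint at T keeps timestamped writes with ts <= T while always keeping untimestamped writes.

    # Toy model (not MongoDB code) of a timestamped record store.
    def stable_checkpoint(ops, stable_ts):
        """Replay ops, skipping timestamped ops newer than stable_ts."""
        records = {}  # RecordId -> document
        for op in ops:
            ts = op.get("ts")                      # None means untimestamped
            if ts is not None and ts > stable_ts:
                continue                           # not part of this checkpoint
            if op["kind"] == "insert":
                records[op["rid"]] = op["doc"]
            else:                                  # delete
                records.pop(op["rid"], None)
        return records

    ops = [
        {"kind": "insert", "ts": 1, "rid": 1, "doc": {"_id": 1}},    # step 1
        {"kind": "delete", "ts": 2, "rid": 1},                       # step 2 (not majority committed)
        {"kind": "insert", "ts": None, "rid": 2, "doc": {"_id": 1}}, # step 4: untimestamped rollback insert
    ]

    print(stable_checkpoint(ops, stable_ts=1))   # forced checkpoint at the common point T=1
    # {1: {'_id': 1}, 2: {'_id': 1}}  -- two records with the same _id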



 Comments   
Comment by William Schultz (Inactive) [ 13/Jan/20 ]

Documenting the high level steps of the algorithm implemented in this ticket for future reference:

  1. Start rollback syncFixUp
  2. Set initialDataTimestamp = 0
  3. Do rollback data fix ups
  4. Set stableTimestamp = commonPoint
  5. Set initialDataTimestamp = max(localTopOfOplog, syncSourceTopOfOplog)
  6. Take an unstable checkpoint
  7. Truncate the oplog
  8. Set appliedThrough to the common point
  9. Take an unstable checkpoint
  10. Clear appliedThrough
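
A condensed sketch of that sequence is below. The function and method names are hypothetical stand-ins rather than the actual server code paths; only the ordering mirrors the steps above.

    # Illustrative only: hypothetical storage/replication interfaces.
    def rollback_via_refetch_fixup(storage, repl, common_point,
                                   local_top_of_oplog, sync_source_top_of_oplog):
        storage.set_initial_data_timestamp(0)                       # step 2
        repl.fix_up_documents_via_refetch()                         # step 3
        storage.set_stable_timestamp(common_point)                  # step 4
        storage.set_initial_data_timestamp(
            max(local_top_of_oplog, sync_source_top_of_oplog))      # step 5
        storage.take_unstable_checkpoint()                          # step 6
        repl.truncate_oplog_after(common_point)                     # step 7
        repl.set_applied_through(common_point)                      # step 8
        storage.take_unstable_checkpoint()                          # step 9
        repl.clear_applied_through()                                # step 10
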
Comment by Githook User [ 13/Mar/19 ]

Author:

{'name': 'Tess Avitabile', 'username': 'tessavitabile', 'email': 'tess.avitabile@mongodb.com'}

Message: SERVER-38925 When enableMajorityReadConcern=false, after rollback via refetch, do not take stable checkpoint until the local top of oplog from before rollback is reached
Branch: master
https://github.com/mongodb/mongo/commit/84916e817418b3b5627e80730effcd422c15696e

Comment by Michael Cahill (Inactive) [ 27/Feb/19 ]

My recollection of looking at that was that the WT glue uses unowned pointers, so we'd need to be able to safely refresh all of them.

My reason for asking is that anywhere this is problematic in MongoDB is likely to be problematic for a new WT method that has the same effect as close + reopen, regardless of how it is implemented. As an example, should cached sessions / cursors survive this operation? What are the expected semantics if some other WT method is called while it is in progress? etc.

Comment by Daniel Gottlieb (Inactive) [ 27/Feb/19 ]

Since we require exclusive access to WiredTiger for rollback_to_stable when EMRC=true, are there dangling references to WiredTiger handles that would prevent a close/reopen when it is false?

My recollection of looking at that was that the WT glue uses unowned pointers, so we'd need to be able to safely refresh all of them. Which can be done.

Building this as online functionality in WiredTiger is likely to be comparable in complexity to reimplementing rollback_to_stable (i.e., PM-715).

That seems more complex than I would have expected. I myself was comparing it to just WT-3388, which is admittedly the magic of the rollback half of the project.

Comment by Michael Cahill (Inactive) [ 27/Feb/19 ]

tess.avitabile, I think the thing you're describing above – taking an unstable checkpoint and forgetting the timestamps that were used locally – is already available in WiredTiger by closing and reopening the WT_CONNECTION handle (which causes all other WT handles to be closed).

Since we require exclusive access to WiredTiger for rollback_to_stable when EMRC=true, are there dangling references to WiredTiger handles that would prevent a close/reopen when it is false?  Building this as online functionality in WiredTiger is likely to be comparable in complexity to reimplementing rollback_to_stable (i.e., PM-715).

Comment by Tess Avitabile (Inactive) [ 26/Feb/19 ]

The setting enableMajorityReadConcern:false indicates that the server should not keep history back to the majority commit point. This setting is recommended in replica sets with arbiters, since we allow a primary to continue accepting writes when it cannot majority commit those writes, so the replica set can experience arbitrary majority commit point lag.

Comment by Alexander Gorrod [ 26/Feb/19 ]

I'm trying to catch up here, and my first naive question relates to:

Stable checkpoints in this case are not guaranteed to be behind the majority commit point

Could you remind me why we made that choice?

Comment by William Schultz (Inactive) [ 20/Feb/19 ]

Thanks for investigating this. Solution 1 is not my favorite, but if that was the only option hopefully we could make it work. Solution 2 seems a bit harsh in terms of requiring a resync if a node crashes in the recovering window, and would certainly be a regression in behavior. Solutions 3 and 4 seem more promising but as you pointed out we would need to determine their feasibility in terms of storage layer support. In thinking about your solutions one other possibility came to mind, which also may have been touched on in the past.

You brought up the issue of how to recover if a crash occurs during RECOVERING. What if, before we exit ROLLBACK, we force an unstable checkpoint but also persist a document that indicates we are entering rollback recovery? I believe this document would need to include the common point timestamp. If, after entering RECOVERING post ROLLBACK, we crash and restart, we will recover from the unstable checkpoint we took before oplog truncation. If this document is present, it tells us that it should be safe to replay all oplog entries forward from the common point timestamp to our current top of oplog. This would occur as a part of replication recovery. After this, we would re-enter the RECOVERING state and apply until we hit minValid. We would also disable the taking of any checkpoints until we hit a timestamp at which it is safe to take stable checkpoints again. I'm not sure if we are averse to persisting more information to disk, but this is a thought that came up. I think something similar may have been mentioned previously by Dan G.
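
A minimal sketch of the startup decision this proposal implies is below. Everything here is hypothetical, including the shape of the marker document; it is not actual server code.

    # Hypothetical: decide where replication recovery should start replaying
    # the oplog after recovering from the forced pre-truncation checkpoint.
    def choose_replay_start(rollback_marker, checkpoint_applied_through):
        """rollback_marker: the proposed persisted document, e.g.
        {"rollbackRecovery": True, "commonPoint": 1}, or None if absent."""
        if rollback_marker is not None:
            # We crashed during rollback recovery: per the proposal it is safe
            # to replay all oplog entries forward from the common point.
            return rollback_marker["commonPoint"]
        # Otherwise, recover as usual from the checkpoint's applied-through point.
        return checkpoint_applied_through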

As I wrote the above proposal, I also started to wonder why that final pre-truncation checkpoint needs to be an unstable checkpoint instead of a stable one. We want it to be unstable so that it includes all previous writes regardless of their timestamp, but would it be possible to forcibly tell the storage engine to take a stable checkpoint at some maximum possible timestamp? The checkpoint would read at timestamp "infinity", but perhaps we could somehow tell the system to label the checkpoint as if it were taken at the common point timestamp. This way, when entering RECOVERING, any crash and restart would recover from that stable checkpoint, and play oplog entries forward from the common point.

Comment by Tess Avitabile (Inactive) [ 20/Feb/19 ]

The solution that comes to mind is to require that SyncSourceTopOfOplog>=RollbackTopOfOplog before leaving rollback, so that we can set stable_timestamp=initialDataTimestamp.

The above solution did not make sense because we do not apply oplog entries from the source before truncating oplog and exiting ROLLBACK.

Our sequence of operations for rollback with EMRC=false in master is currently as follows (with notes on what happens if we crash and restart at each point):

  1. Roll back operations using untimestamped writes. [If we restart here, we will recover from the last stable checkpoint, which may include some of the untimestamped rollback writes, and then re-enter rollback. This works, except for the duplicate _id problem, because oplog entries and rollback operations are idempotent.]
  2. Set stable timestamp to common point and wait for stable checkpoint. [Restarting here works the same as in (1).]
  3. Truncate oplog to common point. [If we restart here, we recover from the stable checkpoint we took in (2), which includes all the rollback writes.]
  4. Enter RECOVERING and begin oplog application. Oplog entries are logged and data writes are not. [If we restart here, we recover from a stable checkpoint, reapplying oplog entries as needed.]
  5. When the lastApplied reaches minValid, enter SECONDARY. [If we restart here, we recover from a stable checkpoint.]

It is essential that restarts during RECOVERING recover from a stable checkpoint so that they can reapply oplog entries, since data writes during RECOVERING are not logged. Given this, I can think of the following solutions, some of which we have discussed:

  1. Inserts that are done as part of rollback look for documents with the same _id and perform untimestamped deletes of those documents.
  2. Set initial data timestamp to 0 and set initial sync flag when entering ROLLBACK. Before truncating oplog, wait for unstable checkpoint. Before entering RECOVERING, set stable timestamp to common point and initial data timestamp to max(minValid, RollbackTopOfOplog) and then clear initial sync flag. In this solution, if a node restarts after entering ROLLBACK and before reaching max(minValid, RollbackTopOfOplog), it will need a full resync, which is not great.
  3. Similar to (2), but do not set the initial sync flag and somehow log data writes until the stable timestamp reaches the initial data timestamp.
  4. Half-formed idea: An ideal solution would be to take an unstable checkpoint and restart before entering rollback, in order to include all writes in the checkpoint and erase history. Checkpoints could be taken during rollback (stable or unstable is equivalent, since we only do untimestamped writes). Before truncating oplog, we must set the stable timestamp to the common point and take a stable checkpoint, so that all rollback writes are durable and restarts during RECOVERING will recover from a stable checkpoint. Is it possible to emulate this behavior in the server without the restart?
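
As a rough illustration of solution (1), the rollback-time insert path might look like the sketch below. The helper names are invented, and reading at the common point to find the stale record is an assumption borrowed from the 10/Jan discussion further down this page.

    # Hypothetical sketch of solution (1): an untimestamped delete of any
    # conflicting _id before the rollback insert, so the stale record cannot
    # reappear in a checkpoint taken at the common point.
    def rollback_insert_refetched_doc(collection, doc, common_point):
        stale = collection.find_by_id_as_of(doc["_id"], read_timestamp=common_point)
        if stale is not None:
            collection.delete_record_untimestamped(stale.record_id)
        collection.insert_untimestamped(doc)
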
Comment by Tess Avitabile (Inactive) [ 13/Feb/19 ]

daniel.gottlieb, william.schultz, I've started working on this ticket, and I've encountered a problem with the design. william.schultz proposed the following algorithm:

  • Before entering ROLLBACK, set initialDataTimestamp=0.
  • After ROLLBACK, set initialDataTimestamp=max(SyncSourceTopOfOplog, RollbackTopOfOplog), where SyncSourceTopOfOplog is the timestamp of the latest entry we have fetched and applied from our rollback sync source, and RollbackTopOfOplog is the timestamp of the last oplog entry we had before rolling back.
  • Maybe also set stable_timestamp=initialDataTimestamp.

The problem I have is this set of constraints:

  • We must set stable_timestamp<=SyncSourceTopOfOplog because when we enter SECONDARY, we will do timestamped writes starting at SyncSourceTopOfOplog, and we cannot do timestamped writes behind the stable_timestamp.
  • We must set stable_timestamp=initialDataTimestamp and take a stable checkpoint upon leaving rollback. Otherwise, if we crash and recover from an unstable checkpoint once in SECONDARY, we will assume that the oplog and data match, as daniel.gottlieb described above.

Can you please weigh in on this issue? The solution that comes to mind is to require that SyncSourceTopOfOplog>=RollbackTopOfOplog before leaving rollback, so that we can set stable_timestamp=initialDataTimestamp. The downside is that we will stay in rollback for longer.
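
To make the conflict concrete with made-up numbers (illustrative only):

    # Illustrative numbers: the constraints cannot both hold when the rollback
    # node's old top of oplog is ahead of its sync source.
    sync_source_top_of_oplog = 5     # latest entry fetched/applied from the sync source
    rollback_top_of_oplog = 10       # top of oplog before rolling back

    initial_data_timestamp = max(sync_source_top_of_oplog, rollback_top_of_oplog)  # 10

    # Constraint 1: stable_timestamp <= sync_source_top_of_oplog   (<= 5)
    # Constraint 2: stable_timestamp == initial_data_timestamp     (== 10)
    assert initial_data_timestamp > sync_source_top_of_oplog   # hence the conflict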

Comment by Daniel Gottlieb (Inactive) [ 11/Jan/19 ]

Though there's still a concern

I think I was being a little silly/paranoid. The rollback via refetch algorithm already forces a checkpoint before coming out of rollback (i.e., before truncating the oplog entries after the common point, which removes the evidence that the node has to go through rollback).

I suspect the following sequence is sufficient to alleviate the problem. Let T be the timestamp of the sync source the rollback node had to apply up to in order to get a consistent view of data:

  1. Advance the initial data timestamp to T
  2. Advance the stable timestamp to T (currently the code sets the stable point to the common point)
  3. waitUntilUnjournaledWritesDurable (already exists)
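
A minimal sketch of that sequence, with hypothetical wrapper names (waitUntilUnjournaledWritesDurable is the existing call referenced in step 3; the others are stand-ins):

    # Illustrative only: T is the sync source timestamp the rollback node had
    # to apply up to in order to get a consistent view of data.
    def finish_rollback(storage, T):
        storage.set_initial_data_timestamp(T)             # 1. advance the initial data timestamp
        storage.set_stable_timestamp(T)                   # 2. instead of the common point
        storage.wait_until_unjournaled_writes_durable()   # 3. make the writes durable
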
Comment by Daniel Gottlieb (Inactive) [ 10/Jan/19 ]

I see the rollback process as similar to initial sync. In initial sync, the algorithm fetches a bunch of documents from a sync source and writes them locally. The versions of those documents (with respect to the oplog history) are otherwise unrelated. Thus, there's a second phase in initial sync, where the syncing node replays oplog up to the sync source's current optime.

In the case of initial sync, the storage engine is taking "unstable checkpoints". After the node comes out of initial sync, there's a lastApplied that denotes when the data is consistent.

I think making the analogous calls to setInitialDataTimestamp around rollback via refetch will solve the data inconsistency described in this ticket.

Though there's still a concern that william.schultz and I talked about. If a node crashes after completing initial sync, but before taking its first "stable checkpoint", its state when restarting will include oplog entries that are not applied to the data, because the oplog is logged in WT, but the data is not. For nodes that come up on data from an "unstable checkpoint", the implied state is that the oplog and the data match. This assumption is made by the upgrade code path between 3.6 and 4.0 binaries. I believe that assumption is also made by the 4.0 -> 4.2 upgrade when the 4.0 node was running with majority reads off.

In the case of crashing after initial sync, the minValid document will show that the "initial sync flag" is still set, because clearing that flag is a timestamped write at the last applied time. With regards to rollback via refetch, if the existing "min valid" logic does not spare us here, we may need some corollary field or alternative algorithm for knowing that there are unapplied oplog entries.

Comment by William Schultz (Inactive) [ 10/Jan/19 ]

tess.avitabile While that may fix this case, the tricky part is that at the time of doing the rollback, it's unclear to me how we would actually find the document stored at RecordId=1. It has been deleted and so there should be no index entry for it. Presumably there is some way to scan the record store directly to look for a document with the same _id, and clean it up if found. The record store thinks the record has been deleted, though. The storage engine should have history back to the original insert, so perhaps if we read at the common point timestamp to look for records that have since been deleted, we would be able to recover the proper RecordId.

Comment by Tess Avitabile (Inactive) [ 10/Jan/19 ]

Maybe we need an untimestamped write to the document with RecordId=1 as part of the rollback-via-refetch, so that the delete gets included in the checkpoint.

daniel.gottlieb

Comment by William Schultz (Inactive) [ 10/Jan/19 ]

The writes done in rollbackViaRefetch should be un-timestamped. Even if we did timestamp them, though, with some "appropriate" timestamp, I don't think it's enough to fix this problem. In the scenario described above, if we gave the "fix up" insert write a timestamp of 1 (the common point), for example, that write would still be included in the checkpoint taken after rollback, so we'd still have the _id duplication issue. I feel that the issue is more related to the fact that we write to two distinct record ids, as opposed to a problem with how we timestamp writes. I have not settled on a conclusive way to fix this problem yet, though.
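
Reusing the toy stable_checkpoint() model from the description above (illustrative only), giving the fix-up insert the common-point timestamp instead of leaving it untimestamped still produces the duplicate:

    ops = [
        {"kind": "insert", "ts": 1, "rid": 1, "doc": {"_id": 1}},
        {"kind": "delete", "ts": 2, "rid": 1},
        {"kind": "insert", "ts": 1, "rid": 2, "doc": {"_id": 1}},  # fix-up insert timestamped at the common point
    ]
    print(stable_checkpoint(ops, stable_ts=1))
    # {1: {'_id': 1}, 2: {'_id': 1}}  -- still two records with the same _id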

Comment by Judah Schvimer [ 10/Jan/19 ]

Are the writes in rollback via refetch given a timestamp? I'm not exactly sure what timestamp we'd want to give them, maybe the common point?
