[SERVER-46678] Preserve durable history across restarts Created: 06/Mar/20  Updated: 29/Oct/23  Resolved: 08/Jan/21

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Task Priority: Major - P3
Reporter: Eric Milkie Assignee: Daniel Gottlieb (Inactive)
Resolution: Fixed Votes: 0
Labels: PM-234-M3, PM-234-T-data-clone
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on WT-6331 Set oldest timestamp on startup of Wi... Closed
is depended on by SERVER-53516 Have donor shards re-pin oldest_times... Closed
Backwards Compatibility: Fully Compatible
Sprint: Execution Team 2020-04-06, Execution Team 2020-10-05, Execution Team 2020-10-19, Sharding 2020-12-28, Execution Team 2020-11-02, Execution Team 2020-11-16, Sharding 2021-01-11, Sharding 2021-01-25
Participants:

 Description   

Today, on startup, MongoDB sets the oldest timestamp in such a way as to cause WiredTiger to remove all existing durable history in the history file. This ticket is to change that behavior such that the history in the history file is preserved on startup.



 Comments   
Comment by Githook User [ 08/Jan/21 ]

Author:

{'name': 'Daniel Gottlieb', 'email': 'daniel.gottlieb@mongodb.com', 'username': 'dgottlieb'}

Message: SERVER-46678: Utilize durable history across restarts.
Branch: master
https://github.com/mongodb/mongo/commit/ca040c1f469aa0ffd68e7a0605c10145e7fb65dc

Comment by Daniel Gottlieb (Inactive) [ 11/Nov/20 ]

I had a talk with geert.bosch and louis.williams. We believe that by sacrificing precision of the minimum visible timestamp, we have a relatively low-effort, low-risk way of starting up with legal values for the minimum visible timestamp (i.e., no new data needs to be written out, which eliminates any upgrade/downgrade work).

Today on startup, the oldest and stable timestamps are set to the checkpoint's recovery timestamp (effectively the stable timestamp as of that checkpoint), and none of the collections have a minVisibleTimestamp set. The proposed algorithm would be:

  • On startup, query the oldest timestamp (which is now populated via WT-6331)
  • Perform a read on the _mdb_catalog with a read_timestamp of the oldest timestamp.
  • Note all collection UUIDs in that read.
  • When constructing Collection objects, we will set the minimum visible timestamp to one of two values:
    • For collections whose UUID was found in a read at the oldest_timestamp, the minimum visible timestamp will remain unset. New readers will be able to do a historical read after a restart and see these collections.
    • For collections whose UUID was not found, their minimum visible timestamp will be set to the recovery timestamp (the timestamp of the checkpoint the process is starting against).
  • For indexes, the _id index's minimum visible timestamp will remain unset for collections available at the oldest_timestamp.
  • All other (secondary) indexes' minimum visible timestamps will be set to the recovery timestamp.
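
The assignment rules above can be sketched as follows. This is a minimal illustration in Python, not the server's implementation: the `Collection` dataclass, `assign_min_visible`, and the use of plain integers for timestamps are hypothetical stand-ins, and `None` models an unset minimum visible timestamp. The list does not say what happens to the _id index of a collection that was absent at the oldest timestamp; the sketch assumes it gets the recovery timestamp like the collection itself.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Set


@dataclass
class Collection:
    """Hypothetical stand-in for an in-memory Collection object."""
    uuid: str
    index_names: List[str]
    min_visible: Optional[int] = None  # None models "unset"
    index_min_visible: Dict[str, Optional[int]] = field(default_factory=dict)


def assign_min_visible(collections: Iterable[Collection],
                       uuids_at_oldest: Set[str],
                       recovery_ts: int) -> None:
    """Set minimum visible timestamps at startup.

    uuids_at_oldest: UUIDs noted when reading the catalog at the oldest timestamp.
    recovery_ts: timestamp of the checkpoint the process is starting against.
    """
    for coll in collections:
        existed_at_oldest = coll.uuid in uuids_at_oldest
        # Collections visible at the oldest timestamp stay unset so that
        # historical reads after a restart can still see them.
        coll.min_visible = None if existed_at_oldest else recovery_ts
        for name in coll.index_names:
            if name == "_id_" and existed_at_oldest:
                coll.index_min_visible[name] = None
            else:
                # Secondary indexes always get the recovery timestamp.
                # Assumption: so does the _id index of a collection that
                # was not present at the oldest timestamp.
                coll.index_min_visible[name] = recovery_ts
```

The imprecision mentioned above shows up in the secondary indexes: an index that did exist at the oldest timestamp is still treated as visible only from the recovery timestamp onward.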
Comment by Eric Milkie [ 18/Aug/20 ]

WiredTiger now preserves history across restarts, but MongoDB sets its oldest timestamp at startup in a way that clears out all preserved history. To avoid clearing preserved history at startup, MongoDB will need to change how it handles minimum visible timestamps for collections, as Max and Dan have mentioned in the above comment.

Comment by Daniel Gottlieb (Inactive) [ 18/Aug/20 ]

max.hirschhorn noticed that because the minimum visible timestamp is not preserved across restart, it may not be correct for MDB to simply use the oldest timestamp that WT provides. I think he's right – we'll need catalog versioning (or persistence of the minimum visible timestamp/index build completion times).

Comment by Vamsi Boyapati [ 07/Apr/20 ]

We discussed this earlier and it is captured in WT-5539 and WT-5679. No work has been done yet; I have scheduled WT-5679 in the current sprint.

Comment by Alexander Gorrod [ 06/Apr/20 ]

haribabu.kommi and vamsi.krishna, we talked about introducing a mechanism to facilitate this, but I don't remember how far we got. I think we were going to remember the oldest timestamp serviced by each checkpoint, and find the oldest global checkpoint. Did we do that work? If so, is there a simple way we could expose the oldest available timestamp for reads after a restart?

Comment by Daniel Gottlieb (Inactive) [ 03/Apr/20 ]

The last time I looked into this, WT did not track what a legal oldest_timestamp is across restarts. That is, a WT program can set the oldest and stable timestamps to 100, restart, set the oldest timestamp to 50, perform a read at time 50, and possibly get back wrong data instead of an error. In the absence of the application writing down the oldest timestamps it has informed WT of, the application must reset the oldest_timestamp to the restarted data's recovery timestamp (stable timestamp at shutdown).

A MongoDB-only change to preserve history across restarts would probably be of the form:

  • Each time the oldest timestamp is about to be updated:
    • Write the new value to disk
    • Inform WT of the new value

On restart, the value read from disk is guaranteed to be a valid oldest_timestamp. The corollary solution in WT would be:

  • Each time WT is about to vacuum some data out of the history store (up through potentially time T, the oldest timestamp?):
    • Write T to disk
    • Start the vacuuming process
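
Both variants rely on the same ordering: make the value durable first, then act on it. Below is a minimal Python sketch of the MongoDB-side variant. `OldestTimestampLog`, `advance_oldest`, and the `inform_engine` callback are hypothetical names for illustration, not server or WiredTiger APIs, and the single-file format is an assumption.

```python
import os
import tempfile
from typing import Callable, Optional


class OldestTimestampLog:
    """Records an oldest timestamp on disk before the engine hears of it."""

    def __init__(self, path: str) -> None:
        self.path = path

    def persist(self, ts: int) -> None:
        # Write to a temporary file, fsync it, then atomically replace the
        # old file so a crash never leaves a half-written value behind.
        # (A stricter version would also fsync the containing directory.)
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "w") as f:
                f.write(str(ts))
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, self.path)
        except BaseException:
            if os.path.exists(tmp):
                os.unlink(tmp)
            raise

    def load(self) -> Optional[int]:
        """Return the last persisted value, or None if nothing was written."""
        try:
            with open(self.path) as f:
                return int(f.read())
        except FileNotFoundError:
            return None


def advance_oldest(log: OldestTimestampLog,
                   inform_engine: Callable[[int], None],
                   ts: int) -> None:
    log.persist(ts)    # 1. write the new value to disk
    inform_engine(ts)  # 2. only then inform the storage engine
```

If the process dies between the two steps, the value on disk is newer than anything the engine was told, so the engine cannot have discarded history at or after it. That is why the persisted value is always safe to use as the oldest timestamp on restart.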

Today, MongoDB updates the oldest timestamp very frequently. I expect MongoDB updates the oldest timestamp much more frequently than WT vacuums history. I suspect the proposed MongoDB algorithm would cause problems due to excessive (albeit small) writes to disk. Alternatively, MongoDB could slow down how often it informs WT of a new oldest timestamp (this reduces the writes MongoDB makes, but limits WT's ability to batch/optimize its vacuuming process).
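
One way to picture the "slow down" alternative is a small throttle that forwards a new oldest timestamp only once it has advanced far enough past the last forwarded value. This is a hedged sketch: `ThrottledOldestTimestamp`, the `forward` callback, and the `min_advance` threshold are hypothetical, and timestamps are modeled as plain integers.

```python
from typing import Callable, Optional


class ThrottledOldestTimestamp:
    """Forwards oldest-timestamp updates only when they advance enough."""

    def __init__(self, forward: Callable[[int], None], min_advance: int) -> None:
        self._forward = forward          # e.g. persist to disk, then inform WT
        self._min_advance = min_advance  # smallest advance worth forwarding
        self._last_forwarded: Optional[int] = None

    def update(self, ts: int) -> bool:
        """Return True if ts was forwarded, False if it was skipped."""
        last = self._last_forwarded
        if last is not None and ts - last < self._min_advance:
            return False
        self._forward(ts)
        self._last_forwarded = ts
        return True
```

A larger `min_advance` means fewer disk writes, at the cost of WT learning later that older history can be vacuumed.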

alexander.gorrod is there an existing WT ticket aimed at protecting users against setting the oldest timestamp across restarts to an illegal value?
