Consider a scenario where WT is opened multiple times, as in the example sequence below, provided by michael.cahill.
Imagine this sequence:
wiredtiger_open #1
1 million txns, each appending one k/v pair to a table
conn->close
wiredtiger_open #2
1 txn, appending another record to the same table
conn->close
wiredtiger_open #3
What I'm interested in is how much work has to happen for rollback_to_stable during wiredtiger_open #3.
Imagine at #2 that there are 1000 leaf pages, 10 internal pages at depth 2, then the root page. The checkpoint on the prior close would have recorded the maximum transaction ID for its snapshot (~1,000,000). So rollback_to_stable can look at the aggregated maximum txn ID in the root page and skip doing any work.
Now think about #3: the maximum txn ID on close was ~1, but the maximum aggregated txn ID in the root page will still be ~1,000,000 (right?). Even if the right-most leaf page and internal page in the tree were rewritten, there will still be many higher txn IDs on the earlier pages.
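To make the concern concrete, here is a toy Python model of the tree described above (1000 leaf pages, 10 internal pages, one root). It is only a sketch and not WiredTiger code: the `Page` class, `build_tree`, and `pages_read` are hypothetical names, and the model assumes that (a) each page carries an aggregated maximum txn ID for its subtree, (b) a subtree is skipped only when that aggregate is no greater than the maximum txn ID recorded at the last close, and (c) a page rewritten in session #2 carries only the new txn ID.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    max_txn: int  # aggregated max txn ID for this subtree (toy model)
    children: List["Page"] = field(default_factory=list)


def build_tree(num_internal: int, leaves_per_internal: int, txns_per_leaf: int) -> Page:
    """Build root -> internal -> leaf, assigning txn IDs in append order."""
    internals = []
    next_txn = 0
    for _ in range(num_internal):
        leaves = []
        for _ in range(leaves_per_internal):
            next_txn += txns_per_leaf
            leaves.append(Page(max_txn=next_txn))
        internals.append(Page(max(l.max_txn for l in leaves), leaves))
    return Page(max(p.max_txn for p in internals), internals)


def pages_read(root: Page, recovered_max_txn: int) -> dict:
    """Count pages below the root that a txn-ID-only check has to read."""
    counts = {"internal": 0, "leaf": 0}

    def walk(page: Page, is_root: bool) -> None:
        if page.max_txn <= recovered_max_txn:
            return  # aggregate says nothing here is "in the future"
        if not page.children:
            counts["leaf"] += 1
            return
        if not is_root:
            counts["internal"] += 1
        for child in page.children:
            walk(child, False)

    walk(root, True)
    return counts


# State after session #1: 1,000,000 txns spread over 1000 leaves.
tree = build_tree(num_internal=10, leaves_per_internal=100, txns_per_leaf=1000)
open2 = pages_read(tree, recovered_max_txn=1_000_000)

# Session #2: txn IDs restart, one txn (ID 1) rewrites the right-most leaf,
# and the aggregates on its ancestors are recomputed (assumption of this model).
last_internal = tree.children[-1]
last_internal.children[-1] = Page(max_txn=1)
last_internal.max_txn = max(l.max_txn for l in last_internal.children)
tree.max_txn = max(p.max_txn for p in tree.children)
open3 = pages_read(tree, recovered_max_txn=1)

print("open #2:", open2)
print("open #3:", open3)
```

Under these assumptions the model reads no pages at open #2, but at open #3 it reads all 10 internal pages and 999 of the 1000 leaf pages, which is the I/O blow-up the question is worried about.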
How many pages does rollback_to_stable have to visit? I think it has to visit at least the internal pages under the root (because they have max txn IDs in the future). Can it then look at the write_gen on the internal pages and skip doing any txn ID checks? Or does it have to visit all the leaf pages as well?
It's probably fine to have rollback_to_stable visit internal pages, but it's worth making sure that's what we're expecting. I don't think it would be okay to visit leaf pages in this case – any quick open/close/open could then cause a huge amount of I/O.
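As a way of reasoning about the write_gen question, here is a second toy Python sketch. It is purely illustrative and says nothing about what WiredTiger actually does. The rule it models is hypothetical: after reading an internal page, skip its whole subtree if the page's write_gen predates the most recent run, and otherwise fall back to the per-child txn ID comparison. Here `write_gen` is simplified to "the number of the run that last wrote the page", and `Page`, `build_scenario`, and `count_reads` are made-up names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    max_txn: int    # aggregated max txn ID for the subtree (toy model)
    write_gen: int  # toy stand-in: number of the run that last wrote the page
    children: List["Page"] = field(default_factory=list)


def build_scenario() -> Page:
    """Tree state at wiredtiger_open #3 in the sequence above (toy model)."""
    internals = []
    txn = 0
    for _ in range(10):
        leaves = []
        for _ in range(100):
            txn += 1000
            leaves.append(Page(max_txn=txn, write_gen=1))
        internals.append(Page(max(l.max_txn for l in leaves), 1, leaves))
    root = Page(max(p.max_txn for p in internals), 1, internals)

    # Run 2: one txn (ID 1) rewrites the right-most leaf and its ancestors.
    last = internals[-1]
    last.children[-1] = Page(max_txn=1, write_gen=2)
    last.max_txn = max(l.max_txn for l in last.children)
    last.write_gen = 2
    root.max_txn = max(p.max_txn for p in internals)
    root.write_gen = 2
    return root


def count_reads(root: Page, recovered_max_txn: int, last_run_gen: int) -> dict:
    """Count page reads under the hypothetical write_gen skipping rule."""
    counts = {"internal": 0, "leaf": 0}
    if root.max_txn <= recovered_max_txn:
        return counts  # root aggregate alone rules out any work
    for internal in root.children:
        if internal.max_txn <= recovered_max_txn:
            continue
        counts["internal"] += 1  # the page must be read to see its write_gen
        if internal.write_gen < last_run_gen:
            continue  # not written in the most recent run: skip its leaves
        for leaf in internal.children:
            if leaf.max_txn > recovered_max_txn:
                counts["leaf"] += 1
    return counts


reads = count_reads(build_scenario(), recovered_max_txn=1, last_run_gen=2)
print(reads)
```

In this model the write_gen check keeps the nine untouched internal pages from fanning out to their leaves, but the one internal page rewritten in run 2 still sends the walk to its 99 untouched leaves. So whether leaf visits can be avoided entirely depends on what information is available for each child without reading it.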