This is something Jie Chen noticed while working on enhancing the wt dump command.
At the moment, in __wt_find_hs_upd, once we position the cursor on a record, we check visibility using the start time pair embedded in the history store key (not the one on the cell itself). If this history store content was written in a previous run, the transaction id in the key is no longer valid, so it's possible that we fail this visibility check and ignore a record that we should return.
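A toy model may help illustrate the failure mode. This is not WiredTiger code: `Snapshot` and `txn_visible` are hypothetical names, the visibility rule is a simplified snapshot check, and the sketch assumes that transaction ids restart from a low value when the connection is reopened.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical, simplified reader snapshot (not a WiredTiger structure)."""
    snap_max: int  # ids >= snap_max started after the snapshot was taken
    concurrent: frozenset = field(default_factory=frozenset)  # ids still running


def txn_visible(snapshot: Snapshot, txn_id: int) -> bool:
    """Simplified rule: visible only if committed before the snapshot."""
    if txn_id >= snapshot.snap_max:
        return False
    return txn_id not in snapshot.concurrent


# Assumption: the key was written by transaction id 11 in the previous run,
# and the id allocator has only reached 3 in the current run.
stale_key_txn_id = 11
reader = Snapshot(snap_max=3)

print(txn_visible(reader, stale_key_txn_id))  # False: the old record is skipped
print(txn_visible(reader, 1))                 # True: an id from the current run
```

Under these assumptions, the stale id looks like a transaction from the future, so the record is treated as invisible even though it was committed before the restart.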
The obvious solution is just to remove this check and ignore the time pairs embedded in the key. Since we set the stop and start time pairs of each history store entry, the fact that we're able to position the cursor on a given key tells us that it is visible to us. However, it means that history store keys will have stale transaction ids, which has some implications. Imagine this scenario for a key:
- Transaction ids 10, 11 and 12 all insert an update at timestamp 5.
- Evict. Id 12's update gets written to the page and ids 11 and 10 get written to the history store.
- Reopen the connection.
- Transaction id 1 inserts an update at timestamp 5 and at timestamp 6.
- Evict. Id 1's update at timestamp 6 gets written to disk and its update at timestamp 5 gets written to the history store. The current on-disk value (transaction id 12) will get written to the history store too, but its transaction ids will get overwritten as appropriate.
- Now, for any search, the order in which we traverse the history store will be: (txnid=11, ts=5), (txnid=10, ts=5), (txnid=1, ts=5), (txnid=0, ts=5)
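The scenario above can be reproduced with a small model. This is a sketch, not WiredTiger code: `HsEntry` and `write_seq` are hypothetical, the sort key (newest timestamp first, then highest transaction id first) is an assumption chosen to match the traversal order listed above, and the entry with txnid=0 is assumed to be the old on-disk value from transaction id 12 after its id was overwritten.

```python
from typing import NamedTuple


class HsEntry(NamedTuple):
    txnid: int      # transaction id stored in the history store key
    ts: int         # start timestamp stored in the history store key
    write_seq: int  # hypothetical: actual order in which the update was made


entries = [
    HsEntry(txnid=10, ts=5, write_seq=1),  # first run
    HsEntry(txnid=11, ts=5, write_seq=2),  # first run
    HsEntry(txnid=0, ts=5, write_seq=3),   # first run, originally id 12
    HsEntry(txnid=1, ts=5, write_seq=4),   # second run
]


def traversal_order(es):
    # Assumed key ordering: newest timestamp first, then highest txnid first.
    return sorted(es, key=lambda e: (e.ts, e.txnid), reverse=True)


def chronological_order(es):
    # Newest update first, based on when the update was actually made.
    return sorted(es, key=lambda e: e.write_seq, reverse=True)


print([e.txnid for e in traversal_order(entries)])      # [11, 10, 1, 0]
print([e.txnid for e in chronological_order(entries)])  # [1, 0, 11, 10]
```

In this model, the two updates from the previous run with stale ids (11 and 10) are visited before the newer updates, so the traversal order no longer matches the order in which the updates were made.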