- Type: Bug
- Resolution: Fixed
- Priority: Critical - P2
- Affects Version/s: None
- Component/s: None
- v5.3
/* Ignore prepared updates if it is checkpoint. */
if (upd->prepare_state == WT_PREPARE_LOCKED ||
  upd->prepare_state == WT_PREPARE_INPROGRESS) {
    WT_ASSERT(session, upd_select->upd == NULL || upd_select->upd->txnid == upd->txnid);
    if (F_ISSET(r, WT_REC_CHECKPOINT)) {
        has_newer_updates = true;
        if (upd->start_ts > max_ts)
            max_ts = upd->start_ts;

        /*
         * Track the oldest update not on the page, used to decide whether reads can use the
         * page image, hence using the start rather than the durable timestamp.
         */
        if (upd->start_ts < r->min_skipped_ts)
            r->min_skipped_ts = upd->start_ts;
        continue;
    } else {
        /*
         * For prepared updates written to the data store in salvage, we write the same
         * prepared value to the data store. If there is still content for that key left in
         * the history store, rollback to stable will bring it back to the data store.
         * Otherwise, it removes the key.
         */
        WT_ASSERT(session,
          F_ISSET(r, WT_REC_EVICT) ||
            (F_ISSET(r, WT_REC_VISIBILITY_ERR) &&
              F_ISSET(upd, WT_UPDATE_PREPARE_RESTORED_FROM_DS)));
        WT_ASSERT(session, upd->prepare_state == WT_PREPARE_INPROGRESS);
    }
With the current implementation, checkpoint may see partially resolved prepared updates on the same key and write them to disk.
The detailed scenario is as follows:
Suppose we have two prepared updates on the same key, giving the update chain U_prepared2@10 -> U_prepared1@10.
Checkpoint starts.
We commit the prepared transaction and resolve U_prepared2 to U_committed@11_durable@12.
A context switch happens before U_prepared1 is resolved, leaving U_committed@11_durable@12 -> U_prepared1@10 on the update chain.
Checkpoint reaches the page, sees U_committed@11_durable@12, and decides to write it to the disk image.
Checkpoint then sees U_prepared1@10 and sets has_newer_updates to true, but never unsets the update already selected for writing to disk (U_committed@11_durable@12).
As a result, we write U_committed@11_durable@12 to the data store and U_prepared1@10 to the history store, which is wrong.
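The scenario can be reproduced in a small standalone model. This is only a sketch: the types `struct update` and `struct selection`, the function `checkpoint_select`, and the `reset_on_prepared` flag are hypothetical simplifications, not WiredTiger's actual structures or the committed fix. It walks a chain from newest to oldest and shows that, unless the selection is cleared when a still-prepared update is encountered, the partially resolved commit remains selected for the disk image.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the real update structures. */
enum prepare_state { PREPARE_INIT, PREPARE_INPROGRESS, PREPARE_RESOLVED };

struct update {
    uint64_t txnid;
    uint64_t start_ts;
    uint64_t durable_ts;
    enum prepare_state prepare_state;
    struct update *next; /* Next older update on the chain. */
};

struct selection {
    const struct update *upd; /* Update chosen for the disk image, if any. */
    bool has_newer_updates;   /* Set when an update was skipped. */
};

/*
 * Walk the chain from newest to oldest, as a checkpoint would. With
 * reset_on_prepared == false the walk mimics the reported behavior: a
 * prepared update is skipped, but an update selected earlier in the walk
 * stays selected. With reset_on_prepared == true (an assumed remedy), the
 * earlier selection is dropped when a prepared update is seen.
 */
static struct selection
checkpoint_select(const struct update *chain, bool reset_on_prepared)
{
    struct selection sel = {NULL, false};
    const struct update *upd;

    assert(chain != NULL);
    for (upd = chain; upd != NULL; upd = upd->next) {
        if (upd->prepare_state == PREPARE_INPROGRESS) {
            sel.has_newer_updates = true;
            if (reset_on_prepared)
                sel.upd = NULL; /* Do not write a partially resolved commit. */
            continue;
        }
        /* The newest non-prepared update wins. */
        if (sel.upd == NULL)
            sel.upd = upd;
    }
    return sel;
}
```

Building the chain U_committed@11_durable@12 -> U_prepared1@10 with both updates sharing one transaction ID and calling `checkpoint_select` with `reset_on_prepared` set to false selects the committed update while also flagging newer updates, which is the inconsistent state described above. With the flag set to true, nothing from the partially resolved transaction is selected.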